flairNLP / fundus

A very simple news crawler with a funny name

Add new publisher “The Mirror” #466

Closed: TingC99 closed this issue 1 week ago

TingC99 commented 3 weeks ago

Hey, I'm trying to add a new publisher, "The Mirror", but I'm having trouble running this part: `python -m scripts.generate_parser_test_files -p TheMirror`. It keeps showing 0%, as if it's stuck and not making any progress.

TheMirror: 0%| | 0/1 [00:00<?, ?it/s]

Does anyone have any good ideas?

addie9800 commented 3 weeks ago

Hi, thanks for adding The Mirror :). From what I can see, you are still missing a function to extract the topics. The Mirror seems to support the meta tags keywords and news_keywords. The publishing date is not extracted either; you can save yourself some trouble here and use the meta tag parsely-pub-date. Furthermore, the actual content does not seem to be extracted.

The script won't run properly because it is set up to fetch an article that has all the attributes you implemented. If, for example, the publishing date is missing, it skips that article and tries the next one. That has been happening over and over for you. I would recommend running this script


from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.uk.TheMirror)

for article in crawler.crawl(max_articles=30, only_complete=False):
    print(article.title)
    print(article.html.responded_url)
    print(article.publishing_date)
    print(article.authors)
    print(article.topics)
    print(article.plaintext)
    print("------ New Article ------:\n")

to verify your progress.
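To check which meta tags a page actually exposes before wiring them into the parser, something like the following could help. It is only a rough, standard-library sketch and not how fundus reads meta tags internally: `MetaCollector`, `extract_topics_and_date`, and the sample HTML are made up for illustration, and only the tag names (keywords, news_keywords, parsely-pub-date) come from the comment above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Hypothetical helper: collect <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def extract_topics_and_date(html):
    """Return (topics, raw publishing date string) found in the meta tags."""
    collector = MetaCollector()
    collector.feed(html)
    # Prefer "keywords" and fall back to "news_keywords" if it is absent.
    raw = collector.meta.get("keywords") or collector.meta.get("news_keywords") or ""
    topics = [topic.strip() for topic in raw.split(",") if topic.strip()]
    return topics, collector.meta.get("parsely-pub-date")


# Made-up sample page, not real content from The Mirror.
sample_html = """
<html><head>
<meta name="keywords" content="Politics, UK News">
<meta name="parsely-pub-date" content="2024-04-28T13:00:00Z">
</head><body></body></html>
"""

topics, published = extract_topics_and_date(sample_html)
print(topics)
print(published)
```

If the date comes back as `None` or the topic list is empty for a real article, the page most likely does not carry that tag and a different source for the attribute is needed.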

TingC99 commented 2 weeks ago

Thank you for your advice. I have added the topics and the publishing date. But I also encountered this error: `ValueError: Invalid isoformat string: '2024-04-28T13:00:00Z'`

But theoretically this should be in ISO 8601 format.

addie9800 commented 2 weeks ago

So, as it turns out, I misunderstood you. The parsing error occurs when running pytest and generating the test files, not when running fundus itself. The reason this was happening is that we were using datetime.datetime.fromisoformat() in the backend, which did not accept the trailing Z that JavaScript adds by default to mark a timestamp as UTC. Using dateutil, this problem is solved. I took the liberty of making some minor changes as well :) Thanks a lot for adding this.
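For anyone hitting the same error elsewhere: on Python versions before 3.11, `datetime.fromisoformat()` rejects the `Z` suffix, while newer versions accept it. The sketch below is a standard-library workaround rather than the change made in fundus (which switched to dateutil); `parse_iso_utc` is a hypothetical helper name, and it assumes the trailing `Z` is the only part of the string that `fromisoformat()` cannot handle.

```python
from datetime import datetime, timezone


def parse_iso_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, tolerating a trailing 'Z' for UTC."""
    # Assumption: "Z" is the only non-portable part of the input.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


parsed = parse_iso_utc("2024-04-28T13:00:00Z")
print(parsed.isoformat())  # 2024-04-28T13:00:00+00:00
```

A general-purpose parser such as `dateutil.parser.parse` avoids this kind of special casing, which is presumably why it is the more convenient choice for test data coming from many publishers.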

MaxDall commented 2 weeks ago

@addie9800 It seems like someone added the date by hand; otherwise I can't think of how the Z ended up there. I overwrote the test case.

addie9800 commented 2 weeks ago

Yeah, you're right. That was sort of weird. This closes #505, though; it seemed a lot like that was the cause.