Closed TingC99 closed 1 week ago
Hi, thanks for adding the Mirror :). From what I can see you are still missing a function to extract the topics. The Mirror seems to support the meta tags `keywords` and `news_keywords`. Also, the publishing date is not extracted. For this you can save yourself some trouble and use the meta tag `parsely-pub-date`. Furthermore, the actual content also does not seem to be extracted. The script won't run properly because it's set up to get an article that has all the attributes you implemented. If, for example, the publishing date is missing, it will skip that article and try the next one. That has been happening over and over for you. I would recommend running this script
from fundus import Crawler, PublisherCollection

# only_complete=False also yields articles that are missing attributes
crawler = Crawler(PublisherCollection.uk.TheMirror)
for article in crawler.crawl(max_articles=30, only_complete=False):
    print(article.title)
    print(article.html.responded_url)
    print(article.publishing_date)
    print(article.authors)
    print(article.topics)
    print(article.plaintext)
    print("------ New Article ------:\n")
to verify your progress.
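To make the meta tag suggestion above a bit more concrete, here is a minimal sketch that pulls `keywords`, `news_keywords` and `parsely-pub-date` out of a page using only the standard library. It is not the fundus parser API: the `MetaTagCollector` class and the sample HTML, including its values, are made up for illustration, and in a real parser you would use whatever helpers the project already provides.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Hypothetical helper: collects <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Some sites use "property" instead of "name" for the key.
        name = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


# Made-up sample page; real Mirror markup may look different.
sample_html = """
<html><head>
<meta name="keywords" content="Politics, UK News">
<meta name="news_keywords" content="Politics">
<meta name="parsely-pub-date" content="2024-04-28T13:00:00Z">
</head><body></body></html>
"""

collector = MetaTagCollector()
collector.feed(sample_html)

# Split the comma-separated keywords into a clean list of topics.
topics = [t.strip() for t in collector.meta.get("keywords", "").split(",") if t.strip()]
publishing_date = collector.meta.get("parsely-pub-date")

print(topics)
print(publishing_date)
```

The publishing date is still a plain string at this point and has to be parsed into a datetime separately.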
Thank you for your advice. I have added the topics and the publishing date. But I also encountered this error: `ValueError: Invalid isoformat string: '2024-04-28T13:00:00Z'`
But theoretically this should be in ISO 8601 format.
So, as it turns out I misunderstood you. The parsing error occurs when running pytest and generating the test files, not when running fundus itself. The reason this was happening is that we were using `datetime.datetime.fromisoformat()` in the backend, which did not like the JavaScript default of appending `Z` to indicate UTC. Using dateutil, this problem is solved. I took the liberty of making some minor changes as well :) Thanks a lot for adding this.
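For anyone who runs into the same `ValueError`: before Python 3.11, `datetime.fromisoformat()` does not accept the trailing `Z`, even though it is valid ISO 8601. Below is a small stdlib-only sketch of one possible workaround. The helper name `parse_iso_utc` is hypothetical and this is not the change that was made in fundus; with dateutil installed, `dateutil.parser.isoparse` accepts the `Z` suffix directly.

```python
from datetime import datetime, timezone


def parse_iso_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # Rewrite the 'Z' suffix as the equivalent explicit offset so that
    # older Python versions can parse it as well.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


parsed = parse_iso_utc("2024-04-28T13:00:00Z")
print(parsed.isoformat())  # prints 2024-04-28T13:00:00+00:00
```

The result is timezone-aware, so it compares cleanly against other UTC datetimes in the test files.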
@addie9800 It seems like someone added the date by hand; otherwise I can't think of how the `Z` ended up there. I overwrote the test case.
Yeah, you're right. That was sort of weird, this closes #505 though. It seemed a lot like that was the cause.
Hey, I'm trying to add a new publisher, "The Mirror", but I'm having trouble running this part:
python -m scripts.generate_parser_test_files -p TheMirror
It keeps showing 0% as if it is stuck and not making any progress. Does anyone have any good ideas?