Closed areinicke closed 1 week ago
You could consider also adding a function for topics, because within the json, theres a tag called keywords which would provide the necessary data
Everything you have implemented so far looks good. Now what still remains open is a function for the topics. If you add that you also need to run python -m scripts.generate_parser_test_files -p WashingtonPost -oj
to update the test cases. (My guess is that this is why the tests are failing atm as well). After all of that make sure to also run black .
to do any necessary reformatting.
Unfortunately, I am unsure on how to specifically extract the values of the "keywords" tag with the methods Fundus provides or without causing the topics method to be huge. I have tried several options but was unsuccessful so far. An alternative would be to just extract the "article:section" value from the meta section. However, this would be extremely broad and only return one topic per article, which is not ideal.
Additionally, adding the additional RSS Feeds you provided seems to have caused the main page of the Washington Post ( https://www.washingtonpost.com/ ) to be considered as an article as well. When this occurs, no article text or publishing date is returned obviously. Fundus will say "--missing plaintext--"
In the meantime, I have fixed the tests. They should run fine now.
You are right, for some reason the RSS Feeds sometimes don't contain the actual link to the article and just lead to the homepage. I don't know why. For this we have the url_filter attribute and since it was just something small, I added it to the PR. I'm sorry regarding the keywords, because I also couldn't find what I found the last time and it does really not make sense adding them. Sorry about that :)
I have added support for the US-Publisher "Washington Post" (https://www.washingtonpost.com/)
I have ran the tests as instructed and no errors were produced.