flairNLP / fundus

A very simple news crawler with a funny name
MIT License

Add Support for "Washington Post" #467

Closed areinicke closed 1 week ago

areinicke commented 3 weeks ago

I have added support for the US publisher "Washington Post" (https://www.washingtonpost.com/).

I have run the tests as instructed, and no errors were produced.

addie9800 commented 2 weeks ago

You could also consider adding a function for topics: within the JSON there's a tag called keywords that would provide the necessary data.

Everything you have implemented so far looks good. What still remains open is a function for the topics. If you add that, you also need to run python -m scripts.generate_parser_test_files -p WashingtonPost -oj to update the test cases. (My guess is that this is also why the tests are failing atm.) After all of that, make sure to also run black . to do any necessary reformatting.

areinicke commented 2 weeks ago

Unfortunately, I am unsure how to extract the values of the "keywords" tag with the methods Fundus provides, or how to do it otherwise without the topics method becoming huge. I have tried several options but have been unsuccessful so far. An alternative would be to extract just the "article:section" value from the meta section. However, this would be extremely broad and would return only one topic per article, which is not ideal.
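For illustration, here is a minimal sketch of what pulling a keywords entry out of a page's embedded JSON could look like using only the Python standard library. It assumes the data sits in a script tag of type application/ld+json and that keywords is either a comma-separated string or a list; neither assumption is confirmed for the Washington Post. The names LDJsonCollector and extract_keywords are hypothetical and are not part of Fundus.

```python
import json
from html.parser import HTMLParser
from typing import List


class LDJsonCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_ld = False
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld = False

    def handle_data(self, data):
        if self._in_ld:
            self.blocks[-1] += data


def extract_keywords(html: str) -> List[str]:
    """Return de-duplicated keywords from all JSON-LD blocks, in order of appearance."""
    collector = LDJsonCollector()
    collector.feed(html)
    collector.close()

    topics: List[str] = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            keywords = item.get("keywords", [])
            if isinstance(keywords, str):
                # Assumption: string values are comma-separated
                keywords = keywords.split(",")
            elif not isinstance(keywords, list):
                continue
            for keyword in keywords:
                keyword = str(keyword).strip()
                if keyword and keyword not in topics:
                    topics.append(keyword)
    return topics
```

Inside a Fundus parser, the topics method would presumably delegate to the library's own helpers rather than re-parse the HTML like this; the sketch only shows that the extraction logic itself can stay small.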

Additionally, adding the RSS feeds you provided seems to have caused the main page of the Washington Post (https://www.washingtonpost.com/) to be treated as an article as well. When this occurs, no article text or publishing date is returned, and Fundus reports "--missing plaintext--".

In the meantime, I have fixed the tests. They should run fine now.

addie9800 commented 2 weeks ago

You are right, for some reason the RSS feeds sometimes don't contain the actual link to the article and just lead to the homepage. I don't know why. This is what the url_filter attribute is for, and since it was just a small change, I added it to the PR. Regarding the keywords, I couldn't find again what I found the last time, so it really does not make sense to add them. Sorry about that :)
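As a rough illustration of the idea behind such a filter, the sketch below drops feed entries that point at a site's root instead of an article. It is plain standard-library Python and not the url_filter code that was added to the PR; the function name is_homepage and the second URL in the list are hypothetical.

```python
from urllib.parse import urlparse


def is_homepage(url: str) -> bool:
    """Return True when the URL has no path beyond '/' and no query string."""
    parsed = urlparse(url)
    return parsed.path in ("", "/") and not parsed.query


feed_links = [
    "https://www.washingtonpost.com/",
    # Hypothetical article URL, for illustration only
    "https://www.washingtonpost.com/politics/2024/01/01/example-article/",
]

# Keep only links that are not the site root
article_links = [link for link in feed_links if not is_homepage(link)]
```

Filtering on the URL before the page is parsed means the homepage never reaches the parser, so no empty "article" is produced for it.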