For each source, we should gather at least 200 articles.
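As a rough sketch of what gathering a fixed number of articles per source could look like: the `Crawler` / `PublisherCollection` usage below follows the library's README, but the `crawl(max_articles=...)` call and the publisher attribute name `TheIntercept` are assumptions and may differ from the interface at the time of this issue.

```python
from fundus import Crawler, PublisherCollection

# Gather a fixed number of articles for one source.
# `TheIntercept` is an illustrative attribute name; the available
# publishers live under PublisherCollection.us (see the test crawl below).
crawler = Crawler(PublisherCollection.us.TheIntercept)

articles = []
for article in crawler.crawl(max_articles=200):
    articles.append(article)

print(f"collected {len(articles)} articles")
```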
To verify #239 I did a little test crawl with ~20000 articles on PublisherCollection.us. The articles are in JSON format with the URL of the response as the key. They consist of publishing_date, title, body, authors and topics, with publishing_date, title and body being non-optional. Sadly, I forgot to save the logs, so no news on that side. Maybe someone wants to have a look at the data.
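In case it helps whoever picks this up, here is a minimal sketch of how the dump could be inspected. The file name `articles.json` and the single-file layout are assumptions based on the description above; the field names are taken from it.

```python
import json

# Load the dump: a JSON object mapping each article's response URL
# to its extracted attributes (assumed layout, see the description above).
with open("articles.json", "r", encoding="utf-8") as f:
    articles = json.load(f)

required = ("publishing_date", "title", "body")  # non-optional fields
optional = ("authors", "topics")                 # may be missing or empty

# Collect articles where a required field is missing or empty.
missing_required = {
    url: [field for field in required if not data.get(field)]
    for url, data in articles.items()
}
missing_required = {url: fields for url, fields in missing_required.items() if fields}

print(f"{len(articles)} articles in total")
print(f"{len(missing_required)} articles with missing required fields")
```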
@MaxDall the link is not reachable.
@lukasgarbas can you take a look at the data? Is it better suited for your project?
Yes that works, thanks! @lukasgarbas can you take a look?
@MaxDall I looked at the data, here are some statistics:
APNEWS num articles: 2540
REUTERS num articles: 2800
CNBC num articles: 2006
THENATION num articles: 1944
INTERCEPT num articles: 0
NEWYORKER num articles: 2160
FOXNEWS num articles: 2830
FREEBEACON num articles: 42
WASHINGTONTIMES num articles: 2829
WORLDTRUTH num articles: 10
THEGATEWAYPUNDIT num articles: 2825
It would be good to have more articles for The Intercept, Freebeacon and Worldtruth.
I also looked at the quality of the extracted text and found JavaScript code in some of the articles. Some improvements could probably be made to specific parsers. I will open a new issue for that with more details.
In my opinion, this issue has been solved. Maybe we can move the data to a more readily available place and close the issue?
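For reference, per-publisher counts like the ones above could be reproduced from the URL-keyed dump roughly as sketched below. This is only a sketch: it assumes the publisher can be identified from the domain of the article URL (which may not hold for every source) and reuses the assumed `articles.json` file name from above.

```python
import json
from collections import Counter
from urllib.parse import urlparse

with open("articles.json", "r", encoding="utf-8") as f:  # file name assumed
    articles = json.load(f)

# Group articles by the domain of their URL as a stand-in for the
# publisher (assumption: one domain per publisher).
counts = Counter(urlparse(url).netloc.removeprefix("www.") for url in articles)

for domain, n in counts.most_common():
    print(f"{domain}: {n} articles")
```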
It would be great to add some crawlers for English-language news to the library (once the guidelines are finalized).
For another research project on detecting political bias and reliability, we require crawlers for the following 12 sources: