flairNLP / fundus

A very simple news crawler with a funny name
MIT License

Example project: Corpus of political bias #54

Closed: alanakbik closed this issue 9 months ago

alanakbik commented 1 year ago

It would be great to add some crawlers for English-language news to the library (once the guidelines are finalized).

For another research project on detecting political bias and reliability, we require crawlers for the following 12 sources:

alanakbik commented 1 year ago

For each source, we should gather at least 200 articles.
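
A minimal sketch of how such a per-source crawl could look with fundus is below; the publisher attribute names are illustrative placeholders and may not match the actual entries in PublisherCollection.us.

```python
from fundus import Crawler, PublisherCollection

# Illustrative subset of the target sources; the exact attribute names
# in PublisherCollection.us may differ.
publishers = [
    PublisherCollection.us.APNews,
    PublisherCollection.us.FoxNews,
]

for publisher in publishers:
    crawler = Crawler(publisher)
    # gather at least 200 articles per source
    articles = list(crawler.crawl(max_articles=200))
    print(f"{publisher.name}: {len(articles)} articles")
```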

MaxDall commented 1 year ago

To verify #239, I did a small test crawl of ~20000 articles on PublisherCollection.us. The articles are stored in JSON format, keyed by the responded URL, and consist of publishing_date, title, body, authors, and topics, with publishing_date, title, and body being non-optional. Sadly, I forgot to save the logs, so no news on that side. Maybe someone wants to have a look at the data.
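
For reference, here is a rough sketch of how a dump in that shape could be produced with fundus. It assumes the crawled Article exposes the responded URL via article.html.responded_url and the listed fields as plain attributes; treat the attribute names and the output file name as assumptions rather than confirmed details.

```python
import json

from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)

data = {}
# small sample here; the test crawl above used ~20000 articles
for article in crawler.crawl(max_articles=100):
    data[article.html.responded_url] = {
        "publishing_date": str(article.publishing_date),
        "title": article.title,
        "body": str(article.body),  # article body rendered as plain text
        "authors": article.authors,
        "topics": article.topics,
    }

with open("us_test_crawl.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```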

alanakbik commented 1 year ago

@MaxDall the link is not reachable.

@lukasgarbas can you take a look at the data? Is it well suited for your project?

MaxDall commented 1 year ago

@lukasgarbas @alanakbik This one should work.

alanakbik commented 1 year ago

Yes that works, thanks! @lukasgarbas can you take a look?

lukasgarbas commented 1 year ago

@MaxDall I looked at the data; here are some statistics:

| Publisher | Articles |
| --- | ---: |
| APNEWS | 2540 |
| REUTERS | 2800 |
| CNBC | 2006 |
| THENATION | 1944 |
| INTERCEPT | 0 |
| NEWYORKER | 2160 |
| FOXNEWS | 2830 |
| FREEBEACON | 42 |
| WASHINGTONTIMES | 2829 |
| WORLDTRUTH | 10 |
| THEGATEWAYPUNDIT | 2825 |

It would be good to have more articles for The Intercept, Freebeacon and Worldtruth.

I also looked at the quality of the extracted text and found JavaScript code in some of the articles; the parsers for specific publishers could probably be improved. I will open a new issue for that with more details.
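
For anyone who wants to reproduce the counts or the JavaScript check from the dump, here is a small sketch; it assumes the JSON is keyed by the responded URL as described above, and the file name is just a placeholder.

```python
import json
from collections import Counter
from urllib.parse import urlparse

# Placeholder file name; the actual dump was shared via a link above.
with open("us_test_crawl.json", encoding="utf-8") as f:
    data = json.load(f)

# Count articles per source domain (the dump is keyed by the responded URL).
counts = Counter(urlparse(url).netloc for url in data)
for domain, count in counts.most_common():
    print(f"{domain}: {count} articles")

# Rough heuristic for leftover JavaScript in the extracted bodies.
js_markers = ("function(", "window.", "document.", "var ")
suspicious = [
    url
    for url, fields in data.items()
    if any(marker in fields["body"] for marker in js_markers)
]
print(f"{len(suspicious)} articles contain possible JavaScript fragments")
```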

Weyaaron commented 1 year ago

In my opinion, this issue has been resolved. Maybe we can move the data to a more readily accessible place and close the issue?