For each source, we should gather at least 200 articles.
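As a rough sketch of what gathering a fixed number of articles per source could look like: the `Crawler` / `PublisherCollection` usage below follows the library's README, but the `crawl(max_articles=...)` call and the publisher attribute name `TheIntercept` are assumptions and may differ from the interface at the time of this issue.

```python
from fundus import Crawler, PublisherCollection

# Gather a fixed number of articles for one source.
# `TheIntercept` is an illustrative attribute name; the available
# publishers live under PublisherCollection.us (see the test crawl below).
crawler = Crawler(PublisherCollection.us.TheIntercept)

articles = []
for article in crawler.crawl(max_articles=200):
    articles.append(article)

print(f"collected {len(articles)} articles")
```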
To verify #239 I did a little test crawl with ~20000 articles on PublisherCollection.us. The articles are in JSON format with the URL of the response as the key. They consist of publishing_date, title, body, authors and topics, with publishing_date, title and body being non-optional. Sadly, I forgot to save the logs, so no news on that side. Maybe someone wants to have a look at the data.
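In case it helps whoever picks this up, here is a minimal sketch of how the dump could be inspected. The file name `articles.json` and the single-file layout are assumptions based on the description above; the field names are taken from it.

```python
import json

# Load the dump: a JSON object mapping each article's response URL
# to its extracted attributes (assumed layout, see the description above).
with open("articles.json", "r", encoding="utf-8") as f:
    articles = json.load(f)

required = ("publishing_date", "title", "body")  # non-optional fields
optional = ("authors", "topics")                 # may be missing or empty

# Collect articles where a required field is missing or empty.
missing_required = {
    url: [field for field in required if not data.get(field)]
    for url, data in articles.items()
}
missing_required = {url: fields for url, fields in missing_required.items() if fields}

print(f"{len(articles)} articles in total")
print(f"{len(missing_required)} articles with missing required fields")
```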
@MaxDall the link is not reachable.
@lukasgarbas can you take a look at the data? Is it better suited for your project?
Yes that works, thanks! @lukasgarbas can you take a look?
@MaxDall I looked at the data, here are some statistics:
APNEWS num articles: 2540
REUTERS num articles: 2800
CNBC num articles: 2006
THENATION num articles: 1944
INTERCEPT num articles: 0
NEWYORKER num articles: 2160
FOXNEWS num articles: 2830
FREEBEACON num articles: 42
WASHINGTONTIMES num articles: 2829
WORLDTRUTH num articles: 10
THEGATEWAYPUNDIT num articles: 2825
It would be good to have more articles for The Intercept, Freebeacon and Worldtruth.
I also looked at the quality of the extracted text and found JavaScript code in some of the articles. Some improvements could probably be made to specific parsers. I will open a new issue for that with more details.
In my opinion, this issue has been solved. Maybe we can move the data to a more readily available place and close the issue?
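For reference, per-publisher counts like the ones above could be reproduced from the URL-keyed dump roughly as sketched below. This is only a sketch: it assumes the publisher can be identified from the domain of the article URL (which may not hold for every source) and reuses the assumed `articles.json` file name from above.

```python
import json
from collections import Counter
from urllib.parse import urlparse

with open("articles.json", "r", encoding="utf-8") as f:  # file name assumed
    articles = json.load(f)

# Group articles by the domain of their URL as a stand-in for the
# publisher (assumption: one domain per publisher).
counts = Counter(urlparse(url).netloc.removeprefix("www.") for url in articles)

for domain, n in counts.most_common():
    print(f"{domain}: {n} articles")
```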
It would be great to add some crawlers for English-language news to the library (once the guidelines are finalized).
For another research project on detecting political bias and reliability, we require crawlers for the following 12 sources: