Leibniz-HBI / newsfeedback

Tool for extracting and saving news article metadata (and optionally content) at regular intervals.
MIT License

Ensure tool functionality with "big" German news media outlets #3

Closed: rwinterschlaf closed this issue 1 year ago

rwinterschlaf commented 2 years ago

Find out which news media outlets are most popular and make sure that the extraction picks up their articles. I am quite sure that they have RSS feeds, so this issue shouldn't be too tricky to resolve.
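As a quick way to check whether an outlet's RSS feed actually exposes article links, something like the following could be used. This is only a minimal sketch, not the tool's own extraction code: the function name `extract_feed_links` and the sample feed are made up for illustration, and it assumes a plain RSS 2.0 document (Atom feeds use namespaces and would need different handling).

```python
import xml.etree.ElementTree as ET


def extract_feed_links(rss_xml: str) -> list[str]:
    """Return the article URLs listed in an RSS 2.0 document.

    Hypothetical helper for illustration; not part of newsfeedback.
    """
    root = ET.fromstring(rss_xml)
    links = []
    # RSS 2.0 lists articles as <rss><channel><item><link>...</link></item>.
    for item in root.iter("item"):
        link = item.findtext("link")
        # Skip items without a usable <link> element.
        if link and link.strip():
            links.append(link.strip())
    return links


# Made-up sample feed; a real check would load the outlet's feed instead.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example outlet</title>
    <item><title>First</title><link>https://example.org/a</link></item>
    <item><title>Second</title><link> https://example.org/b </link></item>
    <item><title>No link</title></item>
  </channel>
</rss>"""

print(extract_feed_links(SAMPLE_FEED))
```

An outlet whose feed returns an empty list here would be a candidate for closer inspection before assuming that its articles are picked up.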

rwinterschlaf commented 1 year ago

The following reports by IVW and agof will be used to determine which online news media outlets in Germany can be considered "most popular". Listing them here for easy reference:

https://de.statista.com/statistik/daten/studie/165258/umfrage/reichweite-der-meistbesuchten-nachrichtenwebsites/ and https://de.statista.com/statistik/daten/studie/154154/umfrage/anzahl-der-visits-von-nachrichtenportalen/

rwinterschlaf commented 1 year ago

Out of fifteen online newspapers (media groups have been excluded for now), only one, ZEIT, has caused major issues. Its Pur Abo function seems to bar scraping of the actual page(s), as the tool does not accept the Abo's terms and conditions. This is an issue that will need to be looked at in detail. @FlxVctr we should see if we can figure out a solution to this, though it might also be something that has to be addressed on trafilatura's end.

Minor issues persist in the extracted data (occasional missing texts for n-tv articles), but the majority of the data seems to be correct. Filtering of rubrics, as mentioned in https://github.com/Leibniz-HBI/newsfeedback/issues/5, will cut down the bulk of the data significantly and show us what we are working with.
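To put a number on "occasional missing texts", the share of empty article texts per outlet could be computed roughly as follows. This is a sketch under assumptions: the record keys `outlet` and `text` and the function name `missing_text_share` are hypothetical and may not match the tool's actual output format, and the sample records are invented placeholders, not real extraction results.

```python
from collections import Counter


def missing_text_share(records):
    """Return, per outlet, the fraction of records with empty or missing text.

    Assumes each record is a dict with hypothetical keys "outlet" and "text".
    """
    totals = Counter()
    missing = Counter()
    for rec in records:
        outlet = rec.get("outlet", "unknown")
        totals[outlet] += 1
        text = rec.get("text")
        # Treat None, "" and whitespace-only strings as missing text.
        if not text or not text.strip():
            missing[outlet] += 1
    return {outlet: missing[outlet] / totals[outlet] for outlet in totals}


# Invented placeholder records, for illustration only.
sample_records = [
    {"outlet": "outlet_a", "text": ""},
    {"outlet": "outlet_a", "text": "Article body"},
    {"outlet": "outlet_a", "text": None},
    {"outlet": "outlet_b", "text": "Article body"},
]

shares = missing_text_share(sample_records)
print(shares)
```

Running this over the extracted data before and after rubric filtering would show whether the missing texts are concentrated in particular outlets.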

rwinterschlaf commented 1 year ago

The remaining complication of this issue has been moved to https://github.com/Leibniz-HBI/newsfeedback/issues/9, so I am closing this one accordingly.