Closed by rwinterschlaf 1 year ago
The following reports by IVW and agof will be used to determine which online news media outlets in Germany can be considered "most popular". Listing them here for easy reference:
https://de.statista.com/statistik/daten/studie/165258/umfrage/reichweite-der-meistbesuchten-nachrichtenwebsites/ and https://de.statista.com/statistik/daten/studie/154154/umfrage/anzahl-der-visits-von-nachrichtenportalen/
Out of fifteen online newspapers (media groups have been excluded for now), only one, ZEIT, has caused major issues. The Pur Abo function seems to block scraping of the actual page(s), because the tool does not accept the Pur Abo's terms and conditions. This will need to be looked at in detail. @FlxVctr we should see if we can figure out a solution to this, though it might also be an issue on trafilatura's end.
Minor article issues persist throughout the extracted data (occasional missing texts for n-tv articles), but the majority of the data seems to be correct. Filtering of rubrics, as mentioned in https://github.com/Leibniz-HBI/newsfeedback/issues/5, will cut down the bulk of the data significantly and show us what we are working with.
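For illustration, a rough sketch of what rubric filtering could look like. It assumes the rubric is encoded as the first path segment of the article URL, which will not hold for every outlet. The names `EXCLUDED_RUBRICS`, `rubric_of` and `filter_by_rubric` are hypothetical and not part of newsfeedback or trafilatura.

```python
from urllib.parse import urlparse

# Hypothetical rubric names, used only for this sketch.
EXCLUDED_RUBRICS = {"sport", "video"}


def rubric_of(url):
    """Return the first path segment of the URL, lowercased ('' if there is none).

    Assumption: the outlet encodes the rubric as the first path segment.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0].lower() if segments else ""


def filter_by_rubric(urls, excluded=EXCLUDED_RUBRICS):
    """Keep only the URLs whose rubric is not in the excluded set."""
    return [u for u in urls if rubric_of(u) not in excluded]


urls = [
    "https://example.com/politik/article-1.html",
    "https://example.com/Sport/article-2.html",
    "https://example.com/",
]
kept = filter_by_rubric(urls)
```

Here `kept` retains the `politik` article and the bare homepage URL, since only the `sport` rubric matches the excluded set.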
The remaining complication of this issue has been moved to https://github.com/Leibniz-HBI/newsfeedback/issues/9, so I am closing this one accordingly.
Find out which news media outlets are most popular and make sure that their articles are picked up by the extraction. I am quite sure that they have RSS feeds, so this issue shouldn't be too tricky to resolve.
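If the outlets do expose RSS 2.0 feeds, collecting the article URLs could look roughly like the sketch below. It uses only the Python standard library. The helper `article_links` and the sample feed are made up for illustration, and Atom feeds would need namespace handling that is not covered here.

```python
import xml.etree.ElementTree as ET


def article_links(rss_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        # Skip items without a usable link.
        if link and link.strip():
            links.append(link.strip())
    return links


# Made-up feed content, standing in for a downloaded feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First article</title>
      <link>https://example.com/politik/a1</link>
    </item>
    <item>
      <title>Second article</title>
      <link> https://example.com/sport/a2 </link>
    </item>
    <item>
      <title>Item without link</title>
    </item>
  </channel>
</rss>"""

links = article_links(SAMPLE_FEED)
```

The resulting list could then be handed to whatever extraction step is used for the article pages.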