Look into web scraping tools by Barbaresi

FlxVctr commented 4 years ago

Presented during lunch lecture 11. Dec 2019:

https://github.com/adbar/trafilatura Web scraping library: downloads web pages, finds main text and comments, converts to TXT, XML & TEI http://adrien.barbaresi.eu/research.html

https://github.com/adbar/htmldate Find the creation date of web pages using common structural patterns, text-based heuristics and robust date extraction

manilevian commented 4 years ago

Hi,

Trafilatura is pretty much something like "NewsPlease" or "Scrapy" (I guess :D), both of which are already covered in our General News Scraper section. The scraper is lightweight: it only scrapes the HTML code and sorts its elements. I don't know if that's enough for it to be added.

In my eyes, we could add it as a "Blog" scraper or similar, but it wouldn't work smoothly on more complex sites :). I tried it on NYT and SPON and it did scrape the article, but it occasionally fails when more comments or the rest of an article only load on scrolling, or when the article continues on a second page. On some random WordPress sites it worked fine, and also on certain SPON pages. It depends very much on the structure of the website, so it will be, imo, hard to explain to people when and where to use this tool!
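To make the limitation concrete, here is a minimal stdlib-only sketch (not trafilatura's actual code; `StaticPageParser`, `parse_static` and the sample HTML are made up for illustration). It assumes the scraper only sees the HTML as delivered, so a comment container that is filled on scroll stays empty, and a second page only shows up as a `rel="next"` link that has to be followed separately.

```python
from html.parser import HTMLParser


class StaticPageParser(HTMLParser):
    """Collect <p> text and a rel="next" link from one static HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.next_page = None
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._in_p = True
            self._buf = []
        elif tag in ("a", "link") and "next" in (attrs.get("rel") or "").split():
            # Pagination hint: the article continues on another URL.
            self.next_page = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


# Hypothetical page: two paragraphs, a second page, comments loaded on scroll.
SAMPLE_HTML = """
<html><head><link rel="next" href="/article?page=2"></head>
<body><article>
<p>First paragraph of the article.</p>
<p>Second paragraph of the article.</p>
</article>
<div id="comments" data-load-on-scroll="true"></div>
</body></html>
"""


def parse_static(html):
    """Return (paragraphs, next_page_url_or_None) for one HTML string."""
    parser = StaticPageParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs, parser.next_page


paragraphs, next_page = parse_static(SAMPLE_HTML)
print(paragraphs)  # only the text present in the delivered HTML
print(next_page)   # /article?page=2
```

Anything behind the scroll-loaded container never appears in `paragraphs`, which matches what I saw on the more complex sites.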

htmldate: This module can be very useful when programming a general news scraper. But doesn't the dateutil package (python-dateutil, which extends the standard datetime module) pretty much do the same?

adbar commented 4 years ago

Hi, I just stumbled upon this issue, thanks for your interest! Your wiki page seems quite relevant.

I can answer some of your concerns: trafilatura has gotten better since your test, both in terms of accuracy and functionality. A colleague and I also did some research on text extraction and found that the best solution depends heavily on variables such as language or country of publication, as well as text type:

See here for an evaluation of text extraction with Python packages
This article documents an evaluation on a different, multilingual dataset
This one includes a recent benchmark with trafilatura (in French but the tables should be readable, a publication in English will follow)

As for htmldate, the package does a lot more than dateutil alone, most notably smart and/or exhaustive searches through metadata and text (if required). It performs much better than comparable algorithms on my test data (mostly German webpages); please refer to this evaluation.
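To illustrate the difference, here is a rough stdlib-only sketch of the general idea (not htmldate's actual implementation; `guess_date`, `MetaDateParser` and the key list are hypothetical and far from exhaustive). A date parser needs to be handed a date string, whereas this kind of search first has to locate candidates in the metadata and only then falls back to the page text.

```python
import re
from datetime import datetime
from html.parser import HTMLParser

# Assumed, non-exhaustive set of metadata keys that often carry a date.
DATE_META_KEYS = {"article:published_time", "date", "dc.date", "datepublished"}

# YYYY-MM-DD not embedded in a longer digit run (a trailing "T..." is allowed).
ISO_DATE = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


class MetaDateParser(HTMLParser):
    """Collect date candidates from <meta>/<time> tags, plus all text nodes."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name") or attrs.get("itemprop") or ""
            if key.lower() in DATE_META_KEYS and attrs.get("content"):
                self.candidates.append(attrs["content"])
        elif tag == "time" and attrs.get("datetime"):
            self.candidates.append(attrs["datetime"])

    def handle_data(self, data):
        self.text.append(data)


def _first_valid_date(strings):
    """Return the first calendar-valid ISO date found, as YYYY-MM-DD."""
    for s in strings:
        for m in ISO_DATE.finditer(s):
            try:
                return datetime(int(m[1]), int(m[2]), int(m[3])).strftime("%Y-%m-%d")
            except ValueError:
                continue  # e.g. 2020-02-30 is not a real date
    return None


def guess_date(html):
    """Try metadata first, then fall back to a scan of the visible text."""
    parser = MetaDateParser()
    parser.feed(html)
    parser.close()
    return _first_valid_date(parser.candidates) or _first_valid_date(["".join(parser.text)])


with_meta = '<meta property="article:published_time" content="2019-12-11T12:00:00+01:00">'
text_only = "<p>Published on 2020-02-30, corrected on 2020-03-01.</p>"
print(guess_date(with_meta))  # 2019-12-11
print(guess_date(text_only))  # 2020-03-01
```

The real package covers many more patterns and date formats than this single ISO regex; the sketch only shows why locating the date is a separate problem from parsing it.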

I completely agree with you, it's hard to know when and where to use the tools. That being said, it seems there are more efficient tools than newspaper or news-please at the moment, especially for languages other than English.

FlxVctr commented 4 years ago

@adbar: Thx for bringing those evaluations to our attention. We'll have a second look into it. Also, feel free to propose changes/other additions to the News Scraper wiki page.

FlxVctr commented 2 years ago

@rwinterschlaf Can you please have a look at how it performs on our online media lists? Just with a small sample.

rwinterschlaf commented 2 years ago

Task has been moved into issue https://github.com/Leibniz-HBI/SMO_PM/issues/7. Closing this issue again :)
