FlxVctr closed this issue 2 years ago
Hi,
Trafilatura is pretty much something like "NewsPlease" or "Scrapy" (I guess :D), both of which are already covered in our General News Scraper section. The scraper is lightweight: it only scrapes the HTML code and sorts its elements. I don't know if that's enough for it to be added.
In my eyes, we could add it as a "Blog" scraper or similar, but it wouldn't work smoothly on more complex sites :). I tried it on NYT and SPON and it did scrape the article, but it occasionally fails when you have to scroll for more comments or for the full text of an article (on some sites you have to scroll to get more text, or the article continues on a second page). On some random WordPress sites it worked fine, and also on certain SPON pages. It is very dependent on the structure of the website and it will be, imo, hard to explain to people when and where to use this tool!
htmldate: This module can be very useful when programming a general news scraper. But doesn't "timedate" have a package called dateutil that pretty much does the same?
Hi, I just stumbled upon this issue, thanks for your interest! Your wiki page seems quite relevant.
I can answer some of your concerns: trafilatura has gotten better since your test, both in terms of accuracy and functionality.
With a colleague of mine, I also did some research on text extraction and found that the right choice of solution depends heavily on variables such as language or country of publication as well as text type: trafilatura (in French, but the tables should be readable; a publication in English will follow).
As for htmldate, the package does a lot more than just dateutil, most notably smart and/or exhaustive searches through metadata and text (if required). It performs much better than comparable algorithms on my test data (mostly German webpages); please refer to this evaluation.
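To illustrate the difference between parsing a date string (what a parser like dateutil does) and searching a whole page for its date, here is a minimal standard-library sketch. It is not htmldate's actual algorithm: the list of metadata keys, the ISO-only pattern, and the names `META_DATE_KEYS` and `find_page_date` are assumptions made up for this example. It first looks at metadata (`<meta>` and `<time>` tags) and only falls back to the visible text if nothing is found there.

```python
import re
from datetime import date
from html.parser import HTMLParser

# Hypothetical list of metadata keys that often carry a publication date.
META_DATE_KEYS = {"article:published_time", "date", "dc.date", "datepublished"}

# Only ISO-style dates (YYYY-MM-DD) are recognized in this sketch.
ISO_DATE = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


class _DateCollector(HTMLParser):
    """Collects date candidates from metadata and the visible page text."""

    def __init__(self):
        super().__init__()
        self.meta_candidates = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name") or attrs.get("itemprop") or ""
            if key.lower() in META_DATE_KEYS and attrs.get("content"):
                self.meta_candidates.append(attrs["content"])
        elif tag == "time" and attrs.get("datetime"):
            self.meta_candidates.append(attrs["datetime"])
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def _first_valid_date(strings):
    """Return the first calendar-valid ISO date found in the given strings."""
    for text in strings:
        for match in ISO_DATE.finditer(text):
            try:
                return date(*map(int, match.groups()))
            except ValueError:  # e.g. month 13: keep searching
                continue
    return None


def find_page_date(html):
    """Search metadata first, then fall back to the visible text."""
    parser = _DateCollector()
    parser.feed(html)
    parser.close()
    return (_first_valid_date(parser.meta_candidates)
            or _first_valid_date(["".join(parser.text_parts)]))
```

On a page that has both a `<meta property="article:published_time" ...>` tag and a later "updated" date in the body, this sketch prefers the metadata value. A plain date-string parser has no notion of where in the page a date comes from, which is the gap a dedicated tool is meant to fill.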
I completely agree with you, it's hard to know when and where to use the tools. That being said, it seems there are more efficient tools than newspaper or news-please at the moment, especially for languages other than English.
@adbar: Thx for bringing those evaluations to our attention. We'll have a second look into it. Also, feel free to propose changes/other additions to the News Scraper wiki page.
@rwinterschlaf Can you please have a look at how it performs on our online media lists? Just with a small sample.
Task has been moved into issue https://github.com/Leibniz-HBI/SMO_PM/issues/7. Closing this issue again :)
Presented during the lunch lecture on 11 Dec 2019:
https://github.com/adbar/trafilatura Web scraping library: downloads web pages, finds main text and comments, converts to TXT, XML & TEI http://adrien.barbaresi.eu/research.html
https://github.com/adbar/htmldate Find the creation date of web pages using common structural patterns, text-based heuristics and robust date extraction