Leibniz-HBI / smo-wiki

Generates a GitHub Page from the Social Media Observatory Wiki with Bash, Python, regexes and Jekyll.
https://smo-wiki.leibniz-hbi.de

Look into web scraping tools by Barbaresi #47

Closed by FlxVctr 2 years ago

FlxVctr commented 4 years ago

Presented during the lunch lecture on 11 Dec 2019:

trafilatura (https://github.com/adbar/trafilatura): web scraping library that downloads web pages, finds the main text and comments, and converts them to TXT, XML and TEI. See also http://adrien.barbaresi.eu/research.html

htmldate (https://github.com/adbar/htmldate): finds the creation date of web pages using common structural patterns, text-based heuristics and robust date extraction.
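For anyone who wants to try both libraries quickly, here is a minimal sketch. It assumes both packages are installed (`pip install trafilatura htmldate`); the helper name `extract_page` is made up for this example, and the libraries' defaults may differ between versions.

```python
def extract_page(url):
    """Return (main_text, date) for a URL, or (None, None) if the download fails.

    Hypothetical helper for illustration; not part of either library.
    """
    # Third-party imports are kept local so the module loads without them.
    import trafilatura
    from htmldate import find_date

    downloaded = trafilatura.fetch_url(url)  # HTML as a string, or None on failure
    if downloaded is None:
        return None, None

    # Main text of the page; include_comments also keeps user comments if found.
    text = trafilatura.extract(downloaded, include_comments=True)
    # Best guess for the publication date as a string, or None if nothing is found.
    date = find_date(downloaded)
    return text, date
```

Calling `extract_page("https://example.org/some-article")` (placeholder URL) should give a quick impression of how well a given site is handled.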

manilevian commented 4 years ago

Hi,

Trafilatura is pretty much like news-please or Scrapy (I guess :D), both of which are already covered in our General News Scraper section. The scraper is lightweight: it only fetches the HTML code and sorts its elements. I don't know if that's enough for it to be added.

In my eyes, we could add it as a "blog" scraper or similar, but it wouldn't work smoothly on more complex sites :). I tried it on NYT and SPON and it did scrape the article. But when you have to scroll to load more comments or the rest of the text, or when the article continues on a second page, it will occasionally fail. It worked fine on some random WordPress sites and on certain SPON pages. It depends very much on the structure of the website, and it will be, imo, hard to explain to people when and where to use this tool!

htmldate: This module can be very useful when programming a general news scraper. But isn't there already a package called dateutil (python-dateutil, which builds on `datetime`) that does pretty much the same?

adbar commented 4 years ago

Hi, I just stumbled upon this issue, thanks for your interest! Your wiki page seems quite relevant.

I can answer some of your concerns: trafilatura has gotten better since your test, both in terms of accuracy and functionality. A colleague and I also did some research on text extraction and found that the best solution depends heavily on variables such as language or country of publication as well as text type:

As for htmldate, the package does a lot more than dateutil, most notably smart and/or exhaustive searches through metadata and text (if required). It performs much better than comparable algorithms on my test data (mostly German webpages), please refer to this evaluation.
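To make the difference from a plain date parser concrete: a parser turns a date string into a date object, whereas a date finder first has to locate a plausible date in the page. The toy sketch below uses only the standard library, checks a few common meta tags and then falls back to the visible text. It is not htmldate's actual algorithm; the list of meta keys, the ISO-only pattern and the function names are assumptions made for illustration.

```python
import re
from datetime import date
from html.parser import HTMLParser

# Assumed, non-exhaustive list of meta keys that often carry a publication date.
META_KEYS = {"article:published_time", "og:published_time", "date", "dc.date"}
# Only ISO-style dates (YYYY-MM-DD), not embedded in a longer run of digits.
ISO_DATE = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


class _DateCandidateParser(HTMLParser):
    """Collects date-like meta tag contents and all text nodes."""

    def __init__(self):
        super().__init__()
        self.meta_values = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = (attributes.get("property") or attributes.get("name") or "").lower()
        if key in META_KEYS and attributes.get("content"):
            self.meta_values.append(attributes["content"])

    def handle_data(self, data):
        self.text_chunks.append(data)


def _first_valid_date(text):
    """Return the first calendar-valid ISO date in text, or None."""
    for match in ISO_DATE.finditer(text):
        try:
            return date(int(match[1]), int(match[2]), int(match[3]))
        except ValueError:  # e.g. month 13
            continue
    return None


def guess_date(html):
    """Metadata first, page text as a fallback."""
    parser = _DateCandidateParser()
    parser.feed(html)
    for value in parser.meta_values:
        found = _first_valid_date(value)
        if found is not None:
            return found
    return _first_valid_date(" ".join(parser.text_chunks))
```

Real pages use many more date formats and locations (URLs, `<time>` elements, JSON-LD, free text in several languages), which is where a dedicated library earns its keep.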

I completely agree with you, it's hard to know when and where to use the tools. That being said, it seems there are more efficient tools than newspaper or news-please at the moment, especially for languages other than English.

FlxVctr commented 4 years ago

@adbar: Thx for bringing those evaluations to our attention. We'll have a second look into it. Also, feel free to propose changes/other additions to the News Scraper wiki page.

FlxVctr commented 2 years ago

@rwinterschlaf Can you please have a look at how it performs on our online media lists? Just with a small sample.
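A possible way to run such a spot check, sketched with the standard library only: draw a reproducible random sample from a URL list and count how often an extractor returns non-empty text. The function names are hypothetical, and the `extractor` argument is a placeholder for whatever tool is being tested (for example a small wrapper around trafilatura).

```python
import random


def sample_urls(urls, k, seed=0):
    """Draw up to k distinct URLs; a fixed seed keeps the sample reproducible."""
    urls = list(urls)
    rng = random.Random(seed)
    return rng.sample(urls, min(k, len(urls)))


def success_rate(urls, extractor):
    """Share of URLs for which extractor(url) returns non-empty text."""
    if not urls:
        return 0.0
    successes = 0
    for url in urls:
        try:
            text = extractor(url)
        except Exception:  # a crashing extractor counts as a failure
            text = None
        if text and text.strip():
            successes += 1
    return successes / len(urls)
```

Non-empty output is only a rough proxy: it says nothing about whether the extracted text is complete or correct, so a manual look at a few results is still needed.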

rwinterschlaf commented 2 years ago

Task has been moved into issue https://github.com/Leibniz-HBI/SMO_PM/issues/7. Closing this issue again :)