dragnet-org / dragnet

Just the facts -- web page content extraction
MIT License
1.25k stars 179 forks

potential improvements / new features to the extraction model? #86

Open bdewilde opened 5 years ago

bdewilde commented 5 years ago

I was doing a quick lit review to see if/how the state-of-the-art in web content extraction had changed over the past few years, and came upon a conference paper from last September, Learning Web Content Extraction with DOM Features, that seems interesting, relevant, and performant. There's also code: see learnhtml. Is there any interest in implementing its feature set within dragnet, and evaluating model performance with such features? This could be related to updates proposed in Issue #85.
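To make "DOM features" a bit more concrete, here is a rough sketch of per-element features computed from the tree structure. It is not the paper's feature set, not learnhtml's code, and not part of dragnet's API: the class `DomFeatureCollector`, the function `dom_features`, and the chosen features (depth, parent tag, direct text length, child count) are all hypothetical and only illustrate the kind of input such a model might use. It relies on the standard library's `html.parser` alone.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped here for simplicity,
# so they are not counted as children either.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomFeatureCollector(HTMLParser):
    """Collect a few illustrative per-element features while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []  # elements that are currently open
        self.rows = []   # finished feature rows, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = {
            "tag": tag,
            "depth": len(self.stack),
            "parent_tag": self.stack[-1]["tag"] if self.stack else None,
            "text_len": 0,    # length of text directly inside this element
            "n_children": 0,  # number of non-void child elements
        }
        if self.stack:
            self.stack[-1]["n_children"] += 1
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Ignore stray end tags that match nothing currently open.
        if not any(node["tag"] == tag for node in self.stack):
            return
        # Close elements up to and including the matching one.
        while self.stack:
            node = self.stack.pop()
            self.rows.append(node)
            if node["tag"] == tag:
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text_len"] += len(data.strip())


def dom_features(html):
    """Return one feature dict per element found in the HTML string."""
    collector = DomFeatureCollector()
    collector.feed(html)
    collector.close()
    # Flush any elements the document left unclosed.
    while collector.stack:
        collector.rows.append(collector.stack.pop())
    return collector.rows


if __name__ == "__main__":
    sample = "<html><body><div><p>Hello world</p><a href='#'>nav</a></div></body></html>"
    for row in dom_features(sample):
        print(row)
```

Rows like these could, in principle, be turned into a feature matrix and compared against the existing block-level features on the same labeled data, which is roughly the evaluation asked about above.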

matt-peters commented 5 years ago

Anything that improves the performance is very welcome!

acertain commented 2 years ago

Another new package: https://github.com/adbar/trafilatura. See also https://github.com/scrapinghub/article-extraction-benchmark