JointEntropy / html-extractor

Existing solutions #1

Open JointEntropy opened 3 years ago

JointEntropy commented 3 years ago

https://stackoverflow.com/questions/1962389/what-is-the-state-of-the-art-in-html-content-extraction

Links:
https://pypi.org/project/boilerpy3/
https://github.com/kohlschutter/boilerpipe
https://github.com/rodricios/eatiht

https://www.guru99.com/web-scraping-tools.html
https://github.com/dragnet-org/dragnet
https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht
https://github.com/jiminoc/goose
https://github.com/grangier/python-goose
https://github.com/mozilla/readability
http://www2013.w3c.br/companion/p89.pdf

Demos:
http://boilerpipe-web.appspot.com/
http://jimplush.com/blog/goose
http://juicer.herokuapp.com/

Papers and presentations:
http://www.l3s.de/~kohlschuetter/boilerplate/
https://www.researchgate.net/publication/220072597_ViDE_A_Vision-Based_Approach_for_Deep_Web_Data_Extraction
https://www.researchgate.net/profile/Hamza_Aldabbas/publication/344711925_An_Efficient_Mechanism_for_Product_Data_Extraction_from_E-Commerce_Websites/links/5f8ae2b3a6fdccfd7b65b123/An-Efficient-Mechanism-for-Product-Data-Extraction-from-E-Commerce-Websites.pdf

Datasets: L3S-GN1

JointEntropy commented 3 years ago

https://www.semantics3.com/blog/ai-for-automated-web-crawling/

JointEntropy commented 3 years ago

https://www.researchgate.net/publication/220072597_ViDE_A_Vision-Based_Approach_for_Deep_Web_Data_Extraction