adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Footer removal #617

Closed: hamsarajan closed this issue 1 month ago

hamsarajan commented 2 months ago

I've encountered an issue when extracting text in Vietnamese. It appears that a portion of the footer of a webpage is still being extracted. An example would be this article: https://myrica.vn/. Would appreciate your help in this matter!

Snippet of the Vietnamese text extracted: Nôi cũi Ghế ngồi ô tô Mua ở đâu Đăng ký serial Xe đẩy trẻ em Myrica ALB1VN - hồng phấn Xe đẩy trẻ em Myrica ALB1VN - xanh mint Xe đẩy trẻ em Myrica ALB1VN - ghi xám Tại sao chọn Công ty TNHH ĐT Và TM QT Thanh Mai Đỏ Số 32 Yên Ninh, Trúc Bạch, Ba Đình, Hà Nội Ngày cấp: 27/11/2018, Sở KHĐTHN Kết nối: Bảo hành / Đổi trả hàng Câu hỏi thường gặp FAQ

Tầm nhìn, sứ mệnh
Bản quyền thuộc về Myrica co., ltd

Translation: Crib Car seats Where to buy Register serial Myrica ALB1VN baby stroller - pink Myrica ALB1VN baby stroller - mint green Myrica ALB1VN baby stroller - gray Why choose Thanh Mai Do International Investment and Trading Company Limited No. 32 Yen Ninh, Truc Bach, Ba Dinh, Hanoi Date of issue: November 27, 2018, Hanoi Department of Planning and Investment Connect: Warranty / Returns Frequently asked questions FAQ

Vision, mission
Copyright belongs to Myrica co., ltd

adbar commented 2 months ago

Trafilatura does not work as well on index pages and catalogs because there is no main article to extract. That being said, there is an issue here: all extraction methods fail, and all the text on the page then ends up in the output.
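
Until this is fixed, one way to notice this failure mode on the caller's side is to compare how much of the page's visible text ends up in the extracted output. The following is only a rough sketch using the standard library, not part of Trafilatura: the helper names (`page_text`, `looks_like_full_page_fallback`) and the 0.9 threshold are assumptions made for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    """Return all visible text of the page as one string (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


def looks_like_full_page_fallback(html, extracted, threshold=0.9):
    """Heuristic: True if the extracted text covers nearly all visible page text.

    The threshold is an arbitrary assumption; whitespace is ignored so that
    layout differences do not affect the ratio.
    """
    total = len("".join(page_text(html).split()))
    if total == 0:
        return False
    kept = len("".join(extracted.split()))
    return kept / total >= threshold
```

In practice `extracted` would be the string returned by `trafilatura.extract(html)`. A result flagged by this check could then be discarded or handled separately, for example by treating the page as an index page rather than an article.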