adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

weird xml extraction #632

Closed by fortyfourforty 5 months ago

fortyfourforty commented 5 months ago

Example url: https://www.dummies.com/article/home-auto-hobbies/home-improvement-appliances/electrical/how-to-replace-a-light-switch-185346/

Command: trafilatura.extract(page_source, output_format='xml', include_comments=False)

Problem: The output does not read like regular XML.
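
To make the report easier to reproduce, here is a minimal sketch that extracts a page as XML and then summarizes which elements the result contains. The helper names `summarize_xml` and `reproduce` are hypothetical and not part of trafilatura, and the sample string is illustrative input rather than real extraction output; only `trafilatura.fetch_url` and `trafilatura.extract` with the arguments from the command above are assumed to exist.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(xml_text):
    """Parse an XML string and count how often each element tag occurs.

    Hypothetical helper: raises ET.ParseError if the text is not well-formed.
    """
    root = ET.fromstring(xml_text)
    return root.tag, Counter(el.tag for el in root.iter())


def reproduce(url):
    """Download a page and extract it as XML, mirroring the command in the report."""
    import trafilatura  # imported lazily so summarize_xml works without it

    page_source = trafilatura.fetch_url(url)
    if page_source is None:  # download failed
        return None
    return trafilatura.extract(
        page_source, output_format="xml", include_comments=False
    )


# Illustrative input only, not real trafilatura output.
sample = "<doc><main><head>Title</head><p>One</p><p>Two</p></main></doc>"
root_tag, counts = summarize_xml(sample)
print(root_tag, dict(counts))
```

Passing the result of `reproduce(url)` to `summarize_xml` gives a quick view of whether the output is well-formed and which tags dominate, which may help describe what "not regular" means for a given page.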

adbar commented 5 months ago

Yes, there is something wrong with the extraction here.

adbar commented 5 months ago

The main extractor is not impacted; readability_lxml extracts the wrong content. I will implement a quick fix.
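
Until a fix is released, one way to check whether a fallback extractor is responsible for a given page is to run the extraction twice, once with fallbacks and once without. The sketch below assumes a trafilatura release in which `extract()` accepts the `no_fallback` argument (newer releases may prefer a differently named option for the same purpose). The function `compare_fallback` and its `extract_fn` parameter are hypothetical and exist only to keep the example testable without the library installed.

```python
def compare_fallback(html, extract_fn=None):
    """Extract the same HTML with and without fallback extractors.

    Hypothetical helper. extract_fn defaults to trafilatura.extract and is
    assumed to accept a no_fallback keyword argument.
    """
    if extract_fn is None:
        from trafilatura import extract as extract_fn  # lazy import

    common = dict(output_format="xml", include_comments=False)
    with_fallback = extract_fn(html, **common)
    without_fallback = extract_fn(html, no_fallback=True, **common)
    return {
        "with_fallback": with_fallback,
        "without_fallback": without_fallback,
        # True when the two runs disagree, hinting that a fallback took over
        "differs": with_fallback != without_fallback,
    }
```

If `differs` is true and the `without_fallback` result looks correct, the main extractor is probably fine for that page and the difference comes from a fallback, which matches the diagnosis above.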