fortyfourforty closed 5 months ago
Example url: https://www.dummies.com/article/home-auto-hobbies/home-improvement-appliances/electrical/how-to-replace-a-light-switch-185346/
Command: trafilatura.extract(page_source, output_format='xml', include_comments=False)
Problem: The output does not read like regular XML.
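For anyone trying to reproduce this, here is a minimal sketch that runs the same command and checks the result. It assumes that "not reading like regular XML" can at least partly be tested by asking whether the output parses as well-formed XML; the report does not say that this is the actual failure. The helper names `is_well_formed_xml` and `reproduce` are made up for illustration and are not part of trafilatura.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if `text` parses as a single well-formed XML document."""
    if not text:  # covers None and the empty string
        return False
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def reproduce(url):
    """Fetch `url`, run the command from the report, return (output, well_formed)."""
    # Third-party dependency, imported here so the helper above works without it.
    import trafilatura

    page_source = trafilatura.fetch_url(url)
    if page_source is None:  # download failed
        return None, False
    output = trafilatura.extract(page_source, output_format='xml', include_comments=False)
    return output, is_well_formed_xml(output)
```

Calling `reproduce(...)` with the example URL above returns the raw output together with a flag for well-formedness. A well-formed result can still contain the wrong content, so the output itself should be read as well.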
Yes, there is something wrong with the extraction here.
The main extractor is not affected; readability_lxml extracts the wrong content. I will implement a quick fix.