adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Refactor code to provide a "keep-tags" option #52

Open adbar opened 3 years ago

adbar commented 3 years ago

Feature requests such as #38 and #48 deal with the inclusion of particular HTML elements in the output.

To allow for easier inclusion and less hacky code it would be best to

adbar commented 3 years ago

`<span>` tags are a typical candidate, since they can carry formatting information, for example `<span color='blue'>`.
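
To make the idea concrete, here is a minimal sketch of what a "keep-tags" filter could look like. It is not trafilatura's code or API: `KeepTagsFilter`, `filter_markup`, and the `keep_tags` parameter are hypothetical names chosen for illustration, and the sketch uses only the standard library's `html.parser` rather than the library's own extraction pipeline. All markup is dropped except for the tags the caller asks to keep, which are re-emitted with their attributes.

```python
from html import escape
from html.parser import HTMLParser


class KeepTagsFilter(HTMLParser):
    """Hypothetical helper: drop all markup except tags listed in keep_tags."""

    def __init__(self, keep_tags):
        super().__init__(convert_charrefs=True)
        self.keep_tags = frozenset(keep_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Re-emit the opening tag, with its attributes, only if it is whitelisted.
        if tag not in self.keep_tags:
            return
        rendered = ""
        for name, value in attrs:
            rendered += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.parts.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text content is always kept, escaped so the output stays valid markup.
        self.parts.append(escape(data, quote=False))


def filter_markup(html, keep_tags=()):
    """Return the text of `html`, preserving only the tags in `keep_tags`."""
    parser = KeepTagsFilter(keep_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = "<div><p>Hello <span color='blue'>world</span></p></div>"
    print(filter_markup(sample, keep_tags={"span"}))
    print(filter_markup(sample))
```

With a single whitelist parameter, supporting another element would mean adding its name to `keep_tags` rather than writing a special case per tag, which is the kind of "less hacky" inclusion the issue asks for.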