Add options to minimize parsed html text

deanmalmgren / textract

extract text from any document. no muss. no fuss.

http://textract.readthedocs.io

MIT License

3.89k stars, 599 forks

Add options to minimize parsed html text #354

Open aleks-v-k opened 4 years ago

aleks-v-k commented 4 years ago

This PR adds options to minimize the text extracted from HTML:

- merge runs of consecutive space characters into a single one
- remove table formatting.

The main reason for these changes: the parser gets OOM-killed on some large HTML files (they contain a lot of spaces plus a table).
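
To illustrate the two options described above, here is a minimal sketch using only the Python standard library. It is not the code from this PR and not textract's actual HTML parser; the names `collapse_whitespace`, `PlainTextExtractor`, and `minimized_html_text` are hypothetical and chosen only for this example.

```python
import re
from html.parser import HTMLParser

# Matches any run of whitespace (spaces, tabs, newlines).
_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text):
    """Merge every run of whitespace into one space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


class PlainTextExtractor(HTMLParser):
    """Hypothetical extractor: keeps text nodes, ignores table layout."""

    _SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Table cells are treated like any other text: no column padding
        # or border characters are generated.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return collapse_whitespace(" ".join(self._chunks))


def minimized_html_text(html):
    """Return the text of an HTML string with whitespace minimized."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because no table is rendered and whitespace runs shrink to one character, the output stays roughly proportional to the amount of real text in the input. One trade-off of this sketch: joining text nodes with a space also separates words that are split by inline tags (for example `foo<b>bar</b>` becomes `foo bar`).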

Would you like to accept such changes? If yes, then I will add tests to cover the new code.

traverseda commented 3 years ago

Hello, I've recently been made a maintainer of this project. I'd be interested in these changes. I'd also be interested in a selectolax-based text extractor if you were feeling adventurous.