deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

epub parser: separate text blocks of logical elements by "Form Feed" #327

Open workflowsguy opened 4 years ago

workflowsguy commented 4 years ago

When the epub parser combines the text read from the individual book elements of an epub file, those elements are currently separated only by a '\n' character.

I suggest separating them by a '\f' character instead. This would be analogous to the current text extraction from PDF files, where the logical elements (there, the individual pages) are also separated by a Form Feed.

This would preserve at least some of the original file's structure in the resulting txt file and thus make it possible to parse the logical structure afterwards.
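
To illustrate the difference, here is a minimal sketch in plain Python. It is not textract's actual parser code: `join_elements` and `split_elements` are hypothetical helper names, and the sample chapter strings are made up. It only shows why a '\f' separator lets a consumer recover the element boundaries while '\n' does not.

```python
FORM_FEED = "\f"


def join_elements(element_texts, separator=FORM_FEED):
    """Hypothetical helper: combine the text of each book element."""
    return separator.join(element_texts)


def split_elements(extracted_text, separator=FORM_FEED):
    """Hypothetical helper: recover the elements from the combined text."""
    return extracted_text.split(separator)


# Made-up sample data: each element already contains newlines of its own.
chapters = [
    "Chapter 1\nIt was a dark night.",
    "Chapter 2\nMorning came.",
]

# With '\f' the original elements can be recovered exactly.
with_form_feed = join_elements(chapters)
recovered = split_elements(with_form_feed)

# With '\n' the element boundaries are indistinguishable from
# ordinary line breaks inside an element.
with_newline = join_elements(chapters, separator="\n")
ambiguous = split_elements(with_newline, separator="\n")
```

In this sketch `recovered` equals the original list of two chapters, whereas `ambiguous` contains four fragments, because the newlines inside each chapter are split as well.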