CentreForDigitalHumanities / tscan

T-scan: an analysis tool for Dutch texts to assess the complexity of the text, based on original work by Rogier Kraf
GNU Affero General Public License v3.0

Paragraph breaks in word to txt conversion #64

Open lukavdplas opened 1 year ago

lukavdplas commented 1 year ago

There are cases where a Word document shows paragraph breaks as a blank line, but the resulting txt file uses a single \n character, so there are no blank lines between paragraphs. Using \n\n would match the original document better.

This seems to happen when Word creates the blank line between paragraphs through styling rather than an extra return, so the document contains only a single return per paragraph.

Replacing every \n in the output with \n\n is not an acceptable solution, as this would cause unexpected results.
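
One possible direction, as a rough sketch only: decide per paragraph boundary whether to write \n or \n\n, based on whether styling adds spacing there. The `Para` class and `join_paragraphs` function below are hypothetical and not part of T-scan; the sketch assumes the converter can read the spacing before and after each paragraph from the Word file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Para:
    """Hypothetical stand-in for one paragraph extracted from a Word file."""
    text: str
    space_before: float = 0.0  # spacing added by styling, in points (assumed available)
    space_after: float = 0.0


def join_paragraphs(paras: List[Para]) -> str:
    """Join paragraphs, writing a blank line where styling shows a visual gap."""
    pieces = []
    for i, para in enumerate(paras):
        pieces.append(para.text)
        if i == len(paras) - 1:
            break
        nxt = paras[i + 1]
        styled_gap = para.space_after > 0 or nxt.space_before > 0
        # An empty paragraph already produces a blank line by itself, so the
        # break is only doubled between two non-empty paragraphs.
        if styled_gap and para.text and nxt.text:
            pieces.append("\n\n")
        else:
            pieces.append("\n")
    return "".join(pieces)


if __name__ == "__main__":
    styled = [Para("First.", space_after=8), Para("Second.")]
    print(repr(join_paragraphs(styled)))
```

Documents that already separate paragraphs with an empty paragraph keep their single blank line, and documents without any paragraph spacing keep their single \n, which is the difference from a blanket replacement.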