deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Text cut every 80 characters in .doc files #367

Open vesran opened 3 years ago

vesran commented 3 years ago

I am trying to read a .doc file using textract, but a \n is inserted every 80 characters when the document is read, even though it is not in the document.

To Reproduce

import textract

file = "dev/test_textract_80.doc"  # path to file
text = textract.process(file).decode('utf-8')

where the .doc file contains 0_1_2_3_4_..._98_99_ (every number from 0 to 99 separated with an underscore)
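For anyone who wants to recreate the test content, here is a minimal sketch that builds the string described above. The function name `build_test_content` is hypothetical, and the result still has to be pasted into a .doc file with a word processor; this snippet does not write .doc files.

```python
def build_test_content(n=100):
    """Return every number from 0 to n-1, each followed by an underscore."""
    return "".join(f"{i}_" for i in range(n))


content = build_test_content()
print(content[:20])   # 0_1_2_3_4_5_6_7_8_9_
print(len(content))   # 290
```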

Expected behavior

Expected: text -> 0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30...

Current output: text -> 0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28\n_29_30... (notice the \n before _29)
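A small diagnostic can show exactly where the break falls. The sketch below rebuilds the start of the output as quoted above and measures each line; the helper name `line_lengths` is hypothetical. On the quoted output, the first line comes out at 76 characters rather than exactly 80.

```python
def line_lengths(text):
    """Return the length of each line in text, splitting on newline characters."""
    return [len(line) for line in text.split("\n")]


# Start of the output as quoted in this report: "0_1_..._28", then the
# unexpected newline, then "_29_30".
reported = "".join(f"{i}_" for i in range(29))[:-1] + "\n" + "_29_30"
print(line_lengths(reported))  # [76, 6]
```

Running the same helper on the real `text` returned by `textract.process` would show whether every line is wrapped at the same width.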


Additional context

The only workaround I found was to edit /textract/parsers/doc_parser.py as mentioned here
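As a hedged sketch of an alternative that leaves the installed package untouched: assuming the wrapping comes from the `antiword` command-line tool that textract uses for .doc files, antiword's `-w` option sets the text line width, and a width of 0 puts each paragraph on a single line. The helper names `antiword_command` and `extract_doc_text` are hypothetical, and this is not necessarily the same change as the workaround referred to above.

```python
import subprocess


def antiword_command(path, width=0):
    """Build an antiword command line; width 0 asks for one line per paragraph."""
    return ["antiword", "-w", str(width), path]


def extract_doc_text(path):
    """Run antiword directly (assumes the antiword binary is on PATH)."""
    result = subprocess.run(antiword_command(path), capture_output=True, check=True)
    # Assumption: the output is UTF-8; undecodable bytes are replaced, not raised.
    return result.stdout.decode("utf-8", errors="replace")


print(antiword_command("dev/test_textract_80.doc"))
```

Calling `extract_doc_text("dev/test_textract_80.doc")` should then return the text without the inserted line breaks, provided the assumption about antiword holds.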