Open vesran opened 3 years ago
I am trying to read a .doc file using textract, but a \n is inserted every 80 characters when the document is read, even though it is not in the document.
To Reproduce
import textract
file = "dev/test_textract_80.doc"  # path to file
text = textract.process(file).decode('utf-8')
where the .doc file contains 0_1_2_3_4_..._98_99_ (every number from 0 to 99, separated with an underscore).
Expected behavior
Expected: text -> 0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30...
Current output: text -> 0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28\n_29_30... (notice the \n before _29)
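To make the difference between the expected and current output easy to check, a small helper can list where line breaks occur in the extracted text. This is only a sketch: `newline_offsets` is a hypothetical helper name, and the "reported" string is simulated from the output shown above rather than read from a real .doc file.

```python
def newline_offsets(text):
    """Return the index of every newline character in text."""
    return [i for i, ch in enumerate(text) if ch == "\n"]


# Content of the test document: 0_1_2_..._98_99_
expected = "".join(f"{n}_" for n in range(100))

# Simulated version of the reported output, with a line break before "_29".
# With a real file this string would come from:
#     textract.process(file).decode('utf-8')
reported = expected.replace("_29_", "\n_29_", 1)

print(newline_offsets(expected))  # no line breaks in the source content
print(newline_offsets(reported))  # offsets of the unexpected line breaks
```

Running the same helper on the real textract output should show whether the breaks fall at a regular interval.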
Desktop
Additional context
The only workaround I found was to edit /textract/parsers/doc_parser.py, as mentioned here.
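The exact edit to doc_parser.py is not shown in this report. As a possible alternative that leaves the installed package untouched, the sketch below calls the antiword command-line tool directly. It assumes that textract hands .doc files to antiword and that the inserted line breaks come from antiword's default line width. antiword's `-w` option sets the line width, and a width of 0 puts each paragraph on a single line. The function names `build_antiword_command` and `read_doc` are hypothetical, and the UTF-8 decoding is an assumption about antiword's output on the system in use.

```python
import subprocess


def build_antiword_command(path, width=0):
    """Build the antiword command line; width=0 disables line wrapping."""
    return ["antiword", "-w", str(width), path]


def read_doc(path, width=0):
    """Extract text from a .doc file by running antiword directly.

    Requires the antiword binary to be on PATH.
    """
    result = subprocess.run(
        build_antiword_command(path, width),
        capture_output=True,
        check=True,  # raise CalledProcessError if antiword fails
    )
    # Assumption: antiword emits UTF-8 in this environment.
    return result.stdout.decode("utf-8")
```

Usage would then be `text = read_doc("dev/test_textract_80.doc")` in place of the `textract.process` call.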