Closed mattvv closed 11 years ago
We simply delegate to pdftotext in order to extract text from documents that don't require OCR. Do you have a different result on this particular document using pdftotext yourself?
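For anyone wanting to try that comparison, here is a minimal sketch of calling pdftotext directly from Ruby. It assumes pdftotext (from Poppler or Xpdf) is on the PATH; the helper names and the file name godot.pdf are made up for illustration and are not part of Docsplit. The -layout flag asks pdftotext to keep the physical layout of the page, which may or may not change how line breaks come out for a given document.

```ruby
require "open3"

# Build the argument list for a direct pdftotext run.
# "-layout" keeps the physical page layout; the trailing "-" writes to stdout.
# (Hypothetical helper, not part of Docsplit.)
def pdftotext_command(pdf_path, layout: true)
  cmd = ["pdftotext", "-enc", "UTF-8"]
  cmd << "-layout" if layout
  cmd + [pdf_path, "-"]
end

# Run pdftotext and return the extracted text.
# Raises if the command exits with a non-zero status.
def extract_with_pdftotext(pdf_path, layout: true)
  out, err, status = Open3.capture3(*pdftotext_command(pdf_path, layout: layout))
  raise "pdftotext failed: #{err}" unless status.success?
  out
end

# Example usage (godot.pdf is a placeholder file name):
#   text = extract_with_pdftotext("godot.pdf")
#   puts "#{text.count("\n")} newline characters found"
```

If the text returned here has the line breaks and the Docsplit output does not, the difference is probably in how Docsplit invokes or post-processes pdftotext rather than in the document itself.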
Cleaning up issues and closing this for lack of activity.
I'm having an issue using extract_text on a .docx or .pdf file. It looks like the parser removes the newlines when reading in the document. Is there any setting to ensure they are written to the new .txt file? I've tried :clean => false with no luck.
Example:
ESTRAGON: (giving up again). Nothing to be done.
VLADIMIR: (advancing with short, stiff strides, legs wide apart).
Converts to: ESTRAGON: (giving up again). Nothing to be done. VLADIMIR: (advancing with short, stiff strides, legs wide apart).
Expected Result: There should be \n where the line breaks are.
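A quick way to confirm whether the newlines are really gone from the output file, rather than just hidden by whatever is displaying it, is to count them. The sketch below is only an illustration: newline_report is a hypothetical helper, and the two strings are built by hand from the example above, not produced by Docsplit.

```ruby
# Summarize the line breaks in a piece of extracted text.
# (Hypothetical helper for debugging, not part of Docsplit.)
def newline_report(text)
  { newlines: text.count("\n"), lines: text.lines.length }
end

# The text as expected, with one line per speaker.
expected = "ESTRAGON: (giving up again). Nothing to be done.\n" \
           "VLADIMIR: (advancing with short, stiff strides, legs wide apart).\n"

# The text as reported in this issue: the same content with line breaks flattened to spaces.
flattened = expected.gsub("\n", " ").strip

puts newline_report(expected).inspect
puts newline_report(flattened).inspect
```

Running the same report on the contents of the generated .txt file (for example via File.read) would show which of the two cases it matches.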