documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

extract_text ignores new lines #39

Closed mattvv closed 11 years ago

mattvv commented 12 years ago

I'm having an issue using extract_text on a .docx or .pdf file. It looks like the parser removes the new lines when it reads in the document. Is there any setting to ensure they are preserved in the new txt file? I've tried :clean => false with no luck.

Example:

ESTRAGON: (giving up again). Nothing to be done.
VLADIMIR: (advancing with short, stiff strides, legs wide apart).

Converts to: ESTRAGON: (giving up again). Nothing to be done. VLADIMIR: (advancing with short, stiff strides, legs wide apart).

Expected Result: There should be \n where the line breaks are.

jashkenas commented 12 years ago

We simply delegate to pdftotext in order to extract text from documents that don't require OCR. Do you have a different result on this particular document using pdftotext yourself?
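For anyone wanting to run that comparison, here is a rough Ruby sketch that calls pdftotext directly and counts the line breaks it returns, once with default settings and once with pdftotext's -layout flag. The helper names (pdftotext_argv, extract_with_pdftotext, newline_count) are made up for illustration and are not part of Docsplit, and this thread does not say which flags Docsplit itself passes, so treat it as a diagnostic aid rather than a reproduction of Docsplit's behavior. It assumes pdftotext is installed and on the PATH.

```ruby
require "open3"

# Build the argv for a direct pdftotext run; the trailing "-" sends text to stdout.
# layout: true adds pdftotext's -layout flag, which keeps the physical page layout.
# (Hypothetical helper, not part of Docsplit.)
def pdftotext_argv(pdf_path, layout: false)
  argv = ["pdftotext", "-enc", "UTF-8"]
  argv << "-layout" if layout
  argv + [pdf_path, "-"]
end

# Run pdftotext and return the extracted text, raising if the tool fails.
def extract_with_pdftotext(pdf_path, layout: false)
  text, status = Open3.capture2(*pdftotext_argv(pdf_path, layout: layout))
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  text
end

# Count line breaks so two extractions can be compared at a glance.
def newline_count(text)
  text.count("\n")
end

# Only runs when a PDF path is given on the command line.
if __FILE__ == $PROGRAM_NAME && ARGV.first
  pdf = ARGV.first
  puts "default: #{newline_count(extract_with_pdftotext(pdf))} newlines"
  puts "-layout: #{newline_count(extract_with_pdftotext(pdf, layout: true))} newlines"
end
```

If both runs report line breaks but the text file produced by extract_text has none, the difference is probably introduced after pdftotext runs. If pdftotext itself returns a single line, the problem lies in the PDF or in pdftotext rather than in Docsplit.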

knowtheory commented 11 years ago

Cleaning up issues and closing this for lack of activity.