deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Missing carriage returns in PDF #307

Open Overdrivr opened 5 years ago

Overdrivr commented 5 years ago

Describe the bug: When parsing a PDF file (produced from Google Docs), the carriage returns are missing. Do you know of any workaround for this?

To reproduce:

import textract

# No CR
print(textract.process('Test document 2.pdf'))

# Same document in docx format
# CR are properly found
print(textract.process('Test document 2.docx'))

Test document 2.docx

Test document 2.pdf

Expected behavior: I expect to find CRs in the PDF output as well.


jpweytjens commented 5 years ago

I get the following output for both files, using both available PDF extraction methods. There are no carriage returns, but I do get newlines, which I think is what you want?

>>> import textract

>>> textract.process("test.docx").decode("utf8")
'Test document\n\n\n\nThis document is for testing.\n\n\n\nCheers'
>>> textract.process("test.pdf", method="pdftotext").decode("utf8")
'Test document\nThis document is for testing.\nCheers\n\n\x0c'
>>> textract.process("test.pdf", method="pdfminer").decode("utf8")
'Test document \n \nThis document is for testing. \n \nCheers \n \n \n\n\x0c'

Textract is only a wrapper around other tools that do the actual parsing. If something is wrong with the PDF parsing output, the issue usually lies with the parsing tools and not with textract. Although textract isn't parsing the PDF file itself, I do know that extracting the layout from a PDF (line breaks, positioning, ...) is a nontrivial task. That being said, I wonder what your expected behaviour would have been for the output above?
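Since textract delegates to external tools, one way to see which layer drops the newlines is to call pdftotext directly and compare its output with textract's. The following is a minimal sketch, assuming the `pdftotext` binary is on the PATH; the helper names `build_pdftotext_cmd` and `run_pdftotext` are made up for illustration and are not part of textract. Passing `-` as the output file sends the text to stdout, and `-layout` asks pdftotext to keep the physical layout of the page.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, layout=False):
    """Build the argument list for a direct pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd += [pdf_path, "-"]     # "-" writes the extracted text to stdout
    return cmd


def run_pdftotext(pdf_path, layout=False):
    """Run pdftotext and return its raw stdout bytes, or None if it is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Printing `repr(run_pdftotext("test.pdf"))` shows the line endings exactly as the tool emitted them, before textract is involved at all.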

Overdrivr commented 5 years ago

Thanks for the feedback. Indeed, you're right: I meant newlines, not carriage returns. That's interesting, because I was not getting the newlines at all. I was getting this:

print(textract.process('Test document 2.pdf'))
'Test document This document is for testing. Cheers  \x0c'

But maybe it's because I did not provide a method like you did. I'll give it a shot tonight and let you know.

jpweytjens commented 5 years ago

The method shouldn't matter; I just tried both to see whether this issue is specific to one of the available methods. The default method is pdftotext, and textract falls back to pdfminer if it isn't available.
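To check whether a problem is specific to one backend, each method can be tried in turn and the results collected side by side. This is only a sketch: `compare_methods` is a hypothetical helper, and the `process` parameter exists solely so the function can be exercised without textract installed. By default it calls `textract.process` with the `method` keyword used earlier in this thread.

```python
def compare_methods(pdf_path, methods=("pdftotext", "pdfminer"), process=None):
    """Return {method: decoded text or error description} for each method."""
    if process is None:
        import textract  # imported lazily so the helper can be tested without it
        process = textract.process
    results = {}
    for method in methods:
        try:
            results[method] = process(pdf_path, method=method).decode("utf8")
        except Exception as exc:  # a backend may be missing or may crash
            results[method] = "ERROR: {}: {}".format(type(exc).__name__, exc)
    return results
```

Printing `repr(text)` for each entry of `compare_methods("test.pdf")` makes differences in `\n`, `\r\n`, and `\x0c` visible at a glance.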

Do be aware that textract returns bytes objects, not str objects. Printing a bytes object shows the escape sequence \n, while printing the decoded output (decoded as in my previous reply) prints actual line breaks.

>>> import textract
>>> print(textract.process("test.pdf"))
b'Test document\nThis document is for testing.\nCheers\n\n\x0c'
>>> print(textract.process("test.pdf").decode("utf8"))
Test document
This document is for testing.
Cheers
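Building on the decoding step, the outputs in this thread also contain a trailing form feed (`\x0c`) and, on Windows, `\r\n` line endings. A small helper can decode and normalise all of that in one place. This is a sketch under the assumption that the output is UTF-8; `normalize_extracted` is a hypothetical name, not a textract function.

```python
def normalize_extracted(raw, encoding="utf8"):
    """Decode extractor output and normalise line endings (hypothetical helper)."""
    text = raw.decode(encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = text.replace("\x0c", "")                         # drop form feeds (page breaks)
    return text.rstrip("\n")                                # drop trailing blank lines


sample = b"Test document\nThis document is for testing.\nCheers\n\n\x0c"
print(normalize_extracted(sample).split("\n"))
# → ['Test document', 'This document is for testing.', 'Cheers']
```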

Overdrivr commented 5 years ago

That's quite weird, I can reproduce the issue:

text_A = textract.process(r'C:\Users\Remi\Downloads\Test document 2.docx')
text_B = textract.process(r'C:\Users\Remi\Downloads\Test document 2.pdf')
print(text_A)
print(text_B)
b'Test document\n\n\n\nThis document is for testing.\n\n\n\nCheers'
b'Test document This document is for testing. Cheers\r\n\r\n\x0c'

I tried passing method="pdftotext" and got the same result. Passing method="pdfminer" crashes with the following error:

Traceback (most recent call last):
  File "compare_app\services.py", line 27, in <module>
    text_B = textract.process(r'C:\Users\Remi\Downloads\Test document 2.pdf', method="pdfminer")
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
    return self.extract_pdfminer(filename, **kwargs)
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
    stdout, _ = self.run(['pdf2txt.py', filename])
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\utils.py", line 96, in run
    stdout, stderr = pipe.communicate()
UnboundLocalError: local variable 'pipe' referenced before assignment

Are you running on Windows or Linux?

jpweytjens commented 5 years ago

I get the same results on Windows 10 as on Linux (WSL). I did notice the same issue with the pdfminer method on Windows. There's an issue discussing it in great detail, #154, and I posted the start of a solution there. The full solution will be included in the next version of textract.
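An `UnboundLocalError` like the one in the traceback is the typical symptom of a variable that is assigned inside a `try` block whose exception was swallowed. The sketch below reproduces that general pattern and shows one way to avoid it. It is an illustration written for this thread, not textract's actual code and not the fix discussed in #154; `run_buggy`, `run_fixed`, and the command name are hypothetical.

```python
import subprocess


def run_buggy(args):
    """Swallows the OSError, so `pipe` may never be assigned."""
    try:
        pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError:
        pass  # error swallowed: `pipe` stays unassigned
    return pipe.communicate()  # raises UnboundLocalError if Popen failed


def run_fixed(args):
    """Re-raises any failure to start the process, with some context."""
    try:
        pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as exc:
        raise RuntimeError("could not start {!r}: {}".format(args[0], exc)) from exc
    return pipe.communicate()
```

With a command that cannot be started, `run_buggy` fails with the misleading `UnboundLocalError`, while `run_fixed` reports which program could not be launched and why.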

Which version of pdftotext are you using? The latest version that is easily installed on Windows is included in the command line tools from Xpdfreader. Could you try installing these and see if it changes the output?
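To answer the version question, it helps to see which `pdftotext` executable is actually found on the PATH and what banner it prints. This is a rough sketch: it assumes the installed build accepts `-v` to print version information (commonly written to stderr), and `first_line` and `pdftotext_info` are hypothetical helper names.

```python
import shutil
import subprocess


def first_line(text):
    """Return the first non-empty line of text, or '' if there is none."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def pdftotext_info():
    """Return (path, version banner) for the pdftotext on PATH, or None if absent."""
    path = shutil.which("pdftotext")
    if path is None:
        return None
    # Capture both streams; the exit code is deliberately not checked.
    result = subprocess.run(
        [path, "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return path, first_line(result.stderr + result.stdout)
```

Calling `print(pdftotext_info())` on both machines would show whether they run the same build of the tool.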