Open Overdrivr opened 5 years ago
I get the following output for both files, using both pdf extraction methods available. There are no carriage returns, but I do get newlines which I think is what you want?
>>> import textract
>>> textract.process("test.docx").decode("utf8")
'Test document\n\n\n\nThis document is for testing.\n\n\n\nCheers'
>>> textract.process("test.pdf", method="pdftotext").decode("utf8")
'Test document\nThis document is for testing.\nCheers\n\n\x0c'
>>> textract.process("test.pdf", method="pdfminer").decode("utf8")
'Test document \n \nThis document is for testing. \n \nCheers \n \n \n\n\x0c'
Textract is only a wrapper around other tools that do the actual parsing. If something is wrong with the PDF parsing output, the issue usually lies with the parsing tools and not textract. While textract isn't parsing the PDF file itself, I do know that extracting the layout from a PDF (such as line breaks, positioning, ...) is a non-trivial task. That being said, I wonder what your expected behaviour would have been in the output above?
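If the goal is to compare the text of the two files, the three outputs above differ only in whitespace, so one option is to normalise them before comparing. A minimal sketch, assuming paragraph breaks don't matter for the comparison (normalize_extracted is a made-up helper name, not part of textract):

```python
def normalize_extracted(text: str) -> str:
    """Reduce backend-specific whitespace so outputs can be compared."""
    # Unify Windows line endings and drop the form feed some tools append.
    text = text.replace("\r\n", "\n").replace("\x0c", "")
    # Strip each line and discard the ones that end up empty.
    # Note: this also throws away blank lines between paragraphs.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


# The three decoded outputs quoted in this thread.
docx_out = 'Test document\n\n\n\nThis document is for testing.\n\n\n\nCheers'
pdftotext_out = 'Test document\nThis document is for testing.\nCheers\n\n\x0c'
pdfminer_out = ('Test document \n \nThis document is for testing. \n \n'
                'Cheers \n \n \n\n\x0c')

print(normalize_extracted(docx_out) == normalize_extracted(pdftotext_out))
print(normalize_extracted(docx_out) == normalize_extracted(pdfminer_out))
```

This only helps when the newlines are present in the first place; it cannot recover line breaks that the parsing tool never emitted.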
Thanks for the feedback. Indeed, you're right: I meant newlines, not carriage returns. That's interesting, because I was not getting the newlines at all. I was getting this:
print(textract.process('Test document 2.pdf'))
'Test document This document is for testing. Cheers \x0c'
But maybe it's because I did not provide a method like you did. I'll give it a shot tonight and let you know.
The method shouldn't matter; I just tried both to see if this issue is specific to one of the available methods. The standard method is pdftotext, and textract falls back to pdfminer if it isn't available.
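To rule out the fallback silently picking a different tool, it can help to request each method explicitly and look at the results side by side. A rough sketch: extract_with_each_method and its extractor parameter are names I made up for illustration, and the only textract call used is textract.process(path, method=...) as shown earlier in this thread.

```python
def extract_with_each_method(path, methods=("pdftotext", "pdfminer"),
                             extractor=None):
    """Run the extractor once per method and collect text or the error."""
    if extractor is None:
        # Imported lazily so the helper can be exercised without textract.
        import textract
        extractor = textract.process

    results = {}
    for method in methods:
        try:
            results[method] = extractor(path, method=method).decode("utf8")
        except Exception as exc:
            # Keep the exception instead of aborting, so one broken
            # backend does not hide the output of the other.
            results[method] = exc
    return results


# Example usage (needs textract and a real file):
# for name, result in extract_with_each_method("test.pdf").items():
#     print(name, repr(result))
```

Printing with repr makes the difference between a space, "\n" and "\r\n" visible at a glance.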
Do be aware that textract returns bytes objects and not str objects. Printing a bytes object shows \n, while printing the decoded output, as in my previous reply, would print newlines instead of \n.
>>> import textract
>>> print(textract.process("test.pdf"))
b'Test document\nThis document is for testing.\nCheers\n\n\x0c'
>>> print(textract.process("test.pdf").decode("utf8"))
Test document
This document is for testing.
Cheers
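A quick way to tell whether the newlines are really missing, rather than just displayed as \n, is to decode and count the lines. A small sketch using the two outputs quoted in this thread as literals (line_count is a hypothetical helper, and I am assuming the earlier one-line output was the full content of the result):

```python
def line_count(raw: bytes) -> int:
    """Count non-empty lines in an extraction result."""
    text = raw.decode("utf8").replace("\x0c", "")
    return len([line for line in text.splitlines() if line.strip()])


with_newlines = b'Test document\nThis document is for testing.\nCheers\n\n\x0c'
without_newlines = b'Test document This document is for testing. Cheers \x0c'

# repr() of a bytes object shows the escape sequence, not a line break.
print(repr(with_newlines))

print(line_count(with_newlines))     # 3
print(line_count(without_newlines))  # 1
```

If the count is 1, the parsing tool really did join the paragraphs with spaces, and the bytes versus str display is not the cause.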
That's quite weird, I can reproduce the issue:
text_A = textract.process(r'C:\Users\Remi\Downloads\Test document 2.docx')
text_B = textract.process(r'C:\Users\Remi\Downloads\Test document 2.pdf')
print(text_A)
print(text_B)
b'Test document\n\n\n\nThis document is for testing.\n\n\n\nCheers'
b'Test document This document is for testing. Cheers\r\n\r\n\x0c'
I tried passing method=pdftotext and got the same result. And method=pdfminer crashes with the following error:
Traceback (most recent call last):
  File "compare_app\services.py", line 27, in <module>
    text_B = textract.process(r'C:\Users\Remi\Downloads\Test document 2.pdf', method="pdfminer")
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
    return self.extract_pdfminer(filename, **kwargs)
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
    stdout, _ = self.run(['pdf2txt.py', filename])
  File "C:\Users\Remi\.virtualenvs\compareanything-nxT9K_aM\lib\site-packages\textract\parsers\utils.py", line 96, in run
    stdout, stderr = pipe.communicate()
UnboundLocalError: local variable 'pipe' referenced before assignment
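For anyone wondering how an UnboundLocalError can come out of a subprocess call: one common pattern is a try/except that handles only one kind of launch failure and lets every other one fall through. The sketch below is a guess at that pattern based on the traceback, not a copy of textract's code; run, launch and failing_launch are stand-in names.

```python
import errno


def run(launch):
    """Mimic a wrapper that handles only one kind of launch failure."""
    try:
        pipe = launch()
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            raise RuntimeError("command not found") from exc
        # Any other OSError is swallowed here, so 'pipe' stays unbound.
    return pipe  # raises UnboundLocalError if launch() failed above


def failing_launch():
    # Stand-in for a process that exists but cannot be executed.
    raise OSError(errno.ENOEXEC, "not a valid executable")


try:
    run(failing_launch)
except UnboundLocalError as exc:
    print(type(exc).__name__)  # UnboundLocalError
```

If that is what happens here, the real problem is the launch of pdf2txt.py failing, and the UnboundLocalError only masks it.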
Are you running on Windows or Linux?
I get the same results on Windows 10 as on Linux (WSL). I did notice the same issue with the pdfminer method on Windows. There's an issue discussing it in great detail, #154, and I posted the start of a solution there. The full solution will be included in the next version of textract.
Which version of pdftotext are you using? The latest version that is easily installed on Windows is included in the command line tools from Xpdfreader. Could you try installing these and see if it changes the output?
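To see which pdftotext binary is actually picked up from PATH and what version it reports, something like the following should work. It is a sketch that assumes the -v flag prints a version banner, which the builds I have used do (on stderr, so both streams are checked); pdftotext_version is just an illustrative name.

```python
import shutil
import subprocess
from typing import Optional


def pdftotext_version() -> Optional[str]:
    """Return the first line of the pdftotext version banner, or None."""
    exe = shutil.which("pdftotext")
    if exe is None:
        return None  # not on PATH, so textract cannot use it either
    proc = subprocess.run([exe, "-v"], capture_output=True,
                          text=True, errors="replace")
    # The banner usually goes to stderr; fall back to stdout otherwise.
    banner = (proc.stderr or proc.stdout).strip()
    return banner.splitlines()[0] if banner else ""


print(shutil.which("pdftotext"))
print(pdftotext_version())
```

The path printed first shows which installation is being used when several are present.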
Describe the bug When parsing a PDF file (produced from Google Docs), the carriage returns are missing. Do you know of any workaround for this?
To Reproduce Steps to reproduce the behavior:
Test document 2.docx
Test document 2.pdf
Expected behavior I expect to find CR.