Unable to extract text from PDF generated by Word.

msuiche commented 1 year ago

I saved a Word document as a PDF, and when I try to extract the text I get the following errors:

[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }

And the output content looks like this:

"R\n\"\n\"\n.0$((\" A*\" &1\" $++&’&51\" ’5\" $((\" 5’0*,\" ,*-*+&*.\" $2$&($A(*\" $’\" ($>\" 5,\" &1\" *K/&’=F\" ./A]*:’\" ’5\" $1=\" *Q%,*..\" \n*Q:(/.&51.\"5,\"(&-&’$’&51. \n\"\n&1\"’0&.\"B3,**-*1’\"’5\"’0*\":51’,$,=C \n\"\n\"\n\"\n!L\n\"\n!\n!’1$3’&%&><(&)’ \n+\n\"\n9*,2&:*\" ;,52&+*,\" .0$((\" +*4*1+F\" &1+*-1&4=\" $1+\" 05(+\" 0$,-(*..\" \n’0*\" #5-%$1=\" \n$1+\"\n&’. \n\"\n./A.&+&$,&*.F\" \n$44&(&$’*.F\" $1+\" ,*.%*:’&2*\" 544&:*,.F\" +&,*:’5,.F\" *-%(5=**.F\" $3*1’.F\" . \n/::*..5,.\" $1+\" %*,-&’’*+\" $..&31. \n\"\nG*$:0F\" $ \n\"\n7\n#5-%$1= \n\"\n@1+*-1&’**8H\" 4,5-\" $1+\" $3$&1.’\" $((\" (5..*.F\" +$-$3*.F\" (&$A&(&’&*.F\" +*4&:&*1:&*.F\" \n$:’&51.F\"]/+3-*1’.F\"&1’*,*.’F\"$>$,+.F\"%*1$(’&*.F\"4&1*.F\":5.’.\"5,\"*Q%*1.*.\"54\ (...)

I tried using pdfutil with the extract_text subcommand and I get the same errors. Any recommendations on steps I can take to debug the code and understand why parsing fails?

jrmuizel commented 1 year ago

Can you share the document?

dertuxmalwieder commented 1 year ago

Adding myself here. It looks like Word generates different PDFs.

; curl https://www.africau.edu/images/default/sample.pdf

%PDF-1.3
%����

1 0 obj
<<
...

Now, one generated with Word (original source URL):

; head ./Sozialismusvorstellungen-der-DKP.pdf
%����1.2
treamr /LZWDecode
 ��P�[������7�8����d6�+шҸ6�ׅ�1���m���T�#���̆�(��;:Pgf3Ft�l������=M�Y��i:`kA�s��,Ƞú����HO+�
                   WRgy�����-��<lAZ��
�̰�p�2�pb�.��Z#��2���
streamr /LZWDecode    ��v5�Ø�7�Ø�9B`¥%n@���
...

I imagine that there needs to be additional decoding?
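
One way to check this before handing a file to a parser is to look at the raw bytes. Below is a minimal sketch using only the Rust standard library; the function names `pdf_header_version` and `filters_mentioned` are hypothetical helpers made up for illustration and are not part of lopdf. It assumes a well-formed file carries a `%PDF-` marker near the start, and it only does a naive byte search for a few filter names rather than real parsing.

```rust
/// Search the first 1024 bytes for a `%PDF-` marker and return the version
/// string that follows it (for example "1.3"), or None if no marker is found.
/// Hypothetical helper for illustration; not a lopdf function.
fn pdf_header_version(bytes: &[u8]) -> Option<String> {
    const MARKER: &[u8] = b"%PDF-";
    let window = &bytes[..bytes.len().min(1024)];
    let start = window.windows(MARKER.len()).position(|w| w == MARKER)? + MARKER.len();
    // Collect the digits and dots that directly follow the marker.
    let version: String = window[start..]
        .iter()
        .take_while(|b| b.is_ascii_digit() || **b == b'.')
        .map(|&b| b as char)
        .collect();
    if version.is_empty() { None } else { Some(version) }
}

/// Return which of a few well-known stream filter names occur anywhere in the
/// raw bytes. This is a plain substring search, so it can report false
/// positives (for example, a name that appears inside compressed data).
fn filters_mentioned(bytes: &[u8]) -> Vec<&'static str> {
    const FILTERS: [&str; 3] = ["/FlateDecode", "/LZWDecode", "/ASCII85Decode"];
    FILTERS
        .iter()
        .copied()
        .filter(|name| bytes.windows(name.len()).any(|w| w == name.as_bytes()))
        .collect()
}
```

To try it on a real file, read it with `std::fs::read(path)` and pass the resulting bytes to both functions. If the second file really starts the way `head` displays it above, with no `%PDF-` marker, the header check would return `None`, which would suggest the file is damaged or wrapped in something else rather than just using an unsupported filter.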

thespooler commented 1 year ago

Similarly, docs generated with LibreOffice also seem not to work. For example, running this PDF from Richard Stallman's website through extract_text outputs this:

{"text":{"1":["","","","","","","","","","","","","!\"#$%&","\"#","’(\"#","\"#",")","*\"#","\"#+,","","!-./","\"#","0112\"#345267","\"#","","","*","8","","9",":","","","#","&",";","$(","&<<<<))","%","","8$=:3%","","","","’.>&","&<<<<","!015550575?.(!!\"@1",""]},"errors":[]}

jrmuizel commented 1 year ago

Both of those PDFs now work with https://github.com/jrmuizel/pdf-extract

kinxiel commented 1 year ago

I have also noticed that creating the PDF from Word with the Print option (Microsoft Print to PDF) versus exporting or saving the file as a PDF gives you two different types of PDF; the latter works fine.

Both of these types of PDF work fine with Python-based PDF libraries, though.

shivjm commented 1 year ago

Here’s another that won’t work, taken from https://old.cbic.gov.in/htdocs-cbec/customs/cs-act/notifications/notfns-2023/cs-nt2023/csnt44-2023.pdf: csnt44-2023.pdf. I can open it in Acrobat with no issues.

katzeprior commented 3 months ago

Anyone looking into fixing this?

Heinenen commented 3 months ago

@katzeprior the problem is probably the same as in #125, and it looks like that may be solved in the near future.

J-F-Liu / lopdf

Unable to extract text from PDF generated by Word. #217