Closed by msuiche 1 month ago
Can you share the document?
Adding myself here. It looks like Word generates different PDFs.
; curl https://www.africau.edu/images/default/sample.pdf
%PDF-1.3
%����
1 0 obj
<<
...
Now, one generated with Word (original source URL):
; head ./Sozialismusvorstellungen-der-DKP.pdf
%����1.2
treamr /LZWDecode
��P�[������7�8����d6�+шҸ6�ׅ�1���m���T�#���̆�(��;:Pgf3Ft�l������=M�Y��i:`kA�s��,Ƞú����HO+�
WRgy�����-��<lAZ��
�̰�p�2�pb�.��Z#��2���
streamr /LZWDecode ��v5�Ø�7�Ø�9B`¥%n@���
...
I imagine that there needs to be additional decoding?
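One way to check this without the terminal getting in the way is to read the raw bytes and list which stream filters the file declares. The sketch below is only an illustration: `sniff_pdf` is a hypothetical helper, not part of any library mentioned here, and it scans the bytes with regular expressions rather than parsing the PDF properly. The `uses_cr_only` flag is there because one possible reading of the `head` output above is that the file uses bare carriage returns as line endings, so the terminal overwrites `%PDF-` with the binary comment that follows; that is an assumption, not something confirmed in this thread.

```python
import re
from collections import Counter


def sniff_pdf(data: bytes) -> dict:
    """Report the header version and the stream filters named in raw PDF bytes.

    Hypothetical helper for illustration; it does not parse the PDF structure.
    """
    # The version marker should appear near the start of the file.
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])

    # Match both the single form (/Filter /Name) and the array form
    # (/Filter [/A /B]) and count every filter name found.
    filters = Counter()
    for match in re.finditer(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)", data):
        for name in re.findall(rb"/([A-Za-z0-9]+)", match.group(1)):
            filters[name.decode("ascii")] += 1

    head = data[:1024]
    return {
        "version": header.group(1).decode("ascii") if header else None,
        # True when the first KiB contains CR but no LF at all.
        "uses_cr_only": b"\r" in head and b"\n" not in head,
        "filters": dict(filters),
    }
```

Calling `sniff_pdf(open(path, "rb").read())` on the Word-generated file should show whether `LZWDecode` is the only filter in use. If it is, the extractor would need an LZW decoder before it can read the content streams at all.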
Similarly, docs generated with LibreOffice also seem not to work. For example, running this PDF from Richard Stallman's website through extract_text outputs this:
{"text":{"1":["","","","","","","","","","","","","!\"#$%&","\"#","’(\"#","\"#",")","*\"#","\"#+,","","!-./","\"#","0112\"#345267","\"#","","","*","8","","9",":","","","#","&",";","$(","&<<<<))","%","","8$=:3%","","","","’.>&","&<<<<","!015550575?.(!!\"@1",""]},"errors":[]}
Both of those PDFs now work with https://github.com/jrmuizel/pdf-extract
I have also noticed that Word produces two different types of PDF depending on whether you create the file with the Print option (Microsoft Print to PDF) or by exporting or saving it as a PDF. The latter works fine.
Both types work fine with Python-based PDF libraries, though.
Here’s another that won’t work, taken from https://old.cbic.gov.in/htdocs-cbec/customs/cs-act/notifications/notfns-2023/cs-nt2023/csnt44-2023.pdf: csnt44-2023.pdf. I can open it in Acrobat with no issues.
Anyone looking into fixing this?
@katzeprior the problem is probably the same as in #125, and it looks like that may be solved in the near future.
I saved a Word document as a PDF, and when I try to extract the text I get the following errors:
And the output content looks like this:
I tried using `pdfutil` with the `extract_text` subcommand and I get the same errors. Any recommendations on the steps I can take to debug the code and understand why parsing fails?
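As a first debugging step, it may help to check whether the content streams themselves decode cleanly, independent of the extractor. The following is a rough sketch under several assumptions: `inflate_streams` and `count_text_ops` are hypothetical helpers, the file is assumed to use `FlateDecode` (zlib) streams, and the scan is brute force rather than a real PDF parse. It will not decode `LZWDecode` streams like those in the Word file shown earlier, and it ignores encryption.

```python
import re
import zlib


def inflate_streams(data: bytes):
    """Yield (offset, decoded_bytes) for every stream that inflates with zlib.

    Hypothetical helper: tries zlib at each `stream` keyword and skips
    anything that is not a complete zlib stream.
    """
    # The lookbehind skips the `stream` inside `endstream`.
    for match in re.finditer(rb"(?<!end)stream\r?\n", data):
        decoder = zlib.decompressobj()
        try:
            body = decoder.decompress(data[match.end():])
        except zlib.error:
            continue  # not zlib data (for example LZW or an image filter)
        if decoder.eof:  # only accept streams that ended cleanly
            yield match.start(), body


def count_text_ops(content: bytes) -> int:
    """Count Tj and TJ text-showing operators in a decoded content stream."""
    return len(re.findall(rb"\)\s*Tj|\]\s*TJ", content))
```

If the decoded streams contain `Tj`/`TJ` operators but the extracted text is still garbage, the problem is more likely in how character codes are mapped through the font encoding than in stream decoding. If nothing inflates at all, the file probably uses a filter this sketch does not cover.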