claird / PyPDF4

A utility to read and write PDFs with Python
obsolete-https://pythonhosted.org/PyPDF2/

Invalid literal for int() ... for a PDF downloaded from a Google Docs spreadsheet #86

Open cadu-leite opened 3 years ago

cadu-leite commented 3 years ago

The PDF file is attached: pdf_sample_googlesheet_pages_02.pdf

traceback:

  File "/usr/local/lib/python3.8/site-packages/PyPDF2/pdf.py", line 1599, in getObject
    idnum, generation = self.readObjectHeader(self.stream)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'F-1.4'


Code with a test included can be seen in this repo (merge2pdf).

pubpub-zz commented 3 years ago

Hi Cadu-Leite, your PDF has some free objects (outlines, JS) that are referenced. I've introduced a fix in my pre-release (https://github.com/pubpub-zz/PyPDF4/releases/tag/1.27.0ppZZ). Please note that this version has been deeply rewritten. In principle, I've kept backward compatibility. I've also started to upgrade the documentation. Can you tell me if it is OK for you?
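
For readers unfamiliar with "free objects": in a classic PDF cross-reference table, each entry ends in `n` (in use) or `f` (free). For an in-use entry the first field is a byte offset; for a free entry it is the number of the next free object, so it must not be used as an offset. The sketch below is a minimal illustration of that distinction. The names `parse_xref_entry` and `resolve_offset` are hypothetical and are not part of PyPDF2, PyPDF4, or pypdf. One plausible reading of the first traceback is that a free entry was followed as if it were an offset, landing the reader near the `%PDF-1.4` header, but that is an assumption and is not confirmed in this thread.

```python
import re

# One classic cross-reference entry: a 10-digit field, a 5-digit generation
# number, then 'n' (in use) or 'f' (free), followed by an end-of-line marker.
XREF_ENTRY = re.compile(rb"^(\d{10}) (\d{5}) ([nf])\s*$")


def parse_xref_entry(line):
    """Return (first_field, generation, in_use) for one xref table entry.

    Hypothetical helper for illustration only.
    """
    match = XREF_ENTRY.match(line)
    if match is None:
        raise ValueError("not a cross-reference entry: %r" % (line,))
    first, generation, kind = match.groups()
    return int(first), int(generation), kind == b"n"


def resolve_offset(line):
    """Return the byte offset of an in-use object, or None for a free one."""
    first, _generation, in_use = parse_xref_entry(line)
    # For a free entry, `first` is the next free object number, not an offset.
    return first if in_use else None


print(resolve_offset(b"0000000017 00000 n \n"))  # 17
print(resolve_offset(b"0000000000 00001 f \n"))  # None
```

A tolerant reader could treat a reference that resolves to `None` as a null object instead of seeking into the file.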

cadu-leite commented 3 years ago

https://github.com/pubpub-zz/PyPDF4/releases/tag/1.27.0ppZZ

I believe you changed the namespace from PyPDF4 to pypdf ... that has to be in BIG LETTERS in the docs.

Continuing...

The error has changed, but it still occurs on the same PDF file, a Google Sheet exported to PDF.

Traceback when trying to read a PDF from Google Sheets:

    Traceback (most recent call last):
      File "/Users/cadu/projs/merge_pdfs/tests/test_merge2pdf.py", line 66, in test_merge_pdf_output
        m.merge_pdfs()
      File "/Users/cadu/projs/merge_pdfs/merge2pdf.py", line 89, in merge_pdfs
        merged_pdf.append(fileobj = file_name)
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/merger.py", line 146, in append
        self.merge(None, fileobj, bookmark, numpages, import_bookmarks)
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/merger.py", line 116, in merge
        self._copy_bookmarks(fileobj.root_object["/Outlines"], bkmark, srcpages)
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/generic.py", line 430, in __getitem__
        return dict.__getitem__(self, key).getObject()
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/generic.py", line 214, in getObject
        return self.pdf.getObject(self).getObject()
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/pdfreader.py", line 488, in get_object
        retval = self._get_object_by_ref(ref, self.R_XTABLE)
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/pdfreader.py", line 284, in _get_object_by_ref
        raise PdfReadError("Cannot fetch a free object (id, next gen.) = (%d, %d)"
    pypdf.utils.PdfReadError: Cannot fetch a free object (id, next gen.) = (2, 0)

Then, after removing the Google Sheets PDF from the list of PDFs to be merged, I got another error.

It seems you changed the keyword parameters ... that's not nice. It will break a lot of scripts, and there is a Pythonic way to handle that: you could accept both names, or not change them at all.

    Traceback (most recent call last):
      File "/Users/cadu/projs/merge_pdfs/tests/test_merge2pdf.py", line 66, in test_merge_pdf_output
        m.merge_pdfs()
      File "/Users/cadu/projs/merge_pdfs/merge2pdf.py", line 87, in merge_pdfs
        merged_pdf.append(fileobj = file_name, pages = page_range)
    TypeError: append() got an unexpected keyword argument 'pages'

But OK, I changed the parameter name to numpages and it works. The problem with PDFs that come from Google Sheets remains, though.

The rest seems to be OK.
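
The "accept both" idea could look roughly like the sketch below. This is only an illustration: the `append` function here is a hypothetical stand-in, not the signature used by PyPDF2, PyPDF4, or the pypdf pre-release, and it simply returns what it was given instead of merging anything. It takes the new keyword `numpages`, still accepts the old keyword `pages`, and emits a `DeprecationWarning` when the old name is used.

```python
import warnings


def append(fileobj, numpages=None, pages=None):
    """Hypothetical stand-in for a merger's append(); not a real library API.

    Accepts the new keyword `numpages` and the old keyword `pages`.
    """
    if pages is not None:
        if numpages is not None:
            raise TypeError("pass either 'numpages' or 'pages', not both")
        warnings.warn(
            "'pages' is deprecated; use 'numpages'",
            DeprecationWarning,
            stacklevel=2,
        )
        numpages = pages
    # A real implementation would merge here; the sketch echoes its inputs.
    return fileobj, numpages


print(append("a.pdf", pages=(0, 2)))     # ('a.pdf', (0, 2))
print(append("a.pdf", numpages=(0, 2)))  # ('a.pdf', (0, 2))
```

Old scripts keep working and get a warning, while new scripts can use the new name right away.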

cadu-leite commented 3 years ago

Please let me know if I can help with anything else.

pubpub-zz commented 3 years ago

Yes! Thanks for the test and the report. I had a look.

About PyPDF4 being renamed to pypdf: that was claird's choice (I don't know why).

First, for the issue with the Google Sheets PDF: I forgot to tell you to set strict to False in the merger init, in order to make the merger tolerant of 'erroneous' files: merged_pdf = PdfFileMerger(strict=False)

Also, I've confirmed the API break you raised, and I've fixed it.

Finally, I found an issue when a NullObject is returned for outlines. That is fixed too.

I have successfully run your test. Please find the update of my library attached. The changes have been committed, but I would like a few beta testers to try it before tagging it.
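
A minimal sketch of a merge that uses the tolerant mode, assuming the PyPDF2 1.x-style API (`PdfFileMerger(strict=False)`, `append(fileobj=...)`, `write(...)`, `close()`). The function name `merge_pdfs` and the `merger_cls` parameter are hypothetical and only mirror the names seen in the tracebacks; in the pre-release discussed here the import path and some keyword names may differ.

```python
def merge_pdfs(paths, output_path, merger_cls=None):
    """Merge the PDFs in `paths` into `output_path` using a tolerant merger.

    Hypothetical helper. `merger_cls` lets the caller pass the merger class
    from whichever package is installed.
    """
    if merger_cls is None:
        # Assumption: a PyPDF2 1.x-style package is installed.
        from PyPDF2 import PdfFileMerger as merger_cls

    merger = merger_cls(strict=False)  # tolerate 'erroneous' files
    try:
        for path in paths:
            merger.append(fileobj=path)
        with open(output_path, "wb") as out:
            merger.write(out)
    finally:
        merger.close()


# Example call (file names are placeholders):
# merge_pdfs(["a.pdf", "sheet_export.pdf"], "merged.pdf")
```

Passing the merger class explicitly keeps the helper usable whether the package is imported as PyPDF2 or under another name.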

pypdf4-1.27.0PPzz_1-py2.py3-none-any.whl.zip

Thanks for your feedback.