claird / PyPDF4

A utility to read and write PDFs with Python

Invalid literal for int() ... for a PDF downloaded from a Google Docs spreadsheet #86

Open cadu-leite opened 3 years ago

cadu-leite commented 3 years ago

The PDF file is attached: pdf_sample_googlesheet_pages_02.pdf


  File "/usr/local/lib/python3.8/site-packages/PyPDF2/", line 1599, in getObject
    idnum, generation = self.readObjectHeader(
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'F-1.4'


Code with a test included can be seen in this repo (merge2pdf).

pubpub-zz commented 3 years ago

Hi Cadu-Leite, your PDF has some free objects (outlines, JS) that are still referenced. I've introduced a fix in my pre-release ( Please note that this version has been deeply rewritten. I have tried to keep backward compatibility. I've also started to upgrade the documentation. Can you tell me if it works for you?
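To make "free objects" concrete: in a classic PDF cross-reference table each entry ends in `n` (in use) or `f` (free), and for a free entry the first field is the number of the next free object rather than a byte offset. The following is only a minimal sketch of listing the free entries in such a table. The helper name `free_object_ids` and the sample bytes are made up for illustration, it is not how PyPDF4 or the pre-release parses files, and it ignores cross-reference streams.

```python
import re
from typing import List

# One classic xref entry: 10-digit field, 5-digit generation, then a flag.
# 'n' = in use (first field is a byte offset), 'f' = free (first field is
# the number of the next free object).
XREF_ENTRY = re.compile(rb"^(\d{10}) (\d{5}) ([nf])\s*$")


def free_object_ids(xref_section: bytes) -> List[int]:
    """Return the object numbers marked free in a classic xref section."""
    lines = xref_section.splitlines()
    free_ids = []
    i = 0
    if lines and lines[0].strip() == b"xref":
        i = 1  # skip the 'xref' keyword line
    while i < len(lines):
        header = lines[i].split()
        # A subsection header is "<first object number> <entry count>".
        if len(header) != 2 or not all(part.isdigit() for part in header):
            break  # reached 'trailer' or something unexpected
        first_id, count = int(header[0]), int(header[1])
        for offset in range(count):
            entry_index = i + 1 + offset
            if entry_index >= len(lines):
                break  # truncated table
            match = XREF_ENTRY.match(lines[entry_index])
            if match and match.group(3) == b"f":
                free_ids.append(first_id + offset)
        i += 1 + count
    return free_ids


# Hypothetical table: objects 0 and 2 are free, 1 and 3 are in use.
SAMPLE = (
    b"xref\n"
    b"0 4\n"
    b"0000000002 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000000 00001 f \n"
    b"0000000081 00000 n \n"
    b"trailer\n"
)

print(free_object_ids(SAMPLE))  # [0, 2]
```

If the document catalog still points at one of the listed numbers (for example through `/Outlines`), a reader has to decide what to do with that reference.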

cadu-leite commented 3 years ago

I believe you changed the namespace from PyPDF4 to pypdf ... that has to be in BIG LETTERS in the docs.


The error has changed, but it is still raised on the same PDF file, a Google Sheet exported to PDF.

Traceback from trying to read a PDF exported from Google Sheets:

    Traceback (most recent call last):
      File "/Users/cadu/projs/merge_pdfs/tests/", line 66, in test_merge_pdf_output
      File "/Users/cadu/projs/merge_pdfs/", line 89, in merge_pdfs
        merged_pdf.append(fileobj = file_name)
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/", line 146, in append
        self.merge(None, fileobj, bookmark, numpages, import_bookmarks)
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/", line 116, in merge
        self._copy_bookmarks(fileobj.root_object["/Outlines"], bkmark, srcpages)
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/", line 430, in __getitem__
        return dict.__getitem__(self, key).getObject()
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/", line 214, in getObject
        return self.pdf.getObject(self).getObject()
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/", line 488, in get_object
        retval = self._get_object_by_ref(ref, self.R_XTABLE)
      File "/Users/cadu/.virtualenvs/merge_pdfs_pypdf4/lib/python3.8/site-packages/pypdf/", line 284, in _get_object_by_ref
        raise PdfReadError("Cannot fetch a free object (id, next gen.) = (%d, %d)"
    pypdf.utils.PdfReadError: Cannot fetch a free object (id, next gen.) = (2, 0)

Then, after removing the Google Sheet PDF from the list of PDFs to be merged, I got another error.

It seems you changed the keyword parameters ... that's not nice. It will break a lot of scripts, and there is a Pythonic way to handle this: you could accept both names, or not change them at all.

    Traceback (most recent call last):
      File "/Users/cadu/projs/merge_pdfs/tests/", line 66, in test_merge_pdf_output
      File "/Users/cadu/projs/merge_pdfs/", line 87, in merge_pdfs
        merged_pdf.append(fileobj = file_name, pages = page_range)
    TypeError: append() got an unexpected keyword argument 'pages'
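The "accept both" suggestion above could look roughly like the sketch below. `append_compat` is a hypothetical stand-in, not the library's `append`; only the two keyword names `pages` and `numpages` come from this thread, and the returned dict exists purely so the behaviour can be inspected.

```python
def append_compat(fileobj, bookmark=None, pages=None,
                  import_bookmarks=True, numpages=None):
    """Hypothetical stand-in for a merger's append() taking either keyword."""
    if pages is not None and numpages is not None:
        # Refuse ambiguous calls instead of silently picking one value.
        raise TypeError("pass either 'pages' or 'numpages', not both")
    page_range = numpages if numpages is not None else pages
    # A real merger would read the file here; the sketch only reports
    # what it would have used.
    return {"fileobj": fileobj, "pages": page_range}


# Both spellings reach the same code path.
print(append_compat("a.pdf", pages=(0, 2)))
print(append_compat("a.pdf", numpages=(0, 2)))
```

A `warnings.warn(..., DeprecationWarning)` on whichever name is being retired would let existing scripts keep working while signalling the change.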

But OK, I changed the parameter name to numpages and it works. The problem with PDFs that come from Google Sheets remains, though.

The rest seems to be OK.

cadu-leite commented 3 years ago

Please let me know if I can help with anything else.

pubpub-zz commented 3 years ago

Yes! Thanks for the test and the report. I had a look.

About PyPDF4 being renamed to pypdf: that was claird's choice (I don't know why).

First, for the issue with the Google Sheet PDF, I forgot to tell you to set strict to False in the merger init in order to make the merger tolerant of 'erroneous' files: merged_pdf = PdfFileMerger(strict=False)

Also, I've fixed the API break you raised. Finally, I found an issue when a NullObject is returned for outlines; that is fixed as well.

I've successfully run your test with the update of my library. The changes have been committed, but I would like a few beta testers before tagging it.

Thanks for your feedback.
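The difference between strict and tolerant handling of a reference to a free object can be illustrated with a small sketch. Everything here is hypothetical (`FreeObjectError`, `resolve`, `outline_titles`, and the dict-based object table) and it is not the library's implementation. It relies on the PDF specification's rule that a reference to an undefined object is treated as the null object, modelled here as `None`.

```python
from typing import Dict, Tuple


class FreeObjectError(Exception):
    """Hypothetical stand-in for the reader's PdfReadError."""


# (object id, generation) -> parsed object, holding in-use objects only.
Objects = Dict[Tuple[int, int], object]


def resolve(ref: Tuple[int, int], objects: Objects, strict: bool = True):
    """Look up an indirect reference, raising or returning None if missing."""
    if ref in objects:
        return objects[ref]
    if strict:
        raise FreeObjectError(
            "Cannot fetch a free object (id, gen) = (%d, %d)" % ref)
    return None  # tolerant mode: behave as if the null object was found


def outline_titles(catalog: dict, objects: Objects, strict: bool = True) -> list:
    """Return outline titles, coping with a missing or null /Outlines."""
    ref = catalog.get("/Outlines")
    if ref is None:
        return []
    outlines = resolve(ref, objects, strict)
    if outlines is None:
        return []  # null outlines: nothing to copy
    return list(outlines)


# Object (2, 0) is referenced by the catalog but is not in use.
objects = {(1, 0): ["Intro"]}
catalog = {"/Outlines": (2, 0)}

print(outline_titles(catalog, objects, strict=False))  # []
```

With `strict=True` the same call raises `FreeObjectError`, which mirrors the "Cannot fetch a free object" failure reported earlier in the thread.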