claird / PyPDF4

A utility to read and write PDFs with Python

Error when using mergePage: all of the page is merged into one page #18

Closed khasburrahman closed 5 years ago

khasburrahman commented 5 years ago

I recently tried to merge a background into a PDF. There are two PDF files, the background and the content: background.pdf and content.pdf.

I use this code to merge both files:

from PyPDF4 import PdfFileReader, PdfFileWriter, PdfFileMerger
from io import BytesIO

contentFile = open('content.pdf', 'rb')
backgroundFile = open('background.pdf', 'rb')

contentReader = PdfFileReader(contentFile)
bgReader = PdfFileReader(backgroundFile)

writer = PdfFileWriter()
for x in range (contentReader.getNumPages()):
    bg = bgReader.getPage(0)
    content = contentReader.getPage(x)
    bg.mergePage(content)
    writer.addPage(bg)

output = open('result.pdf', 'wb')
writer.write(output)

The results: the background is copied to all of the page, but the all the content seems to merged on each page. Is there something wrong in my code? result.pdf

acsor commented 5 years ago

the results: the background is copied to all of the page, but the all the content seems to merged on each page

What do you mean by "but ~the~ all the content seems to [be] merged on each page"? (Sorry for the edits, it's just for clarity :-).)

From what I can see, the "content PDF" seems to be merged twice into the output, which is slightly noticeable from the text thickness.

khasburrahman commented 5 years ago

@newnone Thanks for the edit, and apologies for my lack of clarity 😄

I mean the generated result is something like this: each page has background.pdf and pages 1-4 of content.pdf.

What I intended to do is add the background behind each content page, so page 1 would be background.pdf plus page 1 of content.pdf, page 2 would be background.pdf plus page 2 of content.pdf, and so on.

acsor commented 5 years ago

On more careful inspection, I noticed that all four pages from content.pdf were merged into a single page of result.pdf.

I have edited the code and the problem appears to be solved. Hopefully, this is a problem with the script at hand and not with the library:

from io import BytesIO                                                         
from os.path import abspath, dirname, join, pardir                             
from sys import path                                                           

SCRIPT_ROOT = dirname(__file__)                                                
PROJECT_ROOT = abspath(join(SCRIPT_ROOT, pardir))                              

path.append(PROJECT_ROOT)                                                      

from pypdf4.pdf import PdfFileReader, PdfFileWriter                            
from pypdf4.merger import PdfFileMerger                                        

contentFile = open(join(SCRIPT_ROOT, 'content.pdf'), 'rb')                     
backgroundFile = open(join(SCRIPT_ROOT, 'background.pdf'), 'rb')               

contentReader = PdfFileReader(contentFile)                                     
bgReader = PdfFileReader(backgroundFile)                                       
writer = PdfFileWriter()                                                       
bg = bgReader.getPage(0)                                                       

for pagenum in range(contentReader.numPages):                                  
    page = contentReader.getPage(pagenum)                                      
    page.mergePage(bg)                                                         
    writer.addPage(page)                                                       

output = open(join(SCRIPT_ROOT, 'result.pdf'), 'wb')                           
writer.write(output)                                                           

contentFile.close()                                                            
backgroundFile.close()                                                         
output.close()

This is result.pdf that is generated by the updated script.

The explanation for why this was happening is fairly simple. In the previous version of the code, in

for x in range (contentReader.getNumPages()):
    bg = bgReader.getPage(0)

getPage(0) plausibly returned a reference to the same object every time, even though bg = bgReader.getPage(0) was invoked on each iteration, so the effects of each merge accumulated on that one page. With the updated for-loop body, it is the page from content.pdf that performs the merge (taking the only page from background.pdf as its argument), and each content page is a distinct object, so nothing accumulates.
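To make the accumulation concrete without needing the PDF files, here is a minimal sketch in plain Python. FakePage, FakeReader, make_readers and the two merge_* functions are hypothetical stand-ins invented for illustration, not part of PyPDF; the only assumptions are that the reader hands out the same page object on every call and that merging mutates the receiver in place.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page: just a list of drawn items."""

    def __init__(self, items):
        self.items = list(items)

    def merge(self, other):
        # Assumption: merging mutates the receiver in place.
        self.items.extend(other.items)


class FakeReader:
    """Hypothetical reader that always returns the same page objects."""

    def __init__(self, pages):
        self._pages = pages

    def get_page(self, index):
        return self._pages[index]


def make_readers():
    """Build a one-page background reader and a four-page content reader."""
    bg_reader = FakeReader([FakePage(["background"])])
    content_reader = FakeReader(
        [FakePage(["content %d" % n]) for n in range(1, 5)]
    )
    return bg_reader, content_reader


def merge_onto_background(bg_reader, content_reader, count):
    """Shape of the original loop: the shared background is the receiver."""
    out = []
    for i in range(count):
        bg = bg_reader.get_page(0)  # the same object on every iteration
        bg.merge(content_reader.get_page(i))
        out.append(bg)
    return out


def merge_onto_content(bg_reader, content_reader, count):
    """Shape of the updated loop: each distinct content page is the receiver."""
    out = []
    bg = bg_reader.get_page(0)
    for i in range(count):
        page = content_reader.get_page(i)
        page.merge(bg)
        out.append(page)
    return out


if __name__ == "__main__":
    for merge in (merge_onto_background, merge_onto_content):
        bg_reader, content_reader = make_readers()
        for page in merge(bg_reader, content_reader, 4):
            print(merge.__name__, page.items)
```

In the first variant the output list holds four references to one page that has absorbed every content page; in the second, each output page holds only its own content plus the background.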

acsor commented 5 years ago

getPage(0) plausibly returned a reference to the same object

Yes, if I do:

bg = bgReader.getPage(0)

for pagenum in range(contentReader.numPages):
    # The is operator checks for identity and differs from ==
    print(bgReader.getPage(0) is bg)
    page = contentReader.getPage(pagenum)
    ...

the console prints:

$ python3 ./merge.py
True
True
True
True

That says it all.

khasburrahman commented 5 years ago

@newnone Thanks for the example 👍

That will work fine if background.pdf doesn't have any image overlapping the content. I tried with a different file, background2.pdf, that has an image block. In the result, the background blocks the content. That's why I merge like this:

for x in range (contentReader.getNumPages()):
    bg = bgReader.getPage(0)

but it didn't work, as described earlier. Do you have any suggestions?
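A note on why the direction of the merge matters here: this is a toy model, assuming that mergePage draws the argument page's content after the receiver's own content, so the argument ends up on top wherever the two overlap. The function names and the list-of-strings representation are hypothetical and only illustrate the draw order; they are not PyPDF calls.

```python
def merge_page(receiver_ops, argument_ops):
    """Toy model: the receiver's drawing operations, then the argument's."""
    return receiver_ops + argument_ops


def visible_at_overlap(ops):
    """Painter's model: the last opaque item drawn at a spot is what you see."""
    return ops[-1]


content = ["content text"]
background = ["opaque background image"]

# page.mergePage(bg): the background is drawn last and covers the text.
content_then_bg = merge_page(content, background)

# bg.mergePage(page): the text is drawn last and stays visible.
bg_then_content = merge_page(background, content)
```

Under that assumption, keeping the content visible requires the background to be the receiver, which is exactly the case where the shared page object becomes a problem.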

acsor commented 5 years ago

You just need to have a copy of bg such that on each iteration bg == bgReader.getPage(0) and bg is not bgReader.getPage(0). A common solution in other programming languages is to have a copy constructor.

PyPDF has nothing of that sort, so we need to resort to other means. I do not consider this solution safe, but since PageObject inherits from dict we can leverage dict.update(). I cannot guarantee this will work in future versions of PyPDF, but it works now:

from io import BytesIO
from os.path import abspath, dirname, join, pardir
from sys import path

SCRIPT_ROOT = dirname(__file__)
PROJECT_ROOT = abspath(join(SCRIPT_ROOT, pardir))

path.append(PROJECT_ROOT)

from pypdf4.pdf import PdfFileReader, PdfFileWriter, PageObject
from pypdf4.merger import PdfFileMerger

contentFile = open(join(SCRIPT_ROOT, 'content.pdf'), 'rb')
backgroundFile = open(join(SCRIPT_ROOT, 'background2.pdf'), 'rb')

contentReader = PdfFileReader(contentFile)
bgReader = PdfFileReader(backgroundFile)
writer = PdfFileWriter()
bgTemplate = bgReader.getPage(0)

for pagenum in range(contentReader.numPages):
    bgCopy = PageObject(bgReader, bgTemplate.indirectRef)
    bgCopy.update(bgTemplate)
    # Replace with a call to assert or to "raise *Exception" if you go to production
    print(bgCopy == bgTemplate and bgCopy is not bgTemplate)

    page = contentReader.getPage(pagenum)
    bgCopy.mergePage(page)
    writer.addPage(bgCopy)

output = open(join(SCRIPT_ROOT, 'result.pdf'), 'wb')
writer.write(output)

contentFile.close()
backgroundFile.close()
output.close()

$ python3 ./Issue18/merge.py
True
True
True
True

with result.pdf as output.
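One caveat worth spelling out: dict.update() makes a shallow copy. The sketch below uses a hypothetical Page class (a bare dict subclass standing in for PageObject, with made-up keys and values) to show what that means: the copy is equal to the template but is a different object, rebinding a key on the copy leaves the template alone, yet nested values are still shared between the two.

```python
class Page(dict):
    """Hypothetical dict-based page, standing in for PageObject."""


def copy_page(template):
    """Shallow-copy a page the same way the script does, via dict.update()."""
    clone = Page()
    clone.update(template)
    return clone


template = Page({"/Contents": ["background"], "/Resources": {"fonts": []}})
clone = copy_page(template)

# Equal in content, but not the same object.
equal_but_distinct = clone == template and clone is not template

# Rebinding a key on the clone does not touch the template.
clone["/Contents"] = clone["/Contents"] + ["content 1"]

# Nested values are shared: both pages point at the same inner dict.
shares_resources = clone["/Resources"] is template["/Resources"]
```

So the trick is safe only as long as the merge replaces top-level entries on the copy rather than mutating nested objects in place, which is part of why it may not survive future library versions.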

khasburrahman commented 5 years ago

😄 Thank you very much! I'll close this issue as this is solved!

acsor commented 5 years ago

Well done. Feel free to follow me if you'd like to stay up to date with PyPDF (or just want to return the small favor).