Research-IT-Swiss-TPH / pdf-form-filling-api

API to read, fill and flatten PDF forms.
https://143.198.242.211.sslip.io/

Check for filling field type "image" possibilities #2

Closed tertek closed 10 months ago

tertek commented 1 year ago

Check whether there is an open-source option for filling images into PDF form fields. Otherwise, implement it ourselves.

tertek commented 12 months ago

@edenst-TPH I have found a Python library that can read and write PDF form fields. It supports more field types than pdftk, though still not image fields, but it should be able to modify images within a PDF.

borb: https://github.com/jorisschellekens/borb
Examples: https://github.com/jorisschellekens/borb-examples

tertek commented 12 months ago

@edenst-TPH I have tested the library and it is not sufficient. I will keep looking for solutions.

tertek commented 12 months ago

I have tried another library, this time from Apache: https://pdfbox.apache.org/

The tool runs very robustly, but unfortunately image field types are not supported.

If you already have Docker installed, you can find the repo here and test it: https://github.com/tertek/pdfbox-docker

In the meantime, I have emailed the Apache PDFBox community to ask whether there is a simple way to fill image fields. Maybe something will come of it ...

tertek commented 12 months ago

next to check: https://github.com/michaelrsweet/pdfio

edenst-TPH commented 11 months ago

Hi Ekin

Yes, PDFio apparently has some functions for images.

Best regards, Stephan

(Sent by email on 04.12.2023, in reply to "next to check: https://github.com/michaelrsweet/pdfio".)



tertek commented 11 months ago

But I think only for creating images, unfortunately not for reading or modifying them...

tertek commented 11 months ago

But it apparently works with PyMuPDF:

https://github.com/pymupdf/PyMuPDF/discussions/924#discussioncomment-412135

https://pymupdf.readthedocs.io/en/latest/index.html

tertek commented 11 months ago

Oh, and here is another library that seems very powerful:

https://pdf-lib.js.org/

tertek commented 11 months ago

Unfortunately, pdf-lib.js does not support image extraction or replacement.

Here is a guide on how to replace images with Python and pdftk:

https://arunmozhi.in/2019/01/29/replacing-image-in-a-pdf-with-python/

tertek commented 11 months ago

Summary:

Looks good, doesn't it?

edenst-TPH commented 11 months ago

(Sent by email on 05.12.2023, in reply to "Aber hier mit pyMuPDF geht es scheinbar", the PyMuPDF comment above.)

Answer from ChatGPT :-)



*** From ChatGPT: replace an image in a PDF, using Python with PyMuPDF ***

pip install pymupdf

import fitz  # PyMuPDF library

def replace_image(pdf_path, new_image_path, page_number, output_path):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    # Get the specified page (page_number is 1-based)
    page = pdf_document[page_number - 1]

    # Get the existing images on the page
    images = page.get_images(full=True)

    # Choose the image you want to replace (you may need to adjust this based on your PDF structure)
    if images:
        # The first entry of each image item is its xref
        xref = images[0][0]

        # Replace the old image with the new one.
        # Page.replace_image() requires PyMuPDF 1.21 or later.
        page.replace_image(xref, filename=new_image_path)

        # Save the changes to a new PDF file
        pdf_document.save(output_path)

        print(f"Image replaced successfully on page {page_number} of {output_path}")
    else:
        print(f"No images found on page {page_number}")

    # Close the PDF document
    pdf_document.close()

# Example usage
replace_image("input.pdf", "new_image.jpg", 1, "output.pdf")

tertek commented 11 months ago

From the answer linked above: https://github.com/pymupdf/PyMuPDF/discussions/924#discussioncomment-412135

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder: your PDF filename
pno = 0  # placeholder: zero-based page number
page = doc[pno]  # read the page at page number pno
img_list = page.get_images(full=True)  # a list of all images on that page
# select the item referencing the old image (hope you know how to identify it!)
# Each item looks like: (1315, 0, 1945, 1004, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode', 0)
# first entry is xref, etc.
item = img_list[0]  # placeholder choice: simply the first image on the page
bbox = page.get_image_bbox(item)  # where the old image lives
page.add_redact_annot(bbox)  # mark that rectangle as to-be-deleted
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_REMOVE)  # delete old image
page.insert_image(bbox, filename="new_image.jpg")  # insert new image (placeholder filename)
doc.save("output.pdf")  # placeholder: output filename

Let's see which one is better. Will you try it out?

tertek commented 11 months ago

To Do:

edenst-TPH commented 11 months ago

FormWithImagesInLayout.pdf

Acrobat Pro lets me insert images, but without an ID or any other attributes; there is no info or properties for the inserted images. Exporting as xlsx or html does not help.

edenst-TPH commented 11 months ago

> Unfortunately, pdf-lib.js does not support image extraction or replacement.
>
> Here is a guide on how to replace images with Python and pdftk:
>
> https://arunmozhi.in/2019/01/29/replacing-image-in-a-pdf-with-python/

Yes, that apparently lets you extract an image list.

edenst-TPH commented 11 months ago

pdfimages (part of poppler-utils) can list and also export the images of a PDF:
https://arunmozhi.in/2019/01/29/replacing-image-in-a-pdf-with-python/
https://www.cyberciti.biz/faq/easily-extract-images-from-pdf-file/

Install poppler, which comes with pdfimages:
sudo apt-get install poppler-utils
pdfimages --version

Let pdftk free the PDF (and uncompress it for easier handling):
pdftk my.pdf output myFreeUncompressed.pdf uncompress

List the images in the PDF, including the object ID:
pdfimages -list myFreeUncompressed.pdf

Export the images from the PDF file, named by (image id??):
pdfimages myFreeUncompressed.pdf ./export/
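To use the listing from a script, the output of `pdfimages -list` could be parsed into records. The following is only a sketch: the function names `parse_pdfimages_list` and `list_images` are made up for illustration, the sample listing is invented, and the column order (page, num, type, width, height, color, comp, bpc, enc, interp, object number, generation, ...) is an assumption based on typical poppler output that should be checked against the installed version. As far as I know, exported files are numbered by the `num` column rather than by the object ID, so the listing is what links an exported file to its PDF object.

```python
import subprocess
from typing import Dict, List


def parse_pdfimages_list(text: str) -> List[Dict[str, int]]:
    """Parse `pdfimages -list` output into page, size and object number.

    Assumed column order (verify against your poppler version):
    page num type width height color comp bpc enc interp object ID ...
    where "object ID" consists of two values: object number and generation.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header, the dashed separator and anything too short.
        if len(parts) < 12 or not parts[0].isdigit():
            continue
        # Inline images have no numeric object number; skip them.
        if not parts[10].isdigit():
            continue
        rows.append(
            {
                "page": int(parts[0]),
                "num": int(parts[1]),
                "width": int(parts[3]),
                "height": int(parts[4]),
                "object": int(parts[10]),
            }
        )
    return rows


def list_images(pdf_path: str) -> List[Dict[str, int]]:
    """Run pdfimages (must be on PATH) and parse its listing."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_pdfimages_list(result.stdout)


# Invented sample listing, used only to exercise the parser.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1945  1004  rgb     3   8  jpeg   no      1315  0   150   150  210K 3.7%
   1     1 image     300   300  rgb     3   8  image  no      1320  0    72    72 12.3K 4.6%
"""

for row in parse_pdfimages_list(SAMPLE):
    print(row)
```

With a real file, `list_images("myFreeUncompressed.pdf")` would return the same kind of records, and the `object` value should correspond to the xref that PyMuPDF reports for the same image.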

tertek commented 11 months ago

@edenst-TPH What would exporting be useful for? I can't think of a use case right now.

tertek commented 11 months ago

test_img_uc.pdf

I have tested PyMuPDF:

# main.py                                                      

import fitz # imports the pymupdf library

doc = fitz.open("test_img_uc.pdf") # open a document
page = doc[0]
img_list = page.get_images(full=True) # a list of all images on that page

item = img_list[0]
bbox = page.get_image_bbox(item)

print(img_list)

print(bbox)

Image

We would still need to check whether the image can be replaced by its ID.

edenst-TPH commented 11 months ago

> @edenst-TPH What would exporting be useful for? I can't think of a use case right now.

If the image that is to be replaced by the QR code can be recognized as an image and matched with its image-list entry, we get the PDF object ID from there. The matching is the problem; maybe the exact dimensions and file size would help?

imglist.txt
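A minimal sketch of the matching idea, assuming the placeholder image has a pixel size that no other image on the page shares. It matches only on dimensions (file size is left out), and the names `find_image_xrefs` and `replace_placeholder` are hypothetical. The item layout follows the tuple quoted earlier in this thread, and `Page.replace_image()` needs PyMuPDF 1.21 or later.

```python
from typing import Iterable, List, Optional, Sequence


def find_image_xrefs(
    img_list: Iterable[Sequence],
    width: int,
    height: int,
    filter_name: Optional[str] = None,
) -> List[int]:
    """Return the xrefs of all image-list items with the given pixel size.

    Each item is assumed to look like the tuple quoted in this thread:
    (xref, smask, width, height, bpc, colorspace, alt_colorspace,
     name, filter, referencer)
    """
    matches = []
    for item in img_list:
        xref, _smask, w, h = item[:4]
        if (w, h) != (width, height):
            continue
        # Optionally narrow the match by the stream filter, e.g. "DCTDecode".
        if filter_name is not None and item[8] != filter_name:
            continue
        matches.append(xref)
    return matches


def replace_placeholder(pdf_path, new_image_path, output_path,
                        width, height, page_index=0):
    """Replace the one image of the given size on a page (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so the matcher works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        xrefs = find_image_xrefs(page.get_images(full=True), width, height)
        # Refuse to guess when the size is missing or ambiguous.
        if len(xrefs) != 1:
            raise ValueError(f"expected exactly one match, found {len(xrefs)}")
        page.replace_image(xrefs[0], filename=new_image_path)
        doc.save(output_path)
    finally:
        doc.close()


# Items in the shape quoted above; the second entry is invented.
sample = [
    (1315, 0, 1945, 1004, 8, "DeviceRGB", "", "Im1", "DCTDecode", 0),
    (1320, 0, 300, 300, 8, "DeviceRGB", "", "Im2", "FlateDecode", 0),
]
print(find_image_xrefs(sample, 300, 300))  # prints [1320]
```

Raising an error on zero or multiple matches is a deliberate choice here: if two images share the same size, dimensions alone cannot identify the placeholder, and a second criterion would be needed.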

tertek commented 10 months ago

Image filling has been identified as a rather cumbersome task and will be left out for now.