Bladieblah / xpdf-python

Python wrapper around the pdftotext functionality of xpdf
GNU General Public License v3.0
2 stars 2 forks source link

Add image extraction #13

Closed Bladieblah closed 1 year ago

Bladieblah commented 1 year ago

@ReMiOS images that require drawImageMask need 1 small fix but most images should work now

Closes #12

ReMiOS commented 1 year ago

Very Nice !!

I've made a simple test to write the images to png

test

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os

from xpydf.pdf_loader import PdfLoader
from PIL import Image

filename = os.path.join( 'test_data' , 'xpdf_tests_password.pdf')

loader = PdfLoader( filename, user_password = 'userpassword' )

# pdf_text = loader.extract_strings()
pdf_info = loader.extract_page_info()

for page in pdf_info:
    pdf_images = loader.extract_images( page[ 'page_number'] )

    for idx, image in enumerate( pdf_images):
        img = Image.fromarray( image, 'RGB' )
        img.save( f'''image_{page[ 'page_number']}_{idx}.png''' )
        print( f'''image_{page[ 'page_number']}_{idx}.png written''' )

Output:

3 images
3 images
image_1_0.png written
image_1_1.png written
image_1_2.png written

P.S. i think the "3 images" info are for testing purposes ?

Bladieblah commented 1 year ago

@ReMiOS I added some more tests, and added another file encoding. This should cover all possible image types, but no guarantees since pdfs are unpredictable and I havent tested on a lot of data yet :)

If you want you can try it yourself a bit more before I merge, if I have time I'll do some larger tests as well

ReMiOS commented 1 year ago

I'll do some more testing. First impressions look good

ReMiOS commented 1 year ago

Update: I've done more testing with various PDF's and i could not find any unexpected results.

Sample pdf's containing images: https://www.appsloveworld.com/download-sample-pdf-files-for-testing https://www.learningcontainer.com/sample-pdf-files-for-testing/

Bladieblah commented 1 year ago

Nice! I've run some tests as well and it seems to work well