Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
MIT License

Problem converting from Image coordinates to PDF coordinates #261

Open Pocoyo7798 opened 1 year ago

Pocoyo7798 commented 1 year ago

Hi! I have code that extracts tables from PDF files. To identify the tables I'm using layoutparser, so I need to convert the image coordinates into PDF coordinates. To do this, the PDF is converted into images using pdf2image, the layout model runs on each page image to get the coordinates and type of each block, the image size is obtained with Pillow, and the PDF page size is obtained with PyPDF2. The conversion is then done with the following equation for all 4 box coordinates (x1, y1, x2, y2): x1 = image_box_x_1 * pdf_width / image_width. The code is the following:

def find_blocks_layoutparser(file_path: str, pdf, model):
    page_list = convert_from_path(file_path)
    block_boxes = []
    extracted_blocks = {}
    page_index = 0
    # Initiate the parser model
    for page in page_list:
        page.save(f'page{page_index}.jpg')
        # Detect all block in a page
        layout = model.detect(page)
        boxes = []
        width, height = page.size
        pdf_page = pdf.pages[page_index]
        pdf_size = pdf_page.mediabox
        pdf_width = pdf_size[2] - pdf_size[0]
        pdf_height = pdf_size[3] - pdf_size[1]
        for entry in layout:
            # Retrieve the bounding box
            x1 = entry.block.x_1 / width * float(pdf_width)
            x2 = entry.block.x_2 / width * float(pdf_width)
            y1 = entry.block.y_1 / height * float(pdf_height)
            y2 = entry.block.y_2 / height * float(pdf_height)
            boxes.append([x1, y1, x2, y2])
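The scaling used above can be isolated into a small pure function, which makes it easier to test on known values. The sketch below is only an illustration: the name `image_box_to_pdf` and the `flip_y` option are hypothetical and not part of pdf2image, layoutparser, or PyPDF2. The flip is included because image coordinates grow downward from the top-left corner while PDF user space normally grows upward from the bottom-left. Whether it is needed depends on the library that consumes the boxes, which the post does not state.

```python
def image_box_to_pdf(box, image_size, pdf_size, flip_y=False):
    """Scale an (x1, y1, x2, y2) box from image pixels to PDF units.

    Hypothetical helper: image_size and pdf_size are (width, height) pairs.
    """
    x1, y1, x2, y2 = box
    image_width, image_height = image_size
    pdf_width, pdf_height = pdf_size

    # Same ratio as in the post: pdf_width / image_width (and likewise for height).
    scale_x = float(pdf_width) / image_width
    scale_y = float(pdf_height) / image_height

    px1, px2 = x1 * scale_x, x2 * scale_x
    py1, py2 = y1 * scale_y, y2 * scale_y

    if flip_y:
        # Assumption: the consumer expects a bottom-left origin with y growing upward.
        py1, py2 = pdf_height - py2, pdf_height - py1

    return [px1, py1, px2, py2]


# A US Letter page (612 x 792 points) rendered at 200 dpi is 1700 x 2200 pixels.
box = image_box_to_pdf((170, 220, 850, 1100), (1700, 2200), (612, 792))
flipped = image_box_to_pdf((170, 220, 850, 1100), (1700, 2200), (612, 792), flip_y=True)
```

Without the flip, the example box maps to roughly (61.2, 79.2, 306.0, 396.0). With the flip, the y values become roughly 396.0 and 712.8.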

The rectangles obtained are the following: [image: Caso5]

This is the first PDF where I have had this problem; every test before this one was fine. Since the block coordinates are correct for each page image (I verified it), I think the problem is with the conversion of the PDF to an image. Does anyone have any idea how to solve this problem?
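One way to test that suspicion is to compare the aspect ratio of each rendered image with that of the page's media box. This is a sketch under assumptions, not a confirmed diagnosis: a mismatch could mean the page is rendered rotated (for example because of a page rotation entry) or that the rendered area is not the media box. The helper name `compare_page_shapes` and its tolerance are hypothetical.

```python
def compare_page_shapes(image_size, pdf_size, tolerance=0.01):
    """Compare image and PDF page aspect ratios.

    Hypothetical helper. Returns "match", "rotated" (width and height appear
    swapped), or "different" (neither orientation fits).
    """
    image_width, image_height = image_size
    pdf_width, pdf_height = pdf_size

    image_ratio = image_width / image_height
    pdf_ratio = float(pdf_width) / float(pdf_height)
    swapped_ratio = 1.0 / pdf_ratio

    # Relative comparison so the tolerance behaves the same for any page size.
    if abs(image_ratio - pdf_ratio) <= tolerance * pdf_ratio:
        return "match"
    if abs(image_ratio - swapped_ratio) <= tolerance * swapped_ratio:
        return "rotated"
    return "different"


portrait = compare_page_shapes((1700, 2200), (612, 792))
landscape = compare_page_shapes((2200, 1700), (612, 792))
cropped = compare_page_shapes((1700, 2000), (612, 792))
```

If a page reports "rotated" or "different", scaling by `pdf_width / image_width` alone would not be expected to place the boxes correctly on that page.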

Thanks in advance!