camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.9k stars, 461 forks

Specify layout dimensions of page #226

Open talha298 opened 3 years ago

talha298 commented 3 years ago

I am using the script below to get the page dimensions from Camelot.

For one document the layout dimensions came out as 792, 612, and for another as 833, 644. Is it possible to tell Camelot to scale the page to particular dimensions?

I am asking because I use PyMuPDF to extract words and their bounding boxes in order to locate the column headers. The header coordinates give me the column boundaries, which I want to pass to camelot.read_pdf() in stream mode. Right now it is difficult to reconcile the coordinate systems of the two packages. If Camelot had a way to specify layout dimensions, I could give it the page layout reported by PyMuPDF.
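To make the mapping concrete, here is a minimal sketch of how header bounding boxes could be turned into a column string for stream mode. It is not part of Camelot or PyMuPDF: `column_separators` and `to_pdf_space` are hypothetical helpers, and the sketch assumes an unrotated page, boxes given as (x0, y0, x1, y1) with a top-left origin (as PyMuPDF reports them), and a bottom-left origin on the Camelot side. The width ratio only matters if the two libraries really report different page widths for the same page.

```python
def column_separators(header_boxes, src_width, dst_width):
    """Build a comma-separated string of column x-positions.

    header_boxes: (x0, y0, x1, y1) tuples, one per column header.
    src_width:    page width in the coordinate system of the boxes.
    dst_width:    page width in the target coordinate system.
    (Hypothetical helper, not a Camelot or PyMuPDF function.)
    """
    scale = dst_width / src_width
    boxes = sorted(header_boxes, key=lambda b: b[0])
    separators = []
    # Place each separator midway between neighbouring headers.
    for left, right in zip(boxes, boxes[1:]):
        midpoint = (left[2] + right[0]) / 2.0
        separators.append(midpoint * scale)
    return ",".join(f"{x:.1f}" for x in separators)


def to_pdf_space(box, page_height):
    """Flip a top-left-origin box to bottom-left-origin coordinates.

    Returns (x_left, y_top, x_right, y_bottom) measured from the
    bottom of the page. (Hypothetical helper.)
    """
    x0, y0, x1, y1 = box
    return (x0, page_height - y0, x1, page_height - y1)


headers = [(50, 10, 100, 20), (150, 10, 200, 20), (300, 10, 350, 20)]
cols = column_separators(headers, src_width=612, dst_width=612)
area = to_pdf_space((50, 10, 350, 700), page_height=792)
```

Under these assumptions, `cols` could then be passed as `camelot.read_pdf(filename, flavor="stream", columns=[cols])`, with the boxes taken from the first four fields of each entry in PyMuPDF's `page.get_text("words")`. Only the x-coordinates are needed for columns; the y-flip is relevant when passing areas.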


Steps to reproduce the bug

Steps used to install camelot: pip install camelot-py==0.8.2

Code

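from camelot import utils

# Placeholder path; substitute the PDF being inspected.
filename = "document.pdf"

# get_page_layout returns the page layout object and a (width, height) tuple.
layout, dim = utils.get_page_layout(filename)
page_width = layout.width
page_height = layout.height
print(f"page_width: {page_width}, page_height: {page_height}")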

akhil4rajan commented 1 year ago

Hi Team,

Is there any way to specify the layout dimensions of the page? It would be a very useful feature.