atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

The extracted table box coordinates do not correspond to the images converted from the PDF #486

Open SWHL opened 2 years ago

SWHL commented 2 years ago

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

Environment

Reproduction

Bug fix

import camelot
import copy
import cv2

def draw_bbox(img, start_point, end_point, ratio=1):
    start_point = tuple(map(lambda x: round(x * ratio), start_point))
    end_point = tuple(map(lambda x: round(x * ratio), end_point))
    cv2.rectangle(img, start_point, end_point, (0, 255, 0), 2)

pdf_path = 'foo.pdf'
tables = camelot.read_pdf(pdf_path, flavor='lattice', backend="poppler")
table = tables[0]

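# table._bbox is in PDF points (72 per inch), origin at the bottom-left of the page.
table_x0, table_y0, table_x1, table_y1 = table._bbox
# table._image[0] is the page image used by the lattice parser.
img = table._image[0]

# The image is rendered at 300 DPI, so one PDF point spans 300 / 72 pixels.
ratio = 300 / 72
new_tmp_img = copy.deepcopy(img)
pdf_height = img.shape[0] / ratio  # page height in PDF points
# Flip the y-axis: image coordinates start at the top-left corner.
draw_bbox(new_tmp_img,
          start_point=(table_x0, pdf_height - table_y0),
          end_point=(table_x1, pdf_height - table_y1),
          ratio=ratio)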
cv2.imwrite('foo_right.jpg', new_tmp_img)

LxYuan0420 commented 2 years ago

Curious to know how you got the exact value ratio = 300 / 72. Does it work for other PDFs?

SWHL commented 2 years ago

Answer to question 1:

camelot gets the box coordinates from the pdfminer package, whose default resolution is 72 (I forgot where I saw it). The image, however, is produced at the resolution used by the read_pdf function, whose default value is 300. https://github.com/atlanhq/camelot/blob/cd8ac7979fe3631866fe439f07e9d6aaa5b1e5c6/camelot/io.py#L93
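The relationship between the two resolutions can be sketched as a small helper. This is only a minimal sketch, not part of camelot: `pdf_to_image_bbox` and its parameters are hypothetical names, and it assumes the coordinates are in PDF points (72 per inch, origin at the bottom-left) while the page image was rendered at 300 DPI (origin at the top-left).

```python
PDF_DPI = 72      # PDF coordinates: 72 points per inch
IMAGE_DPI = 300   # assumed rendering resolution of the page image


def pdf_to_image_bbox(bbox, image_height_px, image_dpi=IMAGE_DPI):
    """Map (x0, y0, x1, y1) in PDF points to (left, top, right, bottom) in pixels.

    Hypothetical helper: scales by image_dpi / 72 and flips the y-axis,
    because PDF y grows upwards while image y grows downwards.
    """
    x0, y0, x1, y1 = bbox
    # Scale the x coordinates from points to pixels.
    left, right = sorted((x0 * image_dpi / PDF_DPI, x1 * image_dpi / PDF_DPI))
    # Scale the y coordinates and measure them from the top of the image.
    top, bottom = sorted((
        image_height_px - y0 * image_dpi / PDF_DPI,
        image_height_px - y1 * image_dpi / PDF_DPI,
    ))
    return tuple(round(v) for v in (left, top, right, bottom))


# A 612 x 792 pt page rendered at 300 DPI is 2550 x 3300 px.
print(pdf_to_image_bbox((72, 72, 540, 720), image_height_px=3300))
# -> (300, 300, 2250, 3000)
```

If the image was rendered at a different resolution, pass that value as `image_dpi` instead of relying on the assumed default.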

Answer to question 2:

You can try it on other PDFs.

baleris commented 1 year ago

@SWHL This really helped me understand the conversion. However, I have a similar problem: I have the coordinates of an object taken from a page image (the PDF page was converted into an image), and I want to convert them into camelot's PDF-level coordinates. I tried applying the logic above in reverse, without success. I am new to this; can anyone give some hints or logic for converting page-image coordinates to PDF-level coordinates? I have the object coordinates x0, y0, x1, y1 (from the page image), the page image width and height, and the target PDF width and height. Example: (x0, y0, x1, y1) = (188, 393, 1576, 1498); page image (height, width) = (3300, 2550); PDF (height, width) = (792, 612).

SWHL commented 1 year ago

@baleris You can try this:

\frac{2550}{612} = \frac{188}{x}  \rightarrow x?
\frac{3300}{792} = \frac{393}{y}   \rightarrow y?
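
The proportions above can be sketched in code as follows. This is a minimal sketch under stated assumptions, not camelot's own conversion: `image_to_pdf_bbox` is a hypothetical helper, the image is assumed to cover the whole page, and the y-axis is flipped because image coordinates start at the top-left while PDF coordinates start at the bottom-left. The output ordering (x_min, y_min, x_max, y_max) is also an assumption.

```python
def image_to_pdf_bbox(bbox_px, image_size_px, pdf_size_pt):
    """Map (x0, y0, x1, y1) in image pixels to PDF points.

    Hypothetical helper. image_size_px and pdf_size_pt are (width, height).
    Returns (x_min, y_min, x_max, y_max) with the origin at the bottom-left.
    """
    img_w, img_h = image_size_px
    pdf_w, pdf_h = pdf_size_pt
    sx = pdf_w / img_w  # points per pixel, horizontally
    sy = pdf_h / img_h  # points per pixel, vertically
    x0, y0, x1, y1 = bbox_px
    pdf_x0, pdf_x1 = sorted((x0 * sx, x1 * sx))
    # Flip the y-axis: measure from the bottom of the page instead of the top.
    pdf_y0, pdf_y1 = sorted((pdf_h - y0 * sy, pdf_h - y1 * sy))
    return (pdf_x0, pdf_y0, pdf_x1, pdf_y1)


# Values from the question: image is 2550 x 3300 px, PDF page is 612 x 792 pt.
bbox = image_to_pdf_bbox((188, 393, 1576, 1498), (2550, 3300), (612, 792))
print([round(v, 2) for v in bbox])
# -> [45.12, 432.48, 378.24, 697.68]
```

The scale factor here is 612 / 2550 = 792 / 3300 = 0.24, the inverse of 300 / 72.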

baleris commented 1 year ago

@SWHL, this did not work. When I checked the table coordinates detected by camelot, they were totally different. For example, for the coordinates mentioned above, camelot's corresponding coordinates are (72.0, 295.2, 563.04, 648.72).

baleris commented 1 year ago

@SWHL I see that in your solution above you get the page image from img = table._image[0]. If I have a borderless table and pass flavor='stream', i.e. camelot.read_pdf(src, flavor='stream'), how can I get the image in this case? If I try table._image[0] as above, I get an error message.

Any suggestions for getting the image with the "stream" flavor / borderless tables?

SWHL commented 1 year ago

You can refer to this: https://github.com/atlanhq/camelot/blob/cd8ac7979fe3631866fe439f07e9d6aaa5b1e5c6/tests/test_common.py#L35-L40

This question is beyond the scope of this issue. I suggest opening a new issue to discuss it.

baleris commented 1 year ago

@SWHL As suggested, I have opened a new issue: #497