WZBSocialScienceCenter / pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
Apache License 2.0

Not able to create vertical lines and recognize clusters #11

Closed skadambala closed 6 years ago

skadambala commented 6 years ago

I ran catalog_30s.py on one of my PDFs, which has some text at the top and bottom and a two-column table in the center, as in the screenshot below.
image

I changed these parameters in the script: N_COL_BORDERS = 3 and MIN_COL_WIDTH = 687

The output was

page 1: detecting lines in image file 'data/sample.pdf-1_1.png'...
found 38 lines
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines-orig.png'
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines.png'
WARNING:root:no vertical lines found
no page rotation / skew found
found 0 clusters
Traceback (most recent call last):
  File "sample.py", line 140, in <module>
    img_w_clusters = iproc_obj.draw_line_clusters(imgproc.DIRECTION_VERTICAL, vertical_clusters)
  File "build/bdist.macosx-10.12-intel/egg/pdftabextract/imgproc.py", line 395, in draw_line_clusters
ZeroDivisionError: integer division or modulo by zero

Why is the script not able to recognise the vertical lines? What could be the issue?

internaut commented 6 years ago

You should get acquainted with the parameters of OpenCV's Hough transform and experiment with the hough_votes_thresh parameter of the detect_lines method (see the example); setting it lower will probably detect more lines. The canny_* parameters can also help, but a lower value of hough_votes_thresh should be enough. Another note: MIN_COL_WIDTH should be the approximate minimum expected column width in pixels, measured in the scanned page image. I guess your left column's width is smaller, isn't it?
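To illustrate why a lower votes threshold finds more lines, here is a minimal sketch of Hough-style voting in plain Python. It is not pdftabextract's or OpenCV's implementation: it is restricted to exactly vertical and horizontal lines, and the function name `axis_aligned_hough` and the sample pixel coordinates are made up for this example.

```python
from collections import Counter


def axis_aligned_hough(edge_pixels, votes_thresh):
    """Simplified Hough voting restricted to vertical and horizontal lines.

    Each edge pixel (x, y) votes for the vertical line through x and the
    horizontal line through y. A line is reported only if it collects at
    least `votes_thresh` votes. (Hypothetical helper, for illustration only.)
    """
    vertical_votes = Counter(x for x, _ in edge_pixels)
    horizontal_votes = Counter(y for _, y in edge_pixels)
    vertical = sorted(x for x, v in vertical_votes.items() if v >= votes_thresh)
    horizontal = sorted(y for y, v in horizontal_votes.items() if v >= votes_thresh)
    return vertical, horizontal


# Made-up page: one long horizontal rule (300 px) and one short
# vertical column border (60 px).
edge_pixels = {(x, 50) for x in range(100, 400)} | {(250, y) for y in range(50, 110)}

strict = axis_aligned_hough(edge_pixels, votes_thresh=200)
relaxed = axis_aligned_hough(edge_pixels, votes_thresh=40)

print("threshold 200:", strict)   # only the long horizontal rule passes
print("threshold 40:", relaxed)   # the short vertical border passes too
```

The short vertical border collects only 60 votes, so it is dropped at a threshold of 200 and kept at 40. A real Hough transform votes over many angles, but the effect of the threshold on short lines is the same.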

skadambala commented 6 years ago

@internaut Thanks for your reply.

The MIN_COL_WIDTH value is the width of the left column in pixels, which I measured with GIMP's measurement tool.

Sure, I will experiment with lowering the value of hough_votes_thresh.

Is it possible to extract tables like this one, which has only horizontal lines and no vertical lines? To the human eye, it looks like a table and can be read as one. Can the script help to extract such tables?

page5_pageimageonly pdf-1_1 copy

internaut commented 6 years ago

When there are no column borders, they can of course not be detected by computer vision algorithms such as the Hough transform. You'll probably have to use the distribution of the x/y coordinates of the text boxes to find regularities (the boxes will cluster around certain x-positions) and detect the columns from those.
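A rough sketch of that idea, assuming each text box is a dict with a "left" key holding its x-coordinate in pixels. The helper names `cluster_1d` and `estimate_column_positions`, the gap and size thresholds, and the sample coordinates are all hypothetical and would need tuning for a real page:

```python
def cluster_1d(values, max_gap):
    """Group 1-D values; a new cluster starts whenever the gap to the
    previous (sorted) value exceeds `max_gap`."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def estimate_column_positions(text_boxes, max_gap=10, min_boxes=3):
    """Return one x-position per cluster of left edges that contains at
    least `min_boxes` text boxes (smaller clusters are treated as noise)."""
    lefts = [box["left"] for box in text_boxes]
    columns = []
    for cluster in cluster_1d(lefts, max_gap):
        if len(cluster) >= min_boxes:
            # cluster is already sorted; take the middle element as the
            # representative position
            columns.append(cluster[len(cluster) // 2])
    return columns


# Made-up left edges: two columns near x=100 and x=420, plus a stray
# header box at x=250 that belongs to neither column.
text_boxes = [{"left": x} for x in (98, 100, 101, 103, 250, 418, 420, 423)]

column_positions = estimate_column_positions(text_boxes)
print(column_positions)
```

The stray box at x=250 forms a cluster of size one and is discarded, so only the two dense groups of left edges are reported as column starts.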