Closed by skadambala 6 years ago
You should get acquainted with the parameters of OpenCV's Hough transform and probably experiment with the hough_votes_thresh parameter of the detect_lines method (see the example), i.e. probably set it lower in order to detect more lines. The canny_* parameters can also be helpful, but a lower value of hough_votes_thresh should be enough.
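To illustrate why a lower votes threshold finds more lines, here is a toy sketch in pure Python. It is not OpenCV's implementation and not part of the script: it is a Hough transform restricted to vertical lines (theta = 0), and the function name detect_vertical_lines and the tiny test image are made up for this example.

```python
def detect_vertical_lines(image, votes_thresh):
    """Return the x positions of vertical lines that collect enough votes.

    image is a list of rows, each a list of 0/1 values (1 = edge pixel).
    With theta fixed at 0, every edge pixel at (x, y) votes for the line
    x = rho, so the accumulator is just an edge-pixel count per column.
    """
    width = len(image[0])
    votes = [0] * width
    for row in image:
        for x, pixel in enumerate(row):
            if pixel:
                votes[x] += 1
    # Only accumulator cells that reach the threshold count as lines.
    return [x for x, n in enumerate(votes) if n >= votes_thresh]


# Hypothetical 10x8 edge image: a solid border at x=2, a broken one at x=6.
HEIGHT, WIDTH = 10, 8
page = [[0] * WIDTH for _ in range(HEIGHT)]
for y in range(HEIGHT):
    page[y][2] = 1  # solid border: one vote per row, 10 votes in total
for y in range(0, HEIGHT, 2):
    page[y][6] = 1  # faint or broken border: only 5 votes

strict = detect_vertical_lines(page, votes_thresh=8)   # misses the faint line
lenient = detect_vertical_lines(page, votes_thresh=4)  # finds both lines
```

A faint, short, or interrupted border collects fewer votes, so it only shows up once the threshold drops below its vote count. Setting the threshold too low has the opposite problem: text and noise start to register as lines.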
Another note: MIN_COL_WIDTH should be the approximate minimum expected column width in pixels, measured in the scanned page image. I guess your left column's width is smaller, isn't it?
@internaut Thanks for your reply.
MIN_COL_WIDTH is the width of the left column in pixels, which I measured with GIMP's measurement tool.
Sure, I will experiment with lowering the value of hough_votes_thresh.
Is it possible to extract tables like this one, which has only horizontal lines and no vertical lines? To the human eye it looks like a table and can be read as one. Can the script help to extract such tables?
When there are no column borders, they of course cannot be detected by computer vision algorithms such as the Hough transform. You'll probably have to use the distribution of the x/y coordinates of the text boxes to find regularities (i.e. they will cluster together around certain x-positions) and detect the columns from those.
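A minimal sketch of that idea, assuming you already have the left x-coordinates of the text boxes as plain numbers. The function names, the break_dist and min_members parameters, and the sample coordinates are all made up for illustration; this is simple gap-based 1D clustering, not a function taken from the library.

```python
def cluster_positions(xs, break_dist):
    """Group 1D positions, starting a new cluster when a gap exceeds break_dist."""
    clusters = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= break_dist:
            clusters[-1].append(x)  # close to the previous value: same cluster
        else:
            clusters.append([x])    # large gap (or first value): new cluster
    return clusters


def estimate_column_starts(left_edges, break_dist=15, min_members=3):
    """Return the mean x position of every sufficiently populated cluster."""
    clusters = cluster_positions(left_edges, break_dist)
    # Small clusters are likely stray text (headings, page numbers), not columns.
    return [sum(c) / len(c) for c in clusters if len(c) >= min_members]


# Hypothetical left edges of text boxes: two columns plus one stray box at 250.
left_edges = [100, 102, 98, 100, 400, 403, 397, 400, 250]
column_starts = estimate_column_starts(left_edges)
```

Here the values around 100 and 400 each form a cluster, while the single box at 250 is dropped for having too few members. Both break_dist and min_members would need tuning to the page resolution and the amount of text per column.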
I ran catalog_30s.py on one of my PDFs, which has some text at the top and bottom and a two-column table in the center, as in the screenshot below.
I changed these parameters in the script: N_COL_BORDERS = 3, MIN_COL_WIDTH = 687
The output was
page 1: detecting lines in image file 'data/sample.pdf-1_1.png'...
Why is the script not able to recognise the vertical lines? What could be the issue?