atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Fails with OpenCV out of memory error. #454

Closed Deweywsu closed 3 years ago

Deweywsu commented 3 years ago

When I attempt to open a single-page PDF with one table, it fails every time, with OpenCV reporting that I'm out of memory.

The text of the error is:

Error: OpenCV(4.5.1)
C:\Users\appveyor\AppData\Local\Temp\1\pip-req-build-k5srx1ap\opencv\modules\core\src\alloc.cpp:73 error: 
(-4:Insufficient memory) Failed to allocate 538560000 bytes in 
function 'cv::OutOfMemoryError'

The strange thing is that Camelot works fine on other PDF files. The original, larger file had 4 very complex tables and also failed. I cut it down to something extremely simple, blank except for 1 small table, and it still failed. I even output an image of the original and created a totally new PDF from that image (through Acrobat). I then tried editing the font, but also no luck. The new, smaller file is attached here.

I'm wondering if there is an error in the structure of the PDF itself. I've tried Acrobat Pro's "Preflight" analysis functions, but no luck as yet.

I have a Windows 10 machine running at 2.9 GHz with 8 GB of memory. I don't get out-of-memory errors with any other apps. I have 32-bit Python 3.9.2, OpenCV 4.5.1.48, NumPy 1.20.1, and camelot-py 0.8.2. I have attempted to reinstall OpenCV, which also reinstalled NumPy.
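The 32-bit Python build may matter more than the 8 GB of RAM. A 32-bit process on Windows normally gets only about 2 GB of user address space by default, so a single contiguous allocation of roughly half a gigabyte can fail even when plenty of physical memory is free. Here is a minimal sketch for confirming which build is running. The helper name `pointer_bits` is made up for illustration.

```python
import struct
import sys


def pointer_bits() -> int:
    """Return the pointer width of the running interpreter, in bits."""
    # struct format "P" is a native void*: 4 bytes on 32-bit, 8 on 64-bit builds.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    bits = pointer_bits()
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version} is a {bits}-bit build")
    if bits == 32:
        # The address space of the process, not installed RAM, is the limit here.
        print("32-bit process: large single allocations may fail")
```

If this reports 32 bits, switching to a 64-bit Python might be worth trying, although that was not tested in this thread.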

probelm pdf.pdf

Deweywsu commented 3 years ago

I couldn't believe that software like Camelot, which had already worked so well, wouldn't be able to handle such a simple file. There had to be something wrong with the file itself.

Indeed, it seems that the file's dimensions were approx 44 x 34 inches. It looked like OpenCV attempted to allocate a significant amount of memory to prepare for its work, but then hit my computer's maximum. I tried to open the file in Word to correct contrast levels, thinking it was an issue where the table's lines were too dim, but Word failed, saying there were "too many pages" in the file. This turns out to be Word's default error whenever a pdf page exceeds 22 inches.
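The page size lines up with the number in the error message. As a rough check, assume the page is rasterized at 300 DPI (Camelot's documented default `resolution` for the lattice flavor) and that the failing buffer uses 4 bytes per pixel (an assumption, not something confirmed from OpenCV's internals). The function name `raster_bytes` is made up for illustration.

```python
def raster_bytes(width_in: float, height_in: float,
                 dpi: int = 300, bytes_per_pixel: int = 4) -> int:
    """Estimate the size of one raster buffer for a page of the given size.

    dpi=300 and bytes_per_pixel=4 are assumptions chosen for this estimate.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    # 44 x 34 inch page: 13200 x 10200 pixels at 300 DPI.
    print(raster_bytes(44, 34))  # 538560000

    # A4 (about 8.27 x 11.69 inches) needs far less under the same assumptions.
    print(raster_bytes(8.27, 11.69))
```

Under these assumptions the 44 x 34 inch page comes to 538560000 bytes, the same figure OpenCV reported, while an A4 page needs only a small fraction of that.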

This gave me a hint, so I re-opened it with Acrobat Pro's "Preflight" feature and re-scaled it to A4 size. Camelot then had no trouble with it. I'm not sure what the maximum page dimension is that Camelot can handle, but if you are experiencing this error, try scaling the page size down.
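There is probably no single maximum page dimension, since the memory needed depends on both page size and rasterization resolution. A rough sketch follows for estimating the highest resolution that keeps one buffer under a chosen memory budget. The 4 bytes per pixel and the 100 MB budget are assumptions, and `max_dpi_for_budget` is a made-up helper name.

```python
import math


def max_dpi_for_budget(width_in: float, height_in: float,
                       budget_bytes: int, bytes_per_pixel: int = 4) -> int:
    """Largest whole DPI whose raster buffer fits within budget_bytes.

    Solves (width_in * dpi) * (height_in * dpi) * bytes_per_pixel <= budget_bytes
    for dpi. bytes_per_pixel=4 is an assumption, not a confirmed OpenCV detail.
    """
    area_sq_in = width_in * height_in
    return math.floor(math.sqrt(budget_bytes / (area_sq_in * bytes_per_pixel)))


if __name__ == "__main__":
    # Hypothetical budget of 100 MB for a 44 x 34 inch page.
    dpi = max_dpi_for_budget(44, 34, 100_000_000)
    print(dpi)
    # The result could be tried as the `resolution` argument of
    # camelot.read_pdf(path, flavor="lattice", resolution=dpi).
    # This was not tested on the file in this thread.
```

Lowering the resolution is an alternative to re-scaling the page, but it may make faint table lines harder to detect, so re-scaling to A4 as described above remains the approach that is known to have worked here.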