Open munikarmanish opened 5 years ago
Hi, @munikarmanish !
You're correct. The OCR currently only works for pre-processed images.
While it does extract data from PDFs with tables, it currently performs a horizontal scan and doesn't do any table-based classification of the text yet. I'm still trying to figure out how to make that work.
A make-do solution could be to classify the text after extraction based on the length of columns, but that will only work if every column has a fixed number of words, which is not the case in most scenarios.
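To illustrate the make-do idea and why it breaks, here is a minimal sketch in plain Python. It is not code from this project: the function name `split_fixed_columns` and the `words_per_column` layout are hypothetical, and it assumes each table row arrives as one line of raw OCR text.

```python
def split_fixed_columns(line, words_per_column):
    """Split one OCR text line into cells, assuming a fixed word count per column.

    Returns None when the line does not match the assumed layout.
    """
    words = line.split()
    if len(words) != sum(words_per_column):
        return None  # the fixed-length assumption does not hold for this line
    cells, start = [], 0
    for count in words_per_column:
        cells.append(" ".join(words[start:start + count]))
        start += count
    return cells


# Hypothetical layout: name (2 words), age (1 word), city (1 word).
layout = [2, 1, 1]
print(split_fixed_columns("John Smith 42 Paris", layout))   # ['John Smith', '42', 'Paris']
print(split_fixed_columns("Mary Ann Lee 37 Oslo", layout))  # None
```

The second row fails only because the name has three words, which is the kind of variation most real tables have.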
The way to do this is to use code to detect the table structure (columns and rows) and then perform the OCR within the table. It's a really hard problem, though.
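One simplified way to sketch the row and column detection is shown below. It swaps in a different ordering from the one described above: it assumes OCR has already produced word bounding boxes (for example the `left`, `top`, `width`, `height`, `text` fields that Tesseract-style tools report) and recovers the table structure from their geometry afterwards. The `WordBox` type, the function names, and the `y_tolerance` and `min_gap` thresholds are all hypothetical, and it only handles clean, unrotated tables without merged cells.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    text: str
    left: int
    top: int
    width: int
    height: int


def group_rows(boxes, y_tolerance=10):
    """Group boxes into rows: a box joins the current row if its top is close to the row's first box."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b.top, b.left)):
        if rows and abs(box.top - rows[-1][0].top) <= y_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b.left) for row in rows]


def column_starts(boxes, min_gap=20):
    """Return left edges of columns: x positions preceded by a gap of at least min_gap that no word covers."""
    spans = sorted((b.left, b.left + b.width) for b in boxes)
    starts, reach = [], None
    for left, right in spans:
        if reach is None or left - reach >= min_gap:
            starts.append(left)
        reach = right if reach is None else max(reach, right)
    return starts


def to_table(boxes, y_tolerance=10, min_gap=20):
    """Assign every word to a (row, column) cell and join the words in each cell."""
    starts = column_starts(boxes, min_gap)
    table = []
    for row in group_rows(boxes, y_tolerance):
        cells = [[] for _ in starts]
        for box in row:
            # Column index = last column start at or before the word's left edge.
            col = max(i for i, s in enumerate(starts) if s <= box.left)
            cells[col].append(box.text)
        table.append([" ".join(cell) for cell in cells])
    return table


# Made-up coordinates for a two-row table.
sample = [
    WordBox("John", 0, 10, 40, 12), WordBox("Smith", 45, 10, 50, 12),
    WordBox("42", 150, 10, 20, 12), WordBox("Paris", 220, 10, 50, 12),
    WordBox("Mary", 0, 40, 40, 12), WordBox("Ann", 45, 40, 30, 12),
    WordBox("Lee", 80, 40, 30, 12), WordBox("37", 150, 40, 20, 12),
    WordBox("Oslo", 220, 40, 40, 12),
]
print(to_table(sample))
# [['John Smith', '42', 'Paris'], ['Mary Ann Lee', '37', 'Oslo']]
```

Unlike splitting by word count, this keeps "Mary Ann Lee" in one cell because the columns come from whitespace gaps shared across all rows. Skewed scans, ruled tables with tight spacing, and cells that span several columns would need a real table detector.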
Hi @munikarmanish, did you find anything regarding the research you mentioned above?
Yes, I've found a few interesting approaches:
I am also facing the above issue. Has anyone found a good solution after 2 years?
Right now, it only seems to perform OCR, i.e., convert an image to raw text. Is there any table-specific extraction performed? Basically, I'm researching good algorithms for extracting tabular data from scanned documents.
Thanks in advance. :)