cseas / ocr-table

Extract tables from scanned image PDFs using Optical Character Recognition.

MIT License

261 stars, 64 forks

Extracting table data? #1

Open munikarmanish opened 5 years ago

munikarmanish commented 5 years ago

Right now, it only seems to perform OCR, i.e., convert an image to raw text. Is there any table-specific extraction performed? Basically, I'm researching good algorithms for extracting tabular data from scanned documents.

Thanks in advance. :)

cseas commented 5 years ago

Hi, @munikarmanish!

You're correct. The OCR currently only works for pre-processed images.

While it does extract data from PDFs with tables, it currently performs a horizontal scan and doesn't perform any table-based classification on the text yet. I'm still trying to figure out how to make that work.

A make-do solution could be to classify the text after extraction based on the length of columns, but that will only work if every column has a fixed number of words, which is not the case in most scenarios.
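
To make the limitation concrete, here is a minimal sketch of that make-do approach. It is not code from this repo: the function name `split_row_by_word_counts` and the sample rows are hypothetical, and it assumes each OCR'd line is one table row whose columns always contain a known, fixed number of words.

```python
def split_row_by_word_counts(line, words_per_column):
    """Split one OCR'd text line into cells.

    Assumption (hypothetical helper, not part of ocr-table): column i always
    holds exactly words_per_column[i] whitespace-separated words.
    """
    words = line.split()
    if len(words) != sum(words_per_column):
        # The fixed-length assumption is violated, so we cannot tell
        # where one column ends and the next begins.
        raise ValueError(f"expected {sum(words_per_column)} words, got {len(words)}")
    cells = []
    start = 0
    for count in words_per_column:
        cells.append(" ".join(words[start:start + count]))
        start += count
    return cells


# Made-up rows: a two-word name, a one-word age, a one-word city.
rows = ["John Smith 42 Kathmandu", "Jane Doe 37 Pokhara"]
table = [split_row_by_word_counts(row, [2, 1, 1]) for row in rows]
```

A row such as "Mary Ann Lee 29 Biratnagar" has five words instead of four, so it raises `ValueError`, which is exactly why this only works when every column has a fixed number of words.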

aribornstein commented 5 years ago

The way to do this is to use code to do table detection (columns and rows) and then perform the OCR within the table. It's a really hard problem, though.
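
As a rough illustration of "detect rows and columns, then OCR inside them", here is a minimal sketch using projection profiles. It swaps in a simple whitespace-gap heuristic rather than any particular published method, and it assumes a clean, deskewed, binarized image given as a list of rows of 0/1 pixels (1 = ink) with no ruling lines. All names (`find_bands`, `detect_cells`, `extract_table`) are hypothetical, and `ocr` is a stand-in callable you would replace with a wrapper around a real OCR engine.

```python
def find_bands(profile, min_gap=1):
    """Return (start, end) index ranges where the profile has ink.

    Bands separated by fewer than min_gap empty positions are merged, so
    small gaps (e.g. between words in one cell) do not split a column.
    """
    bands = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            bands.append((start, i))
            start = None
    if start is not None:
        bands.append((start, len(profile)))

    merged = []
    for band in bands:
        if merged and band[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], band[1])
        else:
            merged.append(band)
    return merged


def detect_cells(image, min_gap=1):
    """Return a grid of (top, bottom, left, right) boxes, one per cell."""
    row_profile = [sum(row) for row in image]        # ink per pixel row
    col_profile = [sum(col) for col in zip(*image)]  # ink per pixel column
    row_bands = find_bands(row_profile, min_gap)
    col_bands = find_bands(col_profile, min_gap)
    return [[(top, bottom, left, right) for left, right in col_bands]
            for top, bottom in row_bands]


def extract_table(image, ocr, min_gap=1):
    """Crop each detected cell and run the supplied ocr callable on it."""
    return [[ocr([row[left:right] for row in image[top:bottom]])
             for top, bottom, left, right in row_cells]
            for row_cells in detect_cells(image, min_gap)]


# Tiny made-up "image": two text rows and two columns separated by blank space.
image = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
]
cells = detect_cells(image)
# Stand-in for OCR: just count the ink pixels in each crop.
ink_counts = extract_table(image, ocr=lambda crop: sum(map(sum, crop)))
```

Real scans are where it gets hard: skew, noise, merged cells, and ragged columns all break the assumption that rows and columns are separated by perfectly empty bands.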

jaysinghr commented 5 years ago

Hi @munikarmanish, did you find anything regarding the research you mentioned above?

munikarmanish commented 5 years ago

> Hi @munikarmanish, did you find anything regarding the research you mentioned above?

Yes, I've found a few interesting approaches:

SAIVENKATARAJU commented 2 years ago

I am also facing the above issue. Has anyone found a good solution after 2 years?