doxakis / form-segmentation

Let's explore how we can extract text from forms
MIT License

Help and suggestions #3

Closed nishantbhat closed 6 years ago

nishantbhat commented 6 years ago

I can't figure out how to extract a single line of handwritten text from the form. Could you please help me work out how to do that?

doxakis commented 6 years ago

Hi,

this repo was more of an exploration to evaluate some of the options on the market.

Unfortunately, I didn't find any viable solution for my use case. So, I created a new tool which would focus on a specific segmentation method.

Most of our documents used joined frames. The tool supports only joined frames right now, but I'm open to looking at other frame types as well. Joined frames were easier to support on unknown documents and on poor-quality scans.

I initially thought that I could clean the paper-based form and then use Tesseract to extract the text. I got poor results and didn't feel confident enough to put it in production.

I found a paper that describes a way to detect fields. (https://github.com/doxakis/ICR-detection-in-filled-form/blob/master/papers/ICR%20Detection%20in%20filled%20form%20and%20form%20removal.pdf)

The idea is to use line junctions to compose the fields, and to increase confidence when we find enough adjacent cells.
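To make the junction idea more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not the code from the paper or from OpenFieldReader: the function names `compose_cells` and `group_fields`, the `min_cells` threshold, and the assumption that junction coordinates are already snapped to exact pixel positions are all hypothetical.

```python
from collections import defaultdict


def compose_cells(junctions):
    """Build candidate cells (left, top, right, bottom) from junction points.

    Assumption: junctions are (x, y) tuples already snapped to exact
    coordinates. This sketch does not check that a drawn line actually
    joins the four corners, which a real detector would need to verify.
    """
    points = set(junctions)
    rows = defaultdict(list)
    for x, y in points:
        rows[y].append(x)
    ys = sorted(rows)

    cells = []
    for y_top in ys:
        xs = sorted(rows[y_top])
        # Each pair of neighbouring junctions on a row is a possible top edge.
        for left, right in zip(xs, xs[1:]):
            # Take the nearest row below that has both bottom corners.
            for y_bottom in ys:
                if (y_bottom > y_top
                        and (left, y_bottom) in points
                        and (right, y_bottom) in points):
                    cells.append((left, y_top, right, y_bottom))
                    break
    return cells


def group_fields(cells, min_cells=3):
    """Keep runs of horizontally adjacent cells with at least min_cells cells.

    A long run of cells sharing vertical edges is treated as stronger
    evidence of a real field than an isolated box.
    """
    by_band = defaultdict(list)
    for cell in cells:
        _, top, _, bottom = cell
        by_band[(top, bottom)].append(cell)

    fields = []
    for band_cells in by_band.values():
        band_cells.sort()  # left to right within the band
        run = [band_cells[0]]
        for cell in band_cells[1:]:
            if cell[0] == run[-1][2]:  # left edge meets previous right edge
                run.append(cell)
            else:
                if len(run) >= min_cells:
                    fields.append(run)
                run = [cell]
        if len(run) >= min_cells:
            fields.append(run)
    return fields


# Made-up junctions: a row of four joined boxes plus one isolated box.
junctions = [(x, y) for x in (0, 10, 20, 30, 40) for y in (0, 10)]
junctions += [(x, y) for x in (100, 110) for y in (50, 60)]

fields = group_fields(compose_cells(junctions))
for field in fields:
    print(len(field), "cells from", field[0], "to", field[-1])
```

With this made-up input, the four joined boxes are kept as one field and the isolated box is dropped because its run is shorter than `min_cells`.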

I created this repo: https://github.com/doxakis/ICR-detection-in-filled-form

I did a lot of experiments (C++, Python, C#), and the end result has finally been moved to https://github.com/OpenFieldReader/OpenFieldReader.

I prepared a Linux package which contains the C# code to find the fields. I plan to write a Python wrapper, but I haven't taken the time yet.

I will continue the discussion, with more details on how to use it, in the issue you opened here: https://github.com/OpenFieldReader/OpenFieldReader/issues/1

doxakis commented 6 years ago

Hi, I will close the issue for now. Feel free to reopen it if you have any questions.