FreeUKGen / FreeCENMigration

Issue tracking for the project migrating FreeCEN to the FreeCEN2 genealogy record database and search engine architecture. Code developed here is based on that developed in MyopicVicar.
https://www.freecen.org.uk
Apache License 2.0

Computer Vision: identify census records on images #393

Open benwbrum opened 6 years ago

benwbrum commented 6 years ago

Many online transcription tools require users to transcribe a single record from an image, with a direct linkage between the region of the image containing a record and the transcription form. The majority of tools accomplish this by asking humans to draw a rectangle around the record (the region of interest/ROI) on the image before transcription can start. Our volunteers would far prefer to avoid this step, as they find it a distraction from transcription, which they prefer to do in a mouse-free manner. If we had a tool which would take an image file (or URL to a file) and a parameter explaining the format of the records on the image (1861 Census Form, 1851 Census Form, etc) and would produce a list of bounding box coordinates for the record locations on that image, we could skip any drawing step and present the records directly to users.
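To illustrate the kind of interface described above, here is a minimal sketch of one classical (non-deep-learning) approach: segmenting a page into per-record strips using a horizontal projection profile. The function name, parameters, and box format are hypothetical, not part of any existing FreeCEN code; it assumes the image arrives as a grayscale NumPy array where 0 is black ink.

```python
import numpy as np

def record_bounding_boxes(image, dark_threshold=128, min_row_gap=3):
    """Split a grayscale page image (2-D array, 0 = black) into horizontal
    record strips by finding blank gaps between inked rows.
    Returns a list of (top, left, bottom, right) boxes, one per record."""
    ink = image < dark_threshold           # boolean mask of inked pixels
    row_has_ink = ink.any(axis=1)          # projection profile onto rows
    boxes, start, gap = [], None, 0
    for y, has_ink in enumerate(row_has_ink):
        if has_ink:
            if start is None:
                start = y                  # a new record strip begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_row_gap:         # a blank band ends the record
                boxes.append((start, 0, y - gap + 1, image.shape[1]))
                start = None
    if start is not None:                  # record touching the bottom edge
        boxes.append((start, 0, len(row_has_ink), image.shape[1]))
    return boxes
```

A real census page would need the format parameter (1851 vs 1861 form, etc.) to choose between fixed ruled-line templates and a learned detector, but the output shape (a list of boxes per image) could stay the same either way.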

benwbrum commented 6 years ago

TODO: add sample inputs and outputs for census records.

In addition to the above, a flag (or probability) that the region of interest contains ink would be invaluable as an output.
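For the "contains ink" flag, one crude baseline is the fraction of dark pixels in the region, rescaled into a 0-to-1 score. This is a hypothetical sketch, not part of the issue's proposal: the function name and the 5% "fully inked" cutoff are assumptions, and a trained classifier would likely do better on faint pencil marks.

```python
import numpy as np

def ink_probability(region, dark_threshold=128, full_ink_fraction=0.05):
    """Crude 'contains ink' score for a grayscale region (0 = black):
    the fraction of dark pixels, rescaled so that a region with 5% or
    more dark pixels scores 1.0 and a blank region scores 0.0."""
    dark_fraction = float((region < dark_threshold).mean())
    return min(1.0, dark_fraction / full_ink_fraction)
```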

offthewallace commented 6 years ago

Hi @benwbrum, I saw this idea listed for Google Summer of Code. I am wondering: is there a standard format for the input files? Once the format is standardized and there are enough samples, a model can be trained to produce those bounding box coordinates.

zzzhacker commented 6 years ago

Hi @benwbrum, I am interested in this idea. I get the general goal, but it would be more helpful if you could provide some example input and output images to clarify things exactly. Right now I am thinking of solving the problem with something like YOLO object detection, but with regions of text in place of objects.

iamgroot42 commented 6 years ago

Hi @benwbrum This idea seems really interesting! Is the list of formats available anywhere (for reference)? Depending on the amount and granularity of data available, different pipelines could be constructed to solve this problem. Also, it would be great if you could provide a sample output (its format and details) to guide the construction of such a model :D

abhiML commented 6 years ago

Hi @benwbrum I did a similar project on bill receipt data, where I used a deep learning model to identify the different fields of data and draw a bounding box around each. I also used another model to perform OCR on the boxed text. I believe the task can be accomplished given enough training data and a fixed number of fields.

Konsang commented 6 years ago

Hi @benwbrum!

Really interesting idea! I have a few questions and hope you can clarify them:

1) Is the project research-oriented, e.g. finding the best model? Is the data currently hosted on a platform like Kaggle?
2) Are there any baseline models for us to compare our results against?
3) Does it involve other functionality, such as integration with other software or a nice GUI?

Thanks!

benwbrum commented 6 years ago

See sample data and a description of the data at https://github.com/FreeUKGen/SummerOfCodeImages/issues/3

mnishant2 commented 6 years ago

I agree this can be done very efficiently with deep learning, provided there is enough data. I have also, in the past, used traditional computer vision to do a similar task on a bank's forms with only a couple of scanned images, so I am sure this is achievable. I wanted to know whom to contact about submitting my Summer of Code proposal for this project.

richpomfret commented 5 years ago

@benwbrum @PatReynolds we completed this as part of the last GSoC. However, I wonder if we might want to develop a tool as a next step based on that past project? To discuss.

richpomfret commented 5 years ago

Actually this is a separate task, which was partially worked on and now needs review. @benwbrum to review.