FreeUKGen / SummerOfCodeImages

Base images and gold standard datasets for Summer of Code projects involving computer vision and image classification/segmentation.
Apache License 2.0

Probate Parsing #7

Open benwbrum opened 6 years ago

benwbrum commented 6 years ago

Free UK Genealogy will be launching a new project to expose genealogical information from wills and probate books. These books record the date and location of people's deaths, their occupations, and often the same information about the family members who executed the wills.

In previous projects, all this material was transcribed manually by volunteers, as the source documents were handwritten. The probate books are different, however, in that they are printed and thus amenable to OCR. We should be able to use the OCR text to seed a database by parsing it for names, dates, occupations, and relationships. We should also be able to use the OCR bounding box coordinates to associate regions of a scanned page with an entry, either for presentation to researchers or for correction by volunteers.
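As a rough illustration of the idea, here is a minimal Python sketch that parses one entry's OCR words for a death date and computes the page region the entry covers. Everything in it is an assumption made for illustration: the `Word` record (text plus a left/top/width/height box), the "died DAY MONTH YEAR" phrasing, the all-caps leading surname, and the sample entry, which is invented rather than taken from the project data. Splitting a page into entries is not shown.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with its bounding box in page pixels (hypothetical format)."""
    text: str
    left: int
    top: int
    width: int
    height: int


MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Assumed phrasing: "died 3 March 1891". Real entries may differ.
DEATH_DATE = re.compile(r"died\s+(\d{1,2})\s+(" + MONTHS + r")\s+(\d{4})",
                        re.IGNORECASE)


def union_box(words):
    """Smallest (left, top, width, height) box covering every word."""
    left = min(w.left for w in words)
    top = min(w.top for w in words)
    right = max(w.left + w.width for w in words)
    bottom = max(w.top + w.height for w in words)
    return (left, top, right - left, bottom - top)


def parse_entry(words):
    """Parse a non-empty list of words that belong to a single entry."""
    text = " ".join(w.text for w in words)
    match = DEATH_DATE.search(text)
    death_date = None
    if match:
        death_date = (int(match.group(1)), match.group(2), int(match.group(3)))
    # Assumption: an entry opens with the surname in capitals.
    first = words[0].text
    return {
        "text": text,
        "surname": first if first.isupper() else None,
        "death_date": death_date,
        "box": union_box(words),
    }


# Invented sample entry, two OCR lines tall.
sample = [
    Word("SMITH", 100, 200, 60, 20), Word("John", 170, 200, 40, 20),
    Word("died", 100, 225, 35, 20), Word("3", 140, 225, 10, 20),
    Word("March", 155, 225, 50, 20), Word("1891", 210, 225, 40, 20),
]
entry = parse_entry(sample)
print(entry)
```

The returned `box` is what a viewer could use to highlight the entry on the scanned page, and the parsed fields are what would seed the database for volunteers to correct.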

Sample data for this project:

shahsaumya commented 6 years ago

I have already done most of the work on this issue using the data provided, and I will submit a pull request with my solution in a few hours. Could this issue be assigned to me? Thank you.