Free UK Genealogy will be launching a new project to expose genealogical information from wills and probate books. These books record the date and location of people's deaths, their occupations, and often the same information about the family members that executed the wills.
In previous projects, all this material was transcribed manually by volunteers, as the source documents were handwritten. The probate books are different, however, in that they are printed and thus are accessible to OCR. We should be able to use OCR text to seed a database by parsing the text for names, dates, occupations, and relationships. We should also be able to use OCR bounding box coordinates to associate regions of a scanned page with an entry for presentation to researchers or for correction by volunteers.
Raw OCR directory containing OCR output from tesseract against the images in the source images directory.
Gold-standard CSV file containing manually-extracted entries, their bounding boxes, and the participants in each entry. This is the kind of output we're looking for, though we're interested in occupations as well.
I have already done most of the work on this issue based on the data given and will submit a Pull Request in a few hours, giving my solution to the problem. Can I be assigned this issue? Thank you.
Free UK Genealogy will be launching a new project to expose genealogical information from wills and probate books. These books record the date and location of people's deaths, their occupations, and often the same information about the family members that executed the wills.
In previous projects, all this material was transcribed manually by volunteers, as the source documents were handwritten. The probate books are different, however, in that they are printed and thus are accessible to OCR. We should be able to use OCR text to seed a database by parsing the text for names, dates, occupations, and relationships. We should also be able to use OCR bounding box coordinates to associate regions of a scanned page with an entry for presentation to researchers or for correction by volunteers.
Sample data for this project:
tesseract
against the images in the source images directory.