FreeUKGen / SummerOfCodeImages

Base images and gold standard datasets for Summer of Code projects involving computer vision and image classification/segmentation.
Apache License 2.0

Probate Parsing Solution #8

Open shahsaumya opened 6 years ago

shahsaumya commented 6 years ago

Refer to issue #7.

I propose an end-to-end system that extracts the text from probate books and seeds it into a database as entities such as name, county, date, and relationships. The system breaks down into three phases:

  1. Text extraction using Optical Character Recognition
  2. Named Entity Recognition using Language Processing
  3. Database Seeding based on the entities generated
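
To make the three phases concrete, here is a minimal sketch of how they might chain together. It is not the implementation from this proposal: the function names, the `probate_entries` table, and the sample record are all hypothetical, and phases 1 and 2 are stubbed with canned values so the example runs without an OCR engine or an NER model installed. Only the database seeding step does real work, using SQLite from the Python standard library as an assumed stand-in for the target database.

```python
import sqlite3


def extract_text(image_path):
    # Phase 1 placeholder (hypothetical): a real version would run OCR on the
    # scanned page. Canned text keeps the sketch runnable without OCR.
    return "John Smith of Lancashire died 12 March 1865."


def recognize_entities(text):
    # Phase 2 placeholder (hypothetical): hand-written entities, not the
    # output of an actual named entity recognizer.
    return {"name": "John Smith", "county": "Lancashire", "date": "12 March 1865"}


def seed_database(conn, entities):
    # Phase 3: store one row per probate entry. The schema is an assumption.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS probate_entries "
        "(name TEXT, county TEXT, date TEXT)"
    )
    conn.execute(
        "INSERT INTO probate_entries (name, county, date) "
        "VALUES (:name, :county, :date)",
        entities,
    )
    conn.commit()


def run_pipeline(image_path, conn):
    # Chain the three phases: image -> text -> entities -> database row.
    text = extract_text(image_path)
    entities = recognize_entities(text)
    seed_database(conn, entities)
    return entities


conn = sqlite3.connect(":memory:")
run_pipeline("page_001.png", conn)  # hypothetical file name
rows = conn.execute("SELECT name, county, date FROM probate_entries").fetchall()
print(rows)
```

Keeping each phase behind its own function means the OCR engine or the recognizer can be swapped without touching the seeding code.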

Because there are too few samples to train a custom named entity recognizer, I've used the Stanford NER wrapper and NLTK to produce the results.
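
A token-level tagger such as the Stanford NER wrapper in NLTK returns a list of `(token, label)` pairs, so the tokens still have to be merged into whole entities before seeding. The sketch below shows one possible way to do that. The `group_entities` helper and the sample tags are hypothetical and hand-written (no tagger is actually called here), and the label set assumes a model that emits `PERSON`, `LOCATION`, and `DATE`, with `O` for tokens outside any entity.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    # Merge runs of consecutive tokens that share a label into one entity.
    # Tokens labelled "O" (outside any entity) are skipped.
    entities = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == "O":
            continue
        entities.append((label, " ".join(token for token, _ in group)))
    return entities


# Hand-written sample in the (token, label) shape a tagger would return.
tagged = [
    ("John", "PERSON"), ("Smith", "PERSON"), ("of", "O"),
    ("Lancashire", "LOCATION"), ("died", "O"),
    ("12", "DATE"), ("March", "DATE"), ("1865", "DATE"),
]

entities = group_entities(tagged)
print(entities)
```

One limitation of this simple grouping: two distinct entities with the same label that sit directly next to each other are merged into one, so real probate text may need an extra splitting rule.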