ePADD / epadd

ePADD is a software package developed by Stanford University's Special Collections & University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives.
https://www.epaddproject.org

Entity improvements needed #198

Open hangal opened 6 years ago

hangal commented 6 years ago

Please collect entity improvements here.

From Fikes archive, under People:

  1. Be careful with entities that are in all caps. [image]

  2. Merge case variants and remove special characters at the end of an entity. (Note: an apostrophe or hyphen is allowed within an entity, but not at the end.) [image]

[image]

  3. It is not clear why "In II" should have higher confidence than "Donna Lawrence".

[image]
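Items 1 and 2 could be prototyped roughly as below. This is only a sketch in Python for illustration, not ePADD's actual code; the function names (`clean_entity`, `merge_key`, `merge_entities`, `preferred_form`) are hypothetical. It assumes "special chars" means any trailing character that is not a letter or digit.

```python
import re
from collections import defaultdict

# Assumption: "special chars" = any trailing run of non-alphanumeric characters.
_TRAILING_JUNK = re.compile(r"[\W_]+$")

def clean_entity(name: str) -> str:
    """Strip trailing special characters; internal apostrophes/hyphens are kept."""
    return _TRAILING_JUNK.sub("", name.strip())

def merge_key(name: str) -> str:
    """Case-insensitive, whitespace-normalized key for merging variants."""
    return " ".join(clean_entity(name).casefold().split())

def merge_entities(names):
    """Group raw entity strings by merge key (insertion order preserved)."""
    groups = defaultdict(list)
    for raw in names:
        key = merge_key(raw)
        if key:  # skip strings that were nothing but special characters
            groups[key].append(clean_entity(raw))
    return dict(groups)

def preferred_form(variants):
    """Pick a display form, preferring the first variant that is not all caps."""
    return min(variants, key=lambda v: v.isupper())
```

One caveat with this choice: a legitimate trailing period (as in "Jr.") is also stripped, so a real implementation might want an exception list.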

hangal commented 6 years ago

The University entity type also includes schools. We should clarify that.

Stop words at the beginning of an entity could be removed if the entity also has < 1 confidence (and especially if it is all caps). See below:

[four images]
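A minimal sketch of the stop-word rule, in Python for illustration only. It assumes confidence is a float where 1 is the maximum; the `STOP_WORDS` set and the function name are hypothetical and not taken from ePADD.

```python
# Hypothetical, deliberately small stop-word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "in", "at", "of", "to", "for", "from", "and", "on"}

def strip_leading_stop_words(name: str, confidence: float) -> str:
    """Drop leading stop words from an entity whose confidence is below 1."""
    if confidence >= 1.0:
        return name  # fully confident entities are left untouched
    tokens = name.split()
    # Remove stop words only from the front; interior ones ("University of X") stay.
    while tokens and tokens[0].casefold() in STOP_WORDS:
        tokens.pop(0)
    return " ".join(tokens)
```

An empty result means the entity consisted only of stop words and could be discarded. A stricter variant could apply the rule only when `name.isupper()` is true, matching the "especially if all caps" remark.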

The word "University" (or "High School") on its own should be in a kill list: [image]

hangal commented 6 years ago

Some words that should not be recognized as places when standalone: House, Lake, City, Green, Street, Bay, Point, Town, Line, Island, Mountain, Santa, Glacier, Fort, Man, Field, Beach, County, Empire, Highway, Forest, Moon, Camp, Light, Bank, North, South, East, West.

Some words that should not be recognized as companies when standalone: International.

Some words that should not be recognized as publications when standalone: Times.
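The standalone kill lists from this thread could be expressed as a per-type lookup, sketched here in Python for illustration. The type labels (`"place"`, `"company"`, `"publication"`, `"university"`) and the function names are hypothetical, not ePADD identifiers. Only an exact whole-entity match is rejected, so "Lake Tahoe" or "New York Times" would still pass.

```python
# Words taken from the comments in this issue; type labels are assumptions.
STANDALONE_KILL_LIST = {
    "place": {
        "house", "lake", "city", "green", "street", "bay", "point", "town",
        "line", "island", "mountain", "santa", "glacier", "fort", "man",
        "field", "beach", "county", "empire", "highway", "forest", "moon",
        "camp", "light", "bank", "north", "south", "east", "west",
    },
    "company": {"international"},
    "publication": {"times"},
    "university": {"university", "high school"},
}

def is_killed(name: str, entity_type: str) -> bool:
    """True if the whole entity (case-insensitive) is on its type's kill list."""
    key = " ".join(name.casefold().split())
    return key in STANDALONE_KILL_LIST.get(entity_type, set())

def filter_entities(entities):
    """Keep only (name, type) pairs that are not on a kill list."""
    return [(name, etype) for name, etype in entities if not is_killed(name, etype)]
```

Keeping the lists per type matters because a word such as "International" is only a problem as a company, not necessarily under other types.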