OpenSextant / Xponents

Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.
Apache License 2.0

Trivial "Do Do" false-positives #54

Open mubaldino opened 4 years ago

mubaldino commented 4 years ago

Describe the bug: "Do. Do", "do. Do", "in Do", etc. are still common false positives.

To Reproduce: Xponents 3.3

Expected behavior: Better filtering of these. Likely use a spaCy NER model to provide POS tags and eliminate obvious errors.
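A rough sketch of the POS-tag idea, to make the proposal concrete. It assumes the tagger output is already available as one universal POS tag per token (for example from spaCy's `token.pos_`). The function name, the tag set, and the match layout are all hypothetical and are not part of Xponents.

```python
# Hypothetical sketch: names and data layout are illustrative, not Xponents APIs.
# Universal POS tags that a real place name is unlikely to carry.
NON_PLACE_POS = {"AUX", "VERB", "PRON", "ADP", "DET", "PART", "CCONJ", "SCONJ", "INTJ"}


def filter_by_pos(matches, pos_tags):
    """Drop matches whose tokens are all tagged as verbs or function words.

    matches:  list of (text, start_token, end_token) spans, end exclusive.
    pos_tags: one universal POS tag per token, e.g. from spaCy's token.pos_.
    """
    kept = []
    for text, start, end in matches:
        tags = pos_tags[start:end]
        if tags and all(tag in NON_PLACE_POS for tag in tags):
            continue  # e.g. "do" tagged AUX or VERB: very unlikely to be a place
        kept.append((text, start, end))
    return kept


# Tokens: What / do / you / do / in / Paris / ?
pos_tags = ["PRON", "AUX", "PRON", "VERB", "ADP", "PROPN", "PUNCT"]
matches = [("do", 1, 2), ("do", 3, 4), ("Paris", 5, 6)]
result = filter_by_pos(matches, pos_tags)
```

Only a match made up entirely of such tags is dropped, so a multi-word name containing one function word would survive.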

mubaldino commented 4 years ago

Add "text_norm" to the indexer to review common false positives that still appear.

mubaldino commented 4 years ago

Addressed in part by NonSenseFilter -- removing lowercase matches.
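For illustration only, a minimal sketch of what "removing lowercase matches" could look like. This is not the actual NonSenseFilter code, and the function name is made up.

```python
def drop_lowercase_matches(matches):
    """Discard candidate place names whose matched text has no uppercase letter."""
    return [m for m in matches if any(ch.isupper() for ch in m)]


candidates = ["do", "in", "Do", "Dover"]
kept = drop_lowercase_matches(candidates)
```

A capitalized "Do" still passes a case check like this, which would be consistent with the issue being addressed only in part.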

mubaldino commented 2 years ago

Seems more like a set of gazetteer ETL fixes than a pattern generalization. If such trivial gazetteer entries should never be tagged, then we mark them search_only=1.
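A sketch of that ETL step, assuming gazetteer rows are plain dicts with a `name` field and a curated stop list of trivial names. Only the `search_only` flag comes from this thread; the field name `name`, the stop list, and the function are hypothetical.

```python
# Hypothetical stop list of names that should never be tagged.
TRIVIAL_NAMES = {"do", "in", "to"}


def mark_search_only(entries, trivial=TRIVIAL_NAMES):
    """Return copies of the entries, setting search_only=1 on trivial names."""
    marked = []
    for entry in entries:
        row = dict(entry)  # copy so the input rows are left untouched
        if row["name"].strip().lower() in trivial:
            row["search_only"] = 1
        else:
            row.setdefault("search_only", 0)
        marked.append(row)
    return marked


rows = [{"name": "Do"}, {"name": "Dover"}]
marked_rows = mark_search_only(rows)
```

Flagging in the ETL keeps the entry in the gazetteer while excluding it from tagging, so no extra pattern rule is needed at tagging time.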