dstl / baleen

Entity Extraction Text Processor
Apache License 2.0

Normalize telephone numbers #20

Closed jle123 closed 8 years ago

jle123 commented 8 years ago

Telephone numbers were being stored as doubles in the database. This change stores them as strings instead, which avoids a trailing ".0" and stops them being rendered in scientific notation. To enable this, add regex.Telephone, cleaners.CleanPunctuation and cleaners.NormalizeTelephoneNumbers (with a parameter "prefix: ") to the pipeline file, in that order.
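To illustrate the problem and the intended result, here is a minimal, standalone Java sketch. It is not the code in this change: `TelephoneSketch`, `stripPunctuation` and `normalize` are hypothetical names, and the way `prefix` is applied (replacing a single leading national "0") is an assumption made for the example only.

```java
// Standalone sketch; class and method names are hypothetical, not Baleen's API.
class TelephoneSketch {

    // Keep digits only, preserving a leading '+' if the raw value had one.
    static String stripPunctuation(String raw) {
        String trimmed = raw.trim();
        String digits = trimmed.replaceAll("[^0-9]", "");
        return trimmed.startsWith("+") ? "+" + digits : digits;
    }

    // ASSUMPTION: the prefix replaces one leading national '0'.
    // Numbers already in international form are returned unchanged.
    static String normalize(String raw, String prefix) {
        String cleaned = stripPunctuation(raw);
        if (cleaned.startsWith("+")) {
            return cleaned;
        }
        if (cleaned.startsWith("0")) {
            return prefix + cleaned.substring(1);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Stored as a double, a long number is printed in scientific notation
        // and a short one gains a trailing ".0".
        System.out.println(Double.parseDouble("441234567890"));
        System.out.println(Double.parseDouble("1234567"));

        // Stored as a string, the digits survive exactly as normalized.
        System.out.println(normalize("01234 567890", "+44"));
    }
}
```

Keeping the value as a string end to end means the database never has a chance to reinterpret it as a number, which is why the cleaners run after the regex annotator rather than at write time.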

This change also includes supporting changes to Entity, Entity_Type and semantic_type_system.xml so that isNormalized() can be accessed on the Entity type. Some further changes were needed elsewhere to stop this causing errors. One of these is SharedDocumentCheckerResource, which is used by many of the other branches in the CCD-DE project.