dstl / baleen

Entity Extraction Text Processor
Apache License 2.0

Changes to the way regular expressions work in CleanPunctuation and T… #22

Closed jle123 closed 8 years ago

jle123 commented 8 years ago

This branch contains some minor regex updates from the CCD-DE project. It has two separate parts:

  1. CleanPunctuation.java can now accept a dash ("-") at either the beginning or end of an entity. An example of where this is useful is in accepting negative values such as -$30.00.
  2. TimeRegex.java now only accepts times between 0:00 and 24:59:59; both 0 and 24 are accepted for midnight. The class can now recognise times separated by dots (".") as well as colons (":"). Additionally, the xxxxh and xxxxhours notations are now accepted, so "0700h" will be recognised as a time entity. These updates honour DSTL's own changes to this class for Baleen 2.1.
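
For illustration only, the behaviour described in the two points above could be sketched roughly as follows. This is not the code from this branch: the class name `RegexSketch`, the methods `trimPunctuation` and `isTime`, and both patterns are hypothetical, written only from the description above. The sketch assumes that a leading or trailing dash and currency symbols are kept while other surrounding punctuation is stripped, and that mixed separators such as "07:30.15" are tolerated.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; names and patterns are assumptions, not Baleen's actual code.
class RegexSketch {

    // Strip leading characters that are not letters, digits, currency symbols or a dash.
    private static final Pattern LEADING =
        Pattern.compile("^[^\\p{L}\\p{N}\\p{Sc}\\-]+");

    // Strip trailing characters that are not letters, digits, currency symbols or a dash.
    private static final Pattern TRAILING =
        Pattern.compile("[^\\p{L}\\p{N}\\p{Sc}\\-]+$");

    // First alternative: h:mm or h:mm:ss with ":" or "." separators, hours 0 to 24.
    // Second alternative: four digits followed by "h" or "hours", e.g. "0700h".
    private static final Pattern TIME = Pattern.compile(
        "(?:[01]?\\d|2[0-4])[:.][0-5]\\d(?:[:.][0-5]\\d)?"
            + "|(?:[01]\\d|2[0-4])[0-5]\\dh(?:ours)?");

    // Removes surrounding punctuation but keeps a dash at either end of the entity.
    static String trimPunctuation(String text) {
        String withoutLeading = LEADING.matcher(text).replaceFirst("");
        return TRAILING.matcher(withoutLeading).replaceFirst("");
    }

    // Returns true only if the whole string is a time in one of the accepted forms.
    static boolean isTime(String text) {
        return TIME.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(trimPunctuation("(-$30.00),"));
        System.out.println(isTime("24:59:59"));
        System.out.println(isTime("07.30"));
        System.out.println(isTime("0700h"));
        System.out.println(isTime("25:00"));
    }
}
```

In this sketch `isTime` uses `matches()`, so it checks a whole candidate string; an annotator scanning running text would more likely use `find()` with word boundaries around the pattern.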