RegexEntityRecognizer should not clean the text

thiagodp commented 7 years ago

In RegexEntityRecognizer the text is "cleaned" and converted to lowercase before being processed by the regex. This breaks case-sensitive regexes, for example. IMO, the text should not be modified before being checked by the regex, so it is better not to call Bravey.Text.clean( text ) on it.
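
A minimal sketch of the problem. It does not use Bravey itself: `cleanText` is a hypothetical stand-in that only assumes the cleaning step lowercases the text, strips diacritics and collapses whitespace, which may differ from what Bravey.Text.clean actually does.

```javascript
// Hypothetical stand-in for the cleaning step (an assumption, not Bravey's code):
// lowercase, strip diacritics, collapse runs of whitespace.
function cleanText(text) {
  return text
    .toLowerCase()
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// A case-sensitive regex: three upper-case letters, such as a currency code.
const caseSensitive = /\b[A-Z]{3}\b/;
const input = "Pay 10 USD";

// Against the raw text the regex matches "USD".
const matchesRaw = caseSensitive.test(input);

// Against the cleaned text ("pay 10 usd") there is nothing upper-case left to match.
const matchesCleaned = caseSensitive.test(cleanText(input));

console.log({ matchesRaw, matchesCleaned });
```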

thiagodp commented 7 years ago

Oh, it looks like the text is also cleaned by Bravey.Nlp.Fuzzy.test() before being given to RegexEntityRecognizer.

BraveyJS commented 7 years ago

RegexEntityRecognizer and the other entity recognizers are designed to be usable stand-alone as much as possible, without depending on an NLP object, so you can use just what you need in your chatbot. That's why you've found Bravey.Text.clean( text ) in two places that often run in sequence, and in most of the other entity recognizers. You can find some of these stand-alone usages in the unit tests.

RegexEntityRecognizer is intended mostly for matching parts of text via regexp and converting them to machine-readable data via a callback, like the language-specific DateEntityRecognizer, TimeEntityRecognizer ... It works on a cleaned string in order to simplify the regexp definition and its callback: since double spaces, diacritics, case and so on are cleaned, you can ignore them when creating your regexp and reduce the number of cases the callback has to handle.
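
To illustrate the idea, here is a rough sketch that does not use Bravey itself. `cleanText` is a hypothetical stand-in that assumes cleaning means lowercasing, stripping diacritics and collapsing whitespace; the real Bravey.Text.clean may behave differently.

```javascript
// Hypothetical stand-in for the cleaning step (an assumption, not Bravey's code).
function cleanText(text) {
  return text
    .toLowerCase()
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// One simple lower-case, single-space, accent-free pattern...
const pattern = /\bcafe latte\b/;

// ...covers variants that differ in case, spacing and diacritics.
const variants = ["Caf\u00e9  Latte", "CAFE LATTE", "  caf\u00e9 latte "];

const cleanedResults = variants.map((v) => pattern.test(cleanText(v)));
const rawResults = variants.map((v) => pattern.test(v));

console.log({ cleanedResults, rawResults });
```

Without the cleaning step, the same regexp would need to account for each of those variations explicitly.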

Anyway, your point about case-sensitive regexps still stands. We can either make a brand-new, stricter entity recognizer that handles the text as-is, or add a constructor argument as you originally proposed. What do you think?
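
For the constructor-argument option, a rough sketch of the idea could look like the following. Everything in it is hypothetical: `SketchRegexRecognizer`, the `strict` option, `cleanText` and the shape of the returned objects are illustrative names only and are not part of Bravey's API.

```javascript
// Hypothetical stand-in for the cleaning step (an assumption, not Bravey's code).
function cleanText(text) {
  return text
    .toLowerCase()
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Illustrative recognizer, not the real RegexEntityRecognizer.
class SketchRegexRecognizer {
  constructor(entityName, { strict = false } = {}) {
    this.entityName = entityName;
    this.strict = strict; // when true, the text is matched as-is
    this.matchers = [];
  }

  addMatch(regex, callback) {
    this.matchers.push({ regex, callback });
  }

  getEntities(text) {
    // The only behavioral difference: skip cleaning in strict mode.
    const subject = this.strict ? text : cleanText(text);
    const found = [];
    for (const { regex, callback } of this.matchers) {
      // matchAll needs the global flag; copy the regex so the caller's is untouched.
      const flags = regex.flags.includes("g") ? regex.flags : regex.flags + "g";
      for (const m of subject.matchAll(new RegExp(regex.source, flags))) {
        found.push({
          entity: this.entityName,
          string: m[0],
          position: m.index,
          value: callback(m),
        });
      }
    }
    return found;
  }
}

// Usage: the same case-sensitive regex with and without strict mode.
const currencyCode = /\b[A-Z]{3}\b/;

const strictRec = new SketchRegexRecognizer("currency", { strict: true });
strictRec.addMatch(currencyCode, (m) => m[0]);

const defaultRec = new SketchRegexRecognizer("currency");
defaultRec.addMatch(currencyCode, (m) => m[0]);

const strictEntities = strictRec.getEntities("Pay 10 USD or 5 eur");
const defaultEntities = defaultRec.getEntities("Pay 10 USD or 5 eur");

console.log(strictEntities, defaultEntities);
```

Keeping cleaning as the default would leave existing recognizers unchanged, while strict mode opts out for regexps that depend on case or exact spacing.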

thiagodp commented 7 years ago

A strict entity recognizer would be great. Thanks.

BraveyJS / Bravey

RegexEntityRecognizer should not clean the text #6