GateNLP / gateplugin-StringAnnotation

A plugin for the GATE language technology framework that provides gazetteer and regular expression annotator PRs for string annotation
https://gatenlp.github.io/gateplugin-StringAnnotation/
GNU Lesser General Public License v3.0
1 stars 1 forks source link

Rethink case sensitivity #7

Open johann-petrak opened 6 years ago

johann-petrak commented 6 years ago

Copied from: https://github.com/johann-petrak/gateplugin-StringAnnotation/issues/4

Consider: always store the actual case (if case-normalization is wanted, must be a preparation step for the list file). Then, if case-insensitive matching is required (then: a runtime parameter!!) use a parallel matching algorithm: for each character position, match the lower-case and upper-case version for all active matches in parallel. In theory, we could double the number of active matches at each position, but this will actually not happen and the number of active matches will be bounded by the maximum number of differently capitalized prefixes of a potential full match (or set of matches if we want to find all possible matches of any length). Then, actually create several annotations for each case-variation we matched, or just one annotation based on a preference setting (e.g. first, best case match ...?)

johann-petrak commented 6 years ago

Maybe an alternate approach would be to add per-entry information about what the actual case of the entry is: for example we could always assume all-lower case and encode the most common deviations from that in just a few bits (e.g. all upper case, title case). Only if the deviations are bigger we would store all alternate case variations with the entry. Another approach could be to have the normal trie with all-lowercase entries and just store the fact that alternate cases exist and that all-lowercase is NOT a match in a few bits. Then the alternate cases would be stored in a second compressed trie. In other words we would always build the normal non-casesensitive trie, but in addition we would also build a case-sensitive trie for those entries which are not all lowercase. If we match case sensitively, we would match the case-sensitive trie and the non-case sensitive one at the same time. If we get a non-lower case in the input, the non-case trie is abandoned. We are left we the case sensitive matches. If we match non case sensitive, we do the same but do not abandon the lowercase trie but only set the non-sensitive lower match flag. We end up with either matching the lower case entry exactly, or matching an alternatavie exactly, or matching the lower case entry case insensitively which we can then record in the annotation.