Refactor and simplify TokenGazetteer

GateNLP / python-gatenlp

Python text processing, pattern matching, and NLP framework

https://gatenlp.github.io/python-gatenlp/

Apache License 2.0


Refactor and simplify TokenGazetteer #109

Open johann-petrak opened 3 years ago

johann-petrak commented 3 years ago

The constructor is very complex right now. We need some way to specify/do all the things that can be done or decided at init time in a way that is easier to understand.

One part of constructing the gazetteer is dealing with the gazetteer list(s): unless the lists are already tokenized, the entries need to be tokenized first. This is also necessary when using legacy Java GATE def/lst files. Instead of doing this automatically, let's separate out that task and accept only already tokenized gazetteer lists when initializing. Once those are created, they can be pickled for fast loading later.
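A minimal sketch of that separate tokenization step, using only the standard library. The names `tokenize_entries`, `save_tokenized` and `load_tokenized` are hypothetical and not part of gatenlp, and the tokenizer is assumed to be any callable that maps a string to a list of token strings:

```python
import pickle
from typing import Callable, Iterable, List, Tuple

# Hypothetical type aliases for this sketch, not gatenlp types.
Tokenizer = Callable[[str], List[str]]
TokenizedEntry = Tuple[str, ...]


def tokenize_entries(entries: Iterable[str], tokenizer: Tokenizer) -> List[TokenizedEntry]:
    """Tokenize each raw gazetteer entry once, skipping entries without tokens."""
    tokenized = []
    for entry in entries:
        tokens = tuple(tokenizer(entry))
        if tokens:
            tokenized.append(tokens)
    return tokenized


def save_tokenized(entries: List[TokenizedEntry], path: str) -> None:
    """Pickle the already tokenized entries for fast loading later."""
    with open(path, "wb") as outfp:
        pickle.dump(entries, outfp)


def load_tokenized(path: str) -> List[TokenizedEntry]:
    """Load entries written by save_tokenized."""
    with open(path, "rb") as infp:
        return pickle.load(infp)


# Whitespace splitting stands in for a real tokenizer here.
entries = tokenize_entries(["New York", "  ", "Vienna"], str.split)
```

The gazetteer constructor would then only ever see the tokenized form, so the choice of tokenizer is made explicitly and exactly once.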

johann-petrak commented 3 years ago

We need to distinguish between the data structure and the annotator, maybe something like:

# one of
tokdata = SimpleTokenGazetteerData.from_gate_def("somefile.def", tokenizer=sometok, case_sensitive=False)
tokdata = SimpleTokenGazetteerData("ownpickledformatfile")
tokdata = SimpleTokenGazetteerData.from_string_list(stringlist, tokenizer=sometok)
# then
annotator = SimpleTokenGazetteerAnnotator(tokdata, outset=someset, ...)
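To make the split between data structure and annotator a bit more concrete, here is a rough, self-contained sketch. `TokenGazetteerData` and `TokenGazetteerMatcher` are hypothetical names, the trie layout is just one possible choice, and plain token lists stand in for gatenlp documents and annotation sets:

```python
from typing import Dict, List, Optional, Sequence, Tuple


class TokenGazetteerData:
    """Hypothetical data holder: a token trie built from tokenized entries."""

    def __init__(self, case_sensitive: bool = True):
        self.case_sensitive = case_sensitive
        self.root: Dict = {}

    def _norm(self, token: str) -> str:
        return token if self.case_sensitive else token.lower()

    def add(self, tokens: Sequence[str], label: str) -> None:
        node = self.root
        for token in tokens:
            node = node.setdefault(self._norm(token), {})
        node[None] = label  # the key None marks the end of an entry

    def longest_match(self, tokens: Sequence[str], start: int) -> Optional[Tuple[int, str]]:
        """Return (end index, label) of the longest entry starting at start, or None."""
        node = self.root
        best = None
        for i in range(start, len(tokens)):
            node = node.get(self._norm(tokens[i]))
            if node is None:
                break
            if None in node:
                best = (i + 1, node[None])
        return best


class TokenGazetteerMatcher:
    """Hypothetical annotator stand-in: holds the data and reports matches."""

    def __init__(self, data: TokenGazetteerData):
        self.data = data

    def __call__(self, tokens: Sequence[str]) -> List[Tuple[int, int, str]]:
        matches = []
        i = 0
        while i < len(tokens):
            found = self.data.longest_match(tokens, i)
            if found is None:
                i += 1
            else:
                end, label = found
                matches.append((i, end, label))
                i = end  # continue after the match, so matches do not overlap
        return matches


data = TokenGazetteerData(case_sensitive=False)
data.add(["New", "York"], "Location")
data.add(["New", "York", "Times"], "Organization")
matcher = TokenGazetteerMatcher(data)
matches = matcher("I read the new york times in York".split())
```

With this split, the data object can be built, pickled and reused independently, and the matcher only needs the few parameters that control how matches are reported.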

It is kind of ugly that this would mean each gazetteer annotator needs its own gazetteer data class. In theory we could nest the data class inside the annotator class, but that would cause problems with type hinting, as the needed types are not yet defined for methods that use the nested class.
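One possible reading of the type hinting problem, as a small sketch with hypothetical class names: while the outer class body is still executing, the outer name is not bound yet, so an unquoted annotation that refers to `TokenGazetteer.Data` can fail when annotations are evaluated at function definition time. Quoted (string) annotations are one common workaround:

```python
from typing import List


class TokenGazetteer:
    """Hypothetical annotator with its data class nested inside it."""

    class Data:
        def __init__(self, entries: List[str]):
            self.entries = entries

        # The annotations are quoted because the name TokenGazetteer is not
        # bound yet while this class body runs. The method body is fine,
        # since it only looks the name up when the method is called.
        def merged(self, other: "TokenGazetteer.Data") -> "TokenGazetteer.Data":
            return TokenGazetteer.Data(self.entries + other.entries)

    def __init__(self, data: "TokenGazetteer.Data"):
        self.data = data


first = TokenGazetteer.Data(["New York"])
second = TokenGazetteer.Data(["Vienna"])
gazetteer = TokenGazetteer(first.merged(second))
```

This works, but every signature that mentions the nested class needs the quoted form or a similar workaround, which is part of why separate top-level classes may be less awkward.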

johann-petrak commented 3 years ago

OK, this is a bit more complex, since we will eventually need a standard way of saving and serializing Annotators, and for that a single constructor is better. Also, we really save only a few parameters in most cases.

Also, maybe we should not have a default tokenizer, to make it clearer that the gazetteer list tokenizer should match the document tokenizer.
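A small sketch of what a single constructor without a default tokenizer could look like. `TokenGazetteerSketch` and its `contains` method are hypothetical, not the gatenlp API. Making `tokenizer` a required keyword-only argument means a caller cannot accidentally rely on a default that differs from the document tokenizer:

```python
from typing import Callable, Iterable, List, Tuple


class TokenGazetteerSketch:
    """Hypothetical gazetteer with one constructor and no default tokenizer."""

    def __init__(
        self,
        entries: Iterable[str],
        *,
        tokenizer: Callable[[str], List[str]],  # required, deliberately no default
        case_sensitive: bool = True,
    ):
        self.tokenizer = tokenizer
        self.case_sensitive = case_sensitive
        self.entries = {self._key(entry) for entry in entries}

    def _key(self, text: str) -> Tuple[str, ...]:
        # The same tokenizer is applied to entries and to looked-up text.
        tokens = self.tokenizer(text)
        if not self.case_sensitive:
            tokens = [token.lower() for token in tokens]
        return tuple(tokens)

    def contains(self, text: str) -> bool:
        return self._key(text) in self.entries


gaz = TokenGazetteerSketch(["New York"], tokenizer=str.split, case_sensitive=False)
found = gaz.contains("new   york")

try:
    TokenGazetteerSketch(["New York"])  # tokenizer omitted on purpose
    missing_tokenizer_rejected = False
except TypeError:
    missing_tokenizer_rejected = True
```

Omitting the tokenizer fails immediately with a `TypeError`, which makes the dependency on the document tokenizer hard to overlook.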