Closed: mjunsilo closed this 6 years ago
A configurable type on e.g. CamelCaseTokenSegmenter sounds like it could be useful. I'm not sure at the moment if/how this would be generalizable. E.g. if a token is split by multiple "subtokenizers" applied in a sequence...
I was also just thinking of an optional configurable type on CamelCaseTokenSegmenter that only produces an extra annotation on the covered area where the split token was/is. It wouldn't change the existing functionality a bit.
I'd like to do a quick draft that you can comment on. Can I base it on the 1.9.x branch and then merge that part into 2.0.x later? I'm not sure if you intend to publish another 1.9.x release anytime soon, but I prefer to build on a stable branch for now, and we can then do an internal snapshot release ourselves.
Basing the PR against the 1.9.x is a good idea :)
Here is my draft version:
https://github.com/mjunsilo/dkpro-core/commit/8ef34c7057a2b2b468958e8cc1d1d4d2580d36e7
Some quick questions: Should there be an additional output type? Should I instead use a String for the markupType parameter and look up the class?
Best to create a PR, since it facilitates the review process.
The parameter should be a string, and instead of using reflection you should use the CAS API, which looks something like this:
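import org.apache.uima.cas.CAS;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.fit.util.CasUtil;
CAS cas = jcas.getCas(); // the CAS underlying the JCas
AnnotationFS annotation = cas.createAnnotation(CasUtil.getType(cas, typeName), begin, end); // type resolved by name
cas.addFsToIndexes(annotation); // make the annotation visible in the indexes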
The information about whether an item was split due to camel case notation is lost once the CamelCaseTokenSegmenter has split the tokens. Recovering it requires at least one additional traversal that applies the same logic to the covered text to identify the items that were split, so it would be nice if that part of the text could optionally be annotated. This could be a predefined annotation type, or a custom annotation type specified via analysis engine parameters. The latter would not require introducing any new core types and could be implemented quickly. Although this is a small component that I could quickly write myself, supporting something like this would make the DKPro component more generally applicable. I prefer reusing code, even for something small.
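To make the idea concrete, here is a minimal, dependency-free sketch of what is being asked for: split a token at camel case boundaries and, only if a split actually happened, also report one extra span covering the original token. The class and method names (`CamelCaseSpans`, `split`, `markup`) are hypothetical, and the split rule (a lowercase letter followed by an uppercase letter) is a simplifying assumption, not necessarily the rule the actual CamelCaseTokenSegmenter applies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core.
final class CamelCaseSpans {

    /** Half-open character span [begin, end) in document offsets. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    /**
     * Splits the token text at every lowercase-to-uppercase transition
     * (assumed rule). Offsets are shifted by tokenBegin so they refer to
     * the document rather than to the token text.
     */
    static List<Span> split(String text, int tokenBegin) {
        List<Span> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < text.length(); i++) {
            boolean boundary = Character.isLowerCase(text.charAt(i - 1))
                    && Character.isUpperCase(text.charAt(i));
            if (boundary) {
                parts.add(new Span(tokenBegin + start, tokenBegin + i));
                start = i;
            }
        }
        if (start < text.length()) {
            parts.add(new Span(tokenBegin + start, tokenBegin + text.length()));
        }
        return parts;
    }

    /**
     * Returns the span covering the original token if it was split into
     * two or more parts, otherwise null (nothing to mark up).
     */
    static Span markup(List<Span> parts) {
        if (parts.size() < 2) {
            return null;
        }
        return new Span(parts.get(0).begin, parts.get(parts.size() - 1).end);
    }
}
```

In a real component, each part would become a token annotation, and the span returned by `markup` would be turned into an annotation of the configured type. Tokens that were not split would produce no extra annotation, so existing behaviour stays unchanged when the parameter is not set.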