Option to markup split tokens in CamelCaseTokenSegmenter

dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.

https://dkpro.github.io/dkpro-core

Option to markup split tokens in CamelCaseTokenSegmenter #1180

Closed. mjunsilo closed this issue 6 years ago.

mjunsilo commented 6 years ago

Once the CamelCaseTokenSegmenter has split the tokens, the information about whether an item was split because of camel case notation is lost. Identifying the split items afterwards requires at least one additional traversal that applies the same logic to the covered text, so it would be nice if that part of the text were optionally annotated. This could be a predefined annotation type, or a custom annotation type specified through analysis engine parameters. The latter would not require introducing any new core types and could be implemented quickly. Although this is a small component that I could quickly write myself, supporting something like this would make the DKPro component more generally applicable. I prefer reusing code, even for something small.

reckart commented 6 years ago

A configurable type on e.g. CamelCaseTokenSegmenter sounds like it could be useful. I'm not sure at the moment if/how this would be generalizable. E.g. if a token is split by multiple "subtokenizers" applied in a sequence...

mjunsilo commented 6 years ago

I was also just thinking of an optional configurable type on CamelCaseTokenSegmenter that only produces an extra annotation on the covered area where the split token was/is. It wouldn't change the existing functionality a bit.
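To make the proposal concrete, here is a minimal plain-Java sketch of the idea, with no UIMA dependency. It is not the CamelCaseTokenSegmenter implementation: the class `CamelCaseSplitSketch`, its `Span` and `Result` types, and the simple lower-to-upper split rule are all hypothetical and chosen only for illustration. It shows that the span of a split token can be recorded during the split itself, so no second traversal is needed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the DKPro Core implementation. */
class CamelCaseSplitSketch {

    /** A character span [begin, end) in the document text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    /** Collects the resulting sub-tokens and one markup span per split token. */
    static final class Result {
        final List<Span> tokens = new ArrayList<>();
        final List<Span> markup = new ArrayList<>();
    }

    /**
     * Splits the token covering [begin, end) of text at every transition from
     * a lower-case to an upper-case character (an assumed, simplified rule).
     * If at least one split happens, the original span is added to
     * result.markup; unsplit tokens produce no markup.
     */
    static void split(String text, int begin, int end, Result result) {
        int start = begin;
        boolean wasSplit = false;
        for (int i = begin + 1; i < end; i++) {
            boolean boundary = Character.isLowerCase(text.charAt(i - 1))
                    && Character.isUpperCase(text.charAt(i));
            if (boundary) {
                result.tokens.add(new Span(start, i));
                start = i;
                wasSplit = true;
            }
        }
        result.tokens.add(new Span(start, end));
        if (wasSplit) {
            // The optional extra annotation over the area of the original token.
            result.markup.add(new Span(begin, end));
        }
    }

    public static void main(String[] args) {
        String text = "call getFirstToken now";
        Result result = new Result();
        split(text, 0, 4, result);   // "call"
        split(text, 5, 18, result);  // "getFirstToken"
        split(text, 19, 22, result); // "now"
        System.out.println("tokens: " + result.tokens);
        System.out.println("markup: " + result.markup);
    }
}
```

In an actual analysis engine the markup list would correspond to annotations of the configured type added to the CAS, while the token output stays exactly as it is today.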

mjunsilo commented 6 years ago

I would like to do a quick draft that you can comment on. Can I base it on the 1.9.x branch and then merge that part into 2.0.x later? I'm not sure whether you intend to publish another 1.9.x release anytime soon, but I prefer to build on a stable branch for now, and we can then do an internal snapshot release ourselves.

reckart commented 6 years ago

Basing the PR on the 1.9.x branch is a good idea :)

mjunsilo commented 6 years ago

Here is my draft version:

https://github.com/mjunsilo/dkpro-core/commit/8ef34c7057a2b2b468958e8cc1d1d4d2580d36e7

Some quick questions: Should there be an additional output type? Should I instead use a String for the markupType parameter and look up the class?

reckart commented 6 years ago

Best create a PR - it facilitates the review process.

The parameter should be a string, and instead of using reflection you should use the CAS API, which looks something like this:

CAS cas = jcas.getCas();
AnnotationFS annotation = cas.createAnnotation(CasUtil.getType(cas, typeName), begin, end);
cas.addFsToIndexes(annotation);

mjunsilo commented 6 years ago

This is the branch:

https://github.com/dkpro/dkpro-core/compare/master...mjunsilo:mjuric/camel-case-tokenizer-annotations