Introduce "form" feature on tokens (1.12.0)

reckart commented 6 years ago

This is a follow-up of #953

Add a new feature to the Token to represent the "form" of a token. Normally, this is the underlying text. However, a tokenizer may choose to set this feature differently to establish a basic normalization without having to materialize this normalization in the underlying text. Tokenizers that do context-sensitive normalization in particular might profit from this, e.g. PTB quote normalization, where the left/right context of the quote needs to be taken into account to identify it as an opening or closing quote.

In order to save space, it would be conceivable to implement a custom getter for this that returns getCoveredText() if the feature is not explicitly set. Likewise, the setter would set the feature internally to null if the form corresponds to getCoveredText().
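
A minimal sketch of the space-saving getter/setter idea, to make the intended behavior concrete. `SketchToken`, `setText()` and `hasStoredForm()` are hypothetical names used only for illustration; this is plain Java without UIMA and not the actual Token implementation.

```java
// Hypothetical stand-in for the Token annotation; not the DKPro Core class.
final class SketchToken {
    private final String documentText;
    private final int begin;
    private final int end;
    // Stored form; null means "identical to the covered text".
    private String form;

    SketchToken(String documentText, int begin, int end) {
        this.documentText = documentText;
        this.begin = begin;
        this.end = end;
    }

    String getCoveredText() {
        return documentText.substring(begin, end);
    }

    // Getter: fall back to the covered text when no form is stored.
    String getText() {
        return form != null ? form : getCoveredText();
    }

    // Setter: store nothing when the form equals the covered text.
    void setText(String text) {
        form = getCoveredText().equals(text) ? null : text;
    }

    // Exposed only so the space saving can be observed in this sketch.
    boolean hasStoredForm() {
        return form != null;
    }
}
```

With this shape, a tokenizer that does not normalize never pays for a second copy of the text, while a PTB-style tokenizer can set a differing form on a quote token without touching the document text.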

[ ] switch segmenters to provide token form when generating token
[ ] add option to SegmenterBase to suppress setting of form
[ ] add it in the UML diagrams in the type system documentation

Open questions:

How to deal with cases where we currently call e.g. Sentence.getCoveredText() or NamedEntity.getCoveredText() which bypass the tokens and go directly to the CAS text? -- Probably the covered text should be used... not really sure yet.
When using writers such as the CoNLL writers, should the CAS text be written or the text from the tokens? -- That should be configurable via a PARAM_WRITE_COVERED_TEXT which should be off by default. If the user uses a normalizing segmenter, that should by default be respected.
What to pass to the ML algorithm in trainer components? -- That should be configurable via a PARAM_USE_COVERED_TEXT which should be off by default. If the user uses a normalizing segmenter, that should by default be respected.
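
To illustrate the writer question, here is a rough sketch of how such a switch could behave. `ColumnWriterSketch`, `Tok` and the two-column output are assumptions made for illustration; only the idea of a flag like PARAM_WRITE_COVERED_TEXT comes from the proposal above, and the default value is deliberately left to the caller.

```java
import java.util.List;

// Hypothetical writer sketch; the flag mirrors the proposed PARAM_WRITE_COVERED_TEXT.
final class ColumnWriterSketch {
    // Minimal token: covered text plus an optional form (null if not set).
    static final class Tok {
        final String coveredText;
        final String form;

        Tok(String coveredText, String form) {
            this.coveredText = coveredText;
            this.form = form;
        }
    }

    private final boolean writeCoveredText;

    ColumnWriterSketch(boolean writeCoveredText) {
        this.writeCoveredText = writeCoveredText;
    }

    // Text for the FORM column of one token.
    String formColumn(Tok token) {
        if (writeCoveredText || token.form == null) {
            return token.coveredText;
        }
        return token.form;
    }

    // One "ID<TAB>FORM" line per token, IDs starting at 1.
    String write(List<Tok> tokens) {
        StringBuilder out = new StringBuilder();
        int id = 1;
        for (Tok token : tokens) {
            out.append(id++).append('\t').append(formColumn(token)).append('\n');
        }
        return out.toString();
    }
}
```

The same pattern would presumably apply to a PARAM_USE_COVERED_TEXT switch in trainer components.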

Jibun commented 6 years ago

UPF-TALN is very interested in this enhancement because it is essential, for instance, for clitic segmentation. We already implemented a fix for the CoNLL writers, but it does not follow the approach outlined in the open questions. Is this approach stable enough to be implemented? If so, we can rewrite our fix and open a pull request for it.

In addition, we have found that the MateTools Lemmatizer and Morph Tagger are also affected by this issue. Both use the JCasUtil.toText(Iterable iterable) function, which returns the covered text instead of the forms. We fixed this by replacing the call to JCasUtil.toText(Iterable iterable) with a separate loop over all tokens that calls token.getText() to obtain the forms.
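
The difference can be sketched roughly as follows. This is plain Java, not our actual patch: `CliticToken` and `TaggerInput` are hypothetical stand-ins, and the split of Spanish "del" into "de" + "el" is only an illustrative segmentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal token: covered text plus an optional normalized form.
final class CliticToken {
    private final String coveredText;
    private final String form; // null when no form was set

    CliticToken(String coveredText, String form) {
        this.coveredText = coveredText;
        this.form = form;
    }

    String getCoveredText() {
        return coveredText;
    }

    String getText() {
        return form != null ? form : coveredText;
    }
}

final class TaggerInput {
    // What a covered-text helper hands to the tagger.
    static List<String> coveredTexts(List<CliticToken> tokens) {
        List<String> result = new ArrayList<>();
        for (CliticToken token : tokens) {
            result.add(token.getCoveredText());
        }
        return result;
    }

    // The replacement: an explicit loop that asks each token for its form.
    static List<String> forms(List<CliticToken> tokens) {
        List<String> result = new ArrayList<>();
        for (CliticToken token : tokens) {
            result.add(token.getText());
        }
        return result;
    }
}
```

For "Vengo del mar" segmented as Vengo / de / l / mar, with the form "el" set on the third token, the first method passes "l" to the tagger while the second passes "el".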

reckart commented 6 years ago

@Jibun how did you handle the CoNLL files?

Jibun commented 6 years ago

@reckart We changed CoNLL writers to use getText() instead of getCoveredText()

reckart commented 6 years ago

@Jibun that's what the "open questions" section suggests, with the addition that it should be possible to switch back to using getCoveredText() by setting a to-be-introduced parameter :)

I'm not sure what you mean by "stable enough". It is somewhat difficult to tell whether switching to getText() everywhere will cause trouble for some users, and whether it may be necessary to introduce a parameter to control whether getText() or getCoveredText() is used.

I assume that using getText() is the best option in general and that there are probably few cases where such a parameter is needed. Maybe it should also be done in the writers by default - I wasn't sure here. My problem is that I have hit this issue mainly in discussions with users, but never in actual data I process. What is your opinion? Should getText() always be used and getCoveredText() basically never?

Jibun commented 6 years ago

@reckart We totally agree that getText() should be the default action.

That said, we would not like to drop the getCoveredText() option, so adding a PARAM_WRITE_COVERED_TEXT parameter seems the right thing to do. For backward compatibility, however, we would vote for it to be on by default.

reckart commented 6 years ago

@Jibun fine by me. I thought it might confuse users if it is on by default, because the POS/lemma/whatnot might deviate from what the user would expect when seeing only the covered text and not the normalized text. But that's just an intuition. I gather you have collected some actual experience by now and have better insight into this.

dkpro / dkpro-core

Introduce "form" feature on tokens (1.12.0) #1168