Open reckart opened 6 years ago
UPF-TALN is very interested in this enhancement because it is essential, for instance, for clitics segmentation. We already implemented a fix for the CoNLL writers, but it does not follow the "open questions" approach. Is this approach stable enough to be implemented? If so, we can rewrite our fix and issue a pull request for it.
On the other hand, we have found that MateTools Lemmatizer and Morph Tagger are also affected by this issue. They both use JCasUtil.toText(Iterable
@Jibun how did you handle the CoNLL files?
@reckart We changed CoNLL writers to use getText() instead of getCoveredText()
@Jibun that's what the "open questions" section suggests, with the addition that it should be possible to switch back to using getCoveredText() by setting a to-be-introduced parameter :)
I'm not sure what you mean by "stable enough". It is somewhat difficult to tell whether switching to getText() everywhere will cause trouble for some users, and whether it may be necessary to introduce a parameter to control whether getText() or getCoveredText() should be used.
I assume that using getText() is the best option in general and that there are probably few cases where such a parameter is needed. Maybe it should also be done in the writers by default; I wasn't sure here. My problem is that I have hit this issue mainly in discussions with users, but never in actual data I process. What is your opinion? Should getText() always be used and getCoveredText() basically never?
@reckart We totally agree that getText() should be the default action.
Anyway, we are not eager to discard the getCoveredText() option, so adding a PARAM_WRITE_COVERED_TEXT parameter seems the right thing to do. For backward compatibility, however, we would prefer it to be on by default.
@Jibun fine by me. I thought it might confuse users if it is on by default, because the POS/lemma/whatnot might deviate from what a user would expect when seeing only the covered text and not the normalized text. But that's just an intuition. I gather you have collected some actual experience by now and have better insight into this.
This is a follow-up of #953
Add a new feature to the Token to represent the "form" of a token. Normally, this is the underlying text. However, a tokenizer may choose to set this feature differently to establish a basic normalization without having to resort to actually materializing this normalization in the underlying text. In particular, tokenizers that do context-sensitive normalization might profit from this, e.g. PTB quote normalization, where the left/right context of the quote needs to be taken into account to identify it as an opening or closing quote.
In order to save space, it would be conceivable to implement a custom getter for this that returns getCoveredText() if the feature is not explicitly set. Likewise, the setter would set the feature internally to null if the form corresponds to getCoveredText().
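The space-saving getter/setter idea can be sketched roughly as follows. This is only an illustration under assumptions: SketchToken is a hypothetical stand-in class, not the actual DKPro Core Token type, and it assumes the fallback value is the text covered by the token's offsets.

```java
// Hypothetical stand-in for a token annotation; NOT the DKPro Core Token type.
final class SketchToken {
    private final String documentText;
    private final int begin;
    private final int end;
    // null means "the form is identical to the covered text"
    private String form;

    SketchToken(String documentText, int begin, int end) {
        this.documentText = documentText;
        this.begin = begin;
        this.end = end;
    }

    // The text between the token's offsets in the underlying document.
    String getCoveredText() {
        return documentText.substring(begin, end);
    }

    // Custom getter: fall back to the covered text if no form was set.
    String getText() {
        return form != null ? form : getCoveredText();
    }

    // Custom setter: store nothing if the form equals the covered text.
    void setText(String text) {
        form = (text == null || text.equals(getCoveredText())) ? null : text;
    }

    // Exposed only so the sketch can show that redundant forms are dropped.
    boolean hasExplicitForm() {
        return form != null;
    }
}

class FormFeatureDemo {
    public static void main(String[] args) {
        String doc = "He said \"hi\"";
        SketchToken open = new SketchToken(doc, 8, 9);
        // PTB-style normalization of an opening quote, kept out of the document text.
        open.setText("``");
        System.out.println(open.getCoveredText() + " -> " + open.getText());
    }
}
```

With this shape, tokens without normalization carry no extra string, and the document text itself is never modified.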
Open questions:
PARAM_WRITE_COVERED_TEXT, which should be off by default. If the user uses a normalizing segmenter, that should by default be respected.
PARAM_USE_COVERED_TEXT, which should be off by default. If the user uses a normalizing segmenter, that should by default be respected.
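How such a switch might behave in a writer can be sketched as follows. The class names, the two-column output, and the constructor flag are assumptions made for illustration; the flag only stands in for the proposed PARAM_WRITE_COVERED_TEXT and is not an existing DKPro Core API. The sketch deliberately takes no position on the default value, since that is the point under discussion.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical writer sketch; not an actual DKPro Core CoNLL writer.
final class ColumnWriterSketch {
    // Hypothetical token holder: covered text plus an optional normalized form.
    static final class Tok {
        private final String coveredText;
        private final String form; // null when the tokenizer set no form

        Tok(String coveredText, String form) {
            this.coveredText = coveredText;
            this.form = form;
        }

        String getCoveredText() {
            return coveredText;
        }

        String getText() {
            return form != null ? form : coveredText;
        }
    }

    // Stand-in for the proposed PARAM_WRITE_COVERED_TEXT parameter.
    private final boolean writeCoveredText;

    ColumnWriterSketch(boolean writeCoveredText) {
        this.writeCoveredText = writeCoveredText;
    }

    // Writes one "ID<TAB>FORM" line per token.
    String write(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        int id = 1;
        for (Tok t : tokens) {
            String formColumn = writeCoveredText ? t.getCoveredText() : t.getText();
            sb.append(id++).append('\t').append(formColumn).append('\n');
        }
        return sb.toString();
    }
}

class ColumnWriterDemo {
    public static void main(String[] args) {
        // Illustrative segmentation of Spanish "del" into "de" + "el".
        List<ColumnWriterSketch.Tok> tokens = Arrays.asList(
                new ColumnWriterSketch.Tok("de", null),
                new ColumnWriterSketch.Tok("l", "el"));
        System.out.print(new ColumnWriterSketch(false).write(tokens));
        System.out.print(new ColumnWriterSketch(true).write(tokens));
    }
}
```

With the flag off, the second token is written as "el" (the normalized form); with the flag on, it is written as "l" (the covered text).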