nschneid closed this 6 months ago
For spoken corpora, X is used for unfinished words (scraps? false starts? I am not sure what you call that in English). But we are inconsistent: most of the time we can figure out what the complete word would be, and we use the POS of the repair. We hesitated between two strategies:
I think I prefer Solution 2 because, even if "a~" is repaired by "after" and I know that "a~" was used here as the start of an ADP, I don't want to have "a~" among the ADPs of my corpus.
In our corpora of spoken French the annotation is inconsistent and we should make a clear decision. See https://universal.grew.fr/?custom=657ad60d136b4. We use "~" to indicate unfinished words, because "-" is used in orthographic words. It would be easy to change the annotation with a Grew rule as soon as we have decided what to do.
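To see how inconsistent the current annotation is, one could tally the UPOS tags of tokens ending in "~". This is only a minimal Python sketch, not a Grew rule; the function name `upos_of_truncated` is hypothetical, and it assumes standard 10-column CoNLL-U input where truncation is marked by a trailing "~" on FORM.

```python
from collections import Counter


def upos_of_truncated(conllu_text, marker="~"):
    """Count the UPOS tags of tokens whose FORM ends with the truncation marker.

    Assumption: 10 tab-separated CoNLL-U columns, truncation marked on FORM.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        form, upos = cols[1], cols[3]
        if form.endswith(marker):
            counts[upos] += 1
    return counts
```

Anything other than X in the resulting counts is a truncated word that currently carries the POS of its repair.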
I would like the definition to stress more that this POS (non-)tag should really be a last resort and that it is actually a non-lexical one, similarly to dep.
@sylvainkahane For words truncated/unfinished due to a dysfluency, my gut feeling is that X would make sense, falling under the word fragment subcase. There are also uses of reparandum where a word is repeated, and there I would expect the regular tag to apply on both tokens.
@Stormur "It should be used very restrictively." seems to say that...are you seeing places where it is overused?
Maybe I am nitpicking, but it seems to leave room for people to create their own restrictions, which might be arbitrarily broad as we know, instead of specifying that it is really the last thing you should do if there is no other possibility.
If there is general agreement I would be open to adding a sentence along the lines of "If the word is deemed a 'real' word of the language, then another tag should be used, even if that word's morphosyntactic behavior is unusual."
Thanks @Stormur: the group agreed to emphasize that it should be used narrowly. Updated https://universaldependencies.org/u/pos/X.html
And @sylvainkahane it now mentions truncated words. I think I agree with you about ExtPos being the right place for the intended word POS if it can be determined.
Thanks @nschneid. I will adopt the POS X for all truncated words in our spoken corpora and add an ExtPos feature with the POS of the expected word.
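A rough sketch of that conversion, written in Python rather than as a Grew rule. The function name `retag_truncated` is hypothetical. It assumes that truncated words end in "~" and that the UPOS currently on such a token is the POS of the expected word, which is only true where the repair's POS was used.

```python
def retag_truncated(line, marker="~"):
    """Retag one CoNLL-U line: truncated tokens get UPOS X plus ExtPos=<old UPOS>.

    Assumption: the existing UPOS of a truncated token is the POS of the
    expected word. Tokens already tagged X are left alone, since the
    expected POS cannot be recovered from them.
    """
    line = line.rstrip("\n")
    cols = line.split("\t")
    if len(cols) != 10 or not cols[0].isdigit():
        return line  # comments, blank lines, multiword ranges, empty nodes
    form, upos, feats = cols[1], cols[3], cols[5]
    if not form.endswith(marker) or upos in ("X", "_"):
        return line
    pairs = [] if feats == "_" else feats.split("|")
    pairs = [p for p in pairs if not p.startswith("ExtPos=")]
    pairs.append(f"ExtPos={upos}")
    cols[3] = "X"
    # Keep FEATS in case-insensitive alphabetical order.
    cols[5] = "|".join(sorted(pairs, key=str.lower))
    return "\t".join(cols)
```

For example, a token line for "a~" tagged ADP with empty FEATS comes out with UPOS X and FEATS `ExtPos=ADP`. The sketch keeps any other features already on the token; whether those should stay on an X token is a separate decision.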
I think a few cases where X is appropriate can be spelled out in more detail (and the code-switching case should be updated in light of #1001). Will implement this as it shouldn't be too controversial, but feel free to weigh in here.