UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Elaborate documentation of X tag #1005

Closed nschneid closed 6 months ago

nschneid commented 11 months ago

I think a few cases where X is appropriate can be spelled out in more detail (and the code-switching case should be updated in light of #1001). Will implement this as it shouldn't be too controversial, but feel free to weigh in here.

sylvainkahane commented 11 months ago

For spoken corpora, X is used for unfinished words (scraps? false starts? I am not sure what you call that in English). But we are inconsistent: most of the time we can figure out what the complete word would be, and we use the POS of the repair. We hesitated between two strategies:

  1. using the POS of the corrected word (the repair) when we can figure it out.
  2. using X every time and putting the POS of the corrected word in ExtPos when we can figure it out.

I think I prefer Solution 2 because, even if "a~" is repaired by "after" and I know that "a~" was used here as the start of an ADP, I don't want to have "a~" among the ADPs of my corpus.

In our corpora of spoken French the annotation is inconsistent and we should make a clear decision. See https://universal.grew.fr/?custom=657ad60d136b4. We use "~" to indicate unfinished words, because "-" is used in orthographic words. It would be easy to change the annotation with a Grew rule as soon as we have decided what to do.

Stormur commented 11 months ago

I would like the definition to stress more that this POS (non-)tag should really be a last resort and that it is actually a non-lexical one, similar to dep.

nschneid commented 11 months ago

@sylvainkahane For words truncated/unfinished due to a dysfluency, my gut feeling is that X would make sense, falling under the word fragment subcase. There are also uses of reparandum where a word is repeated, and there I would expect the regular tag to apply on both tokens.

@Stormur "It should be used very restrictively." seems to say that...are you seeing places where it is overused?

Stormur commented 11 months ago

Maybe I am nitpicking, but it seems to leave room for people to set their own restrictions, which might be arbitrarily broad as we know, instead of specifying that it is really the last thing you should do if there is no other possibility.

nschneid commented 11 months ago

If there is general agreement I would be open to adding a sentence along the lines of "If the word is deemed a 'real' word of the language, then another tag should be used, even if that word's morphosyntactic behavior is unusual."

nschneid commented 10 months ago

Thanks @Stormur: the group agreed to emphasize that it should be used narrowly. Updated https://universaldependencies.org/u/pos/X.html

nschneid commented 10 months ago

And @sylvainkahane, it now mentions truncated words. I think I agree with you about ExtPos being the right place for the intended word's POS if it can be determined.

sylvainkahane commented 10 months ago

Thanks @nschneid. I will adopt the POS X for all truncated words in our spoken corpora and add an ExtPos feature with the POS of the expected word.
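
For illustration only, here is a rough Python sketch of what such a conversion could look like on a CoNLL-U file (the actual change would presumably be done with a Grew rule, as mentioned earlier in the thread). It rests on several assumptions that are not confirmed here: truncated words are the tokens whose FORM ends in "~", any non-X UPOS they currently carry is the POS of the intended word, and ExtPos is stored in the FEATS column. The function name `retag_truncated` is hypothetical.

```python
def retag_truncated(conllu_text, marker="~"):
    """Retag truncated words as X and keep their previous UPOS in ExtPos.

    Assumption: a token whose FORM ends with `marker` is a truncated word,
    and its current non-X UPOS is the POS of the intended (repaired) word.
    """
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Leave comments, blank lines, multiword-token ranges (e.g. "1-2"),
        # and empty nodes (e.g. "1.1") untouched.
        if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
            out.append(line)
            continue
        form, upos, feats = cols[1], cols[3], cols[5]
        truncated = form.endswith(marker) and len(form) > len(marker)
        # Tokens already tagged X carry no information about the intended POS.
        if truncated and upos not in ("X", "_"):
            pairs = [] if feats == "_" else feats.split("|")
            if not any(p.startswith("ExtPos=") for p in pairs):
                pairs.append("ExtPos=" + upos)
            # Keep features in case-insensitive alphabetical order.
            pairs.sort(key=str.lower)
            cols[3] = "X"
            cols[5] = "|".join(pairs)
        out.append("\t".join(cols))
    return "\n".join(out)


# Hypothetical two-token fragment: "a~" repaired by "after".
sample = (
    "1\ta~\ta~\tADP\t_\t_\t2\treparandum\t_\t_\n"
    "2\tafter\tafter\tADP\t_\t_\t0\troot\t_\t_"
)
print(retag_truncated(sample))
```

Tokens that are already tagged X are left alone, since the intended POS cannot be recovered from the annotation itself; those would still need a manual pass.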