UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

UPOS "X" #440

Closed nschneid closed 2 weeks ago

nschneid commented 9 months ago

The guidelines at https://universaldependencies.org/u/pos/X.html say it should be used very restrictively.

Setting aside the usage with goeswith dependents, we have:

GUM XPOS doesn't use ADD or AFX (these are more recent additions to the PTB tagset). But I see internet addresses under PROPN in GUM, which makes sense linguistically.
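
A tally of the kind referred to above can be reproduced from a treebank's CoNLL-U files. The following is a minimal sketch, not the script used for this issue: `tally_x_by_xpos` is a hypothetical helper, the sample sentence is made up for illustration, and "usage with goeswith dependents" is assumed to mean tokens whose DEPREL is `goeswith`.

```python
from collections import Counter


def tally_x_by_xpos(conllu_text):
    """Count XPOS tags among UPOS=X tokens, ignoring goeswith dependents."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        tok_id, upos, xpos, deprel = cols[0], cols[3], cols[4], cols[7]
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in tok_id or "." in tok_id:
            continue
        if upos == "X" and deprel != "goeswith":
            counts[xpos] += 1
    return counts


# Made-up sample for illustration only; not taken from EWT.
SAMPLE = "\n".join([
    "# text = 1. see example.com for de tails",
    "1\t1.\t1.\tX\tLS\t_\t2\tdep\t_\t_",
    "2\tsee\tsee\tVERB\tVB\t_\t0\troot\t_\t_",
    "3\texample.com\texample.com\tX\tADD\t_\t2\tobj\t_\t_",
    "4\tfor\tfor\tADP\tIN\t_\t5\tcase\t_\t_",
    "5\tde\tdetail\tNOUN\tNNS\t_\t2\tobl\t_\t_",
    "6\ttails\t_\tX\tGW\t_\t5\tgoeswith\t_\t_",
])

for xpos, n in tally_x_by_xpos(SAMPLE).most_common():
    print(xpos, n)
```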

I think steps here are:

  1. [ ] Harmonize treatment of LS list markers
  2. [x] Map EWT ADD to PROPN instead of X, and move guidelines examples from SYM (UniversalDependencies/docs#973)
  3. [ ] Review separated affixes and assign a POS based on the kind of modification, typically ADJ or ADV (#152)
  4. [x] Come up with a coherent EWT policy for filenames (e.g. flat or goeswith, and what to do about transparent syntax within parts of filenames) (UniversalDependencies/docs#666)
  5. [x] Clarify UPOS policy for flat:foreign structures (maybe individual words should be X and there should be an ExtPos)
    • UniversalDependencies/docs#1001 clarifies the policy; decided not to use the subtype for English. Need to check whether the policy is implemented consistently.
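
Step 2 above amounts to a mechanical retagging. The sketch below shows one way it could be done over CoNLL-U text; it is an illustration only, not the conversion actually applied to EWT. `remap_add_to_propn` is a hypothetical name, and the sketch assumes that only the UPOS column needs to change (features and dependency relations are left as they are).

```python
def remap_add_to_propn(conllu_text):
    """Retag tokens with XPOS=ADD from UPOS X to PROPN; leave everything else untouched."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Token lines have exactly 10 tab-separated columns and are not comments.
        is_token = len(cols) == 10 and not line.startswith("#")
        if is_token and cols[4] == "ADD" and cols[3] == "X":
            cols[3] = "PROPN"  # column 4 (index 3) is UPOS
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


# Made-up token lines for illustration only.
before = "\n".join([
    "# text = see example.com",
    "1\tsee\tsee\tVERB\tVB\t_\t0\troot\t_\t_",
    "2\texample.com\texample.com\tX\tADD\t_\t1\tobj\t_\t_",
])
print(remap_add_to_propn(before))
```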

amir-zeldes commented 8 months ago

> list item numbers should be NUM

This is definitely not right, because LS is also the tag for graphical bullets, which are in no way numbers. I'm also not sure that "A1.iii)" is a number, I'd say it's much more of an X. I see some mention of using either PUNCT/punct or SYM/dep for these. In GUM xpos=LS is always attached as dep, and nummod is only used for counting things.

nschneid commented 8 months ago

> This is definitely not right, because LS is also the tag for graphical bullets, which are in no way numbers.

https://universaldependencies.org/u/pos/SYM.html says bullets are PUNCT. It seems to be distinguishing them from list item markers with a (quasi)numerical component (i.e., they reflect a position in a sequential ordering of some kind).

I could also imagine thinking of lists as a type of coordination, and these as helping to mark how a list item relates to other items in the list, so CCONJ. But that may be unpopular. :)

amir-zeldes commented 8 months ago

I'm not so convinced. I think syntactically there is no difference between numerical, graphical, alphabetical and mixed list item markers. It's all the same kind of orthographic device, and I would like them to have the same analysis. I wouldn't feel too bad about punct, but then we are not allowed to treat them as kinds of numbers morphologically, and in any case it would create an uncomfortable situation where punctuation becomes open ended.

Tagging them all as SYM, or even splitting them into SYM for non-numerical and NUM for numerical would be OK for me too, but I think they should have the same deprel regardless of what kind of list item marker they are.
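
Whether list item markers currently receive a uniform analysis can be checked by counting the (UPOS, DEPREL) combinations found on tokens with XPOS `LS`. This is a minimal sketch under the assumption that the input is standard 10-column CoNLL-U; `ls_analyses` is a hypothetical helper and the sample tokens are made up.

```python
from collections import Counter


def ls_analyses(conllu_text):
    """Count (UPOS, DEPREL) pairs for tokens whose XPOS is LS."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 10 and cols[4] == "LS":
            counts[(cols[3], cols[7])] += 1
    return counts


# Made-up tokens for illustration only: one numeric marker, one bullet.
LS_SAMPLE = "\n".join([
    "1\t1.\t1.\tNUM\tLS\t_\t2\tnummod\t_\t_",
    "2\tapples\tapple\tNOUN\tNNS\t_\t0\troot\t_\t_",
    "",
    "1\t*\t*\tX\tLS\t_\t2\tdep\t_\t_",
    "2\tpears\tpear\tNOUN\tNNS\t_\t0\troot\t_\t_",
])

pairs = ls_analyses(LS_SAMPLE)
# More than one distinct pair means the markers are not analyzed uniformly.
print("consistent" if len(pairs) <= 1 else "inconsistent", dict(pairs))
```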

nschneid commented 2 weeks ago

LS issue --> #465
AFX issue --> #152

So I think we're done here.