UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

NumForm feature annotation for alphabetic list/number forms #983

Closed rhdunn closed 1 year ago

rhdunn commented 1 year ago

The current NumForm documentation at https://universaldependencies.org/cs/feat/NumForm.html has no option for labelling alphabetic list markers such as "a), b), c)". This is needed for list tokens in the English treebanks, and could also apply to other treebanks that use letters of the target language's alphabet.

These could be supported by a new NumType=Ord|NumForm=Alpha feature label, where NumForm=Alpha is defined as:

Alpha: alphabetic numeral

Examples:

dan-zeman commented 1 year ago

I don't think letters are numbers, even if they appear in a position where a number could be used (and that position is not limited to letters and numbers: it can also hold bullets and various symbols).

I think the above holds globally, but since you specifically refer to the Czech documentation: We definitely do not need NumForm=Alpha in Czech because letters marking list items are now tagged NOUN. It may be debatable as well, but I suppose the rationale is that "a name of a letter" (which is how the token is pronounced) is a noun. See also this query: http://hdl.handle.net/11346/PMLTQ-3TUV

nschneid commented 1 year ago

If they are names (of a letter), why not PROPN?

amir-zeldes commented 1 year ago

I agree with @dan-zeman that these are not numbers. There are languages that truly use letters as numbers, mostly ancient languages (Coptic, Biblical Hebrew, Ancient Greek etc.). In those cases I would see the case for a NumType like this, but not for English list item markers. In such languages you can say things like "x years" and it means "60 years" (since the numerical value of "x" is 60):

https://annis.copticscriptorium.org/annis/scriptorium#_q=cG9zPSJOVU0iIF89XyB0b2s9IuKynSI&_c=Y29wdGljLnRyZWViYW5r&cl=5&cr=5&s=0&l=10&_seg=bm9ybV9ncm91cA
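
To make the "letters that truly are numbers" point concrete, here is a minimal sketch of an alphabetic numeral system. It uses the Ancient Greek letter values rather than the Coptic data in the linked query, and the function name `alphabetic_value` is made up for illustration. It ignores numeral marks and thousands, so it is only an approximation of how such numerals are read.

```python
# Greek alphabetic numerals: nine letters each for units, tens, and hundreds.
# The archaic letters stigma (ϛ), koppa (ϟ), and sampi (ϡ) fill 6, 90, and 900.
UNITS = "αβγδεϛζηθ"
TENS = "ικλμνξοπϟ"
HUNDREDS = "ρστυφχψωϡ"

GREEK_VALUES = {}
for letters, scale in ((UNITS, 1), (TENS, 10), (HUNDREDS, 100)):
    for position, letter in enumerate(letters, start=1):
        GREEK_VALUES[letter] = position * scale


def alphabetic_value(numeral: str) -> int:
    """Sum the letter values of an alphabetic numeral (hypothetical helper).

    Raises KeyError for any character that has no numeric value.
    """
    return sum(GREEK_VALUES[letter] for letter in numeral)


print(alphabetic_value("ξ"))    # 60
print(alphabetic_value("ρκγ"))  # 100 + 20 + 3 = 123
```

A list marker like "a)" has no such arithmetic value, which is the contrast being drawn here.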

As has been pointed out, list markers can indicate hierarchy and contain things like "A.iii.)", which is also not the name of a letter. I don't think these things should be tokenized apart either - we can see those are not real parentheses, because they do not match (there is no opening "(" corresponding to the closing one). In sum, I think this actually shows some foresight from PTB in giving them a unique xpos tag "LS", and I suppose the closest UD equivalent is X.

If someone wants to argue that pure number cases should be tagged as NUM and given features, I could live with that, though I wonder if it's worth the complication. But calling "A.iii.)" a number or the name of a letter seems odd to me.

rhdunn commented 1 year ago

So we have the following forms as single tokens:

  1. (a), a), a., 1.a), etc. (letter-like or mixed) as UPOS X and XPOS LS;
  2. (ii), ii), ii., iv.ii), etc. (roman numeral) as UPOS NUM, XPOS LS, and NumType=Card|NumForm=Roman;
  3. (1), 1), 1., 1.2), etc. (number) as UPOS NUM, XPOS LS, and NumType=Card|NumForm=Digit?
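
The three-way split above could be sketched roughly as follows. This is only an illustration of the summary, not an agreed guideline: the function name, the stripping rule, and the restriction of Roman numerals to the letters i, v, and x (values 1 to 39) are all assumptions. Markers such as "i)" or "v)" are genuinely ambiguous between a letter and a Roman numeral, and the sketch simply treats them as Roman.

```python
import re

# Assumption: list numbering rarely goes past 39, so only i, v, x are accepted.
# This keeps "c)" and "d)" in the letter class instead of reading them as 100 and 500.
ROMAN = re.compile(r"x{0,3}(ix|iv|v?i{0,3})", re.IGNORECASE)


def classify_list_marker(token: str):
    """Return (UPOS, XPOS, FEATS) for a single-token list marker, or None."""
    # Drop the surrounding brackets and dots, then split a hierarchy like "1.2".
    parts = [p for p in token.strip("().").split(".") if p]
    if not parts:
        return None
    if all(p.isdigit() for p in parts):
        return ("NUM", "LS", "NumType=Card|NumForm=Digit")
    if all(ROMAN.fullmatch(p) for p in parts):
        return ("NUM", "LS", "NumType=Card|NumForm=Roman")
    # Letter-like or mixed markers fall through to X.
    return ("X", "LS", "_")


for marker in ["a)", "(ii)", "1.2)", "1.a)", "A.iii.)"]:
    print(marker, classify_list_marker(marker))
```

Under these assumptions, "A.iii.)" comes out as X because one of its parts is a plain letter, which matches the "letter-like or mixed" case.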

nschneid commented 1 year ago

OK, you've convinced me that "(a)" is not a true number and not the name of the letter. But I do think it functions as a metalinguistic name for a textual unit. After having established what "(a)" refers to, you can then use it as a nominal: "In (a), we see that..." or "Strike (a) and replace it with...". Moreover, you might group together a few letters in the sequence, triggering plural agreement: "items (a-c)", "(a-c) are redundant".

The X guidelines suggest it should be used for foreign or garbage words that cannot be assigned to an ordinary category. Could "(a)" ever be non-nominal? I can't think of how, and seeing as it doesn't normally take determiners and is a somewhat open-class way to refer to a unique entity, I think PROPN could make sense.

nschneid commented 1 year ago

As for true numbers, because any (non-fractional) cardinal number can in principle be used to identify where something is in a sequence, I would say that what PTB calls LS is properly understood as a grammatical function fulfilled by a NUM or something like "(a)". (Of course PTB lacks full-fledged grammatical functions like we have in UD.) As to what that function should be called—GUM uses dep, and the query @dan-zeman posted above shows nmod. I personally think dep is too vague for something so well-attested/productive; maybe some flavor of nmod/obl is our best bet, or right-headed appos.

amir-zeldes commented 1 year ago

Well, we know what the head is, and we don't know what to call it... I thought that's more or less the definition of dep 😄

I suppose we could go with advmod if you really prefer that (like "firstly", "secondly" etc.), but it does muddy the meaning of the label a bit. I'm honestly not so unhappy with dep.

nschneid commented 1 year ago

It seems like there should be a way for people to search crosslinguistically for list item markers since it's a distinctive sort of relation that I think most people are familiar with (and not super rare). Plain dep conflates many things, but I could imagine dep:li for example.
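
As a rough sketch of what such a search might look like if a dedicated subtype existed: the code below scans CoNLL-U text for a given relation label. Note that `dep:li` is only the label floated in this thread, not an existing UD relation, and that the sample sentence and the function name `find_by_deprel` are invented for illustration.

```python
# Invented sample in CoNLL-U format; "dep:li" is hypothetical.
SAMPLE = """\
# text = a) Open the file.
1\ta)\ta)\tX\tLS\t_\t2\tdep:li\t_\t_
2\tOpen\topen\tVERB\tVB\tMood=Imp|VerbForm=Fin\t0\troot\t_\t_
3\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t4\tdet\t_\t_
4\tfile\tfile\tNOUN\tNN\tNumber=Sing\t2\tobj\t_\tSpaceAfter=No
5\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_
"""


def find_by_deprel(conllu_text: str, deprel: str):
    """Return (FORM, UPOS, HEAD) for every word whose DEPREL equals `deprel`."""
    hits = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        if cols[7] == deprel:
            hits.append((cols[1], cols[3], int(cols[6])))
    return hits


print(find_by_deprel(SAMPLE, "dep:li"))  # [('a)', 'X', 2)]
```

With plain `dep`, the same query would also return every other unclassified dependency, which is the conflation problem described above.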

rueter commented 1 year ago

NumType deserves development. Will «to the n'th power» then involve NumType=AlphaOrd?

rhdunn commented 1 year ago

Would dep:li mean that for a sentence like "Chapter I. The Cyclone" the "Chapter I." part would be attached to the "The Cyclone" part via the dep:li relation?

nschneid commented 1 year ago

> Would dep:li mean that for a sentence like "Chapter I. The Cyclone" the "Chapter I." part would be attached to the "The Cyclone" part via the dep:li relation?

Interesting question...the relation between a textual part identifier ("Chapter I") and the title of that part is not quite the same as a list marker. I'm not sure "Chapter I" is really a modifier. I wonder if appos would make more sense there.

amir-zeldes commented 1 year ago

> I'm not sure "Chapter I" is really a modifier. I wonder if appos would make more sense there.

I think it's the same as "page 5", and falls under the same category as what we've called nmod:desc in the mischievous nominals paper. It has the same coordination properties, e.g. pages 5-10, chapters 3-4 etc.

> Plain dep conflates many things, but I could imagine dep:li for example.

We can talk about it more, but it's getting late for introducing changes like this into the current release, so I would suggest putting a pin in it for now...

nschneid commented 1 year ago

"Chapter I": sorry I meant the phrase as a whole is not a modifier. Yes, the relation between "Chapter" and "I" is what we call nmod:desc.

amir-zeldes commented 1 year ago

oh, yes, I see