Closed rhdunn closed 1 year ago
I don't think letters are numbers, even if they are used in a place in which a number could be used (but not just letters and numbers, it could be also bullets and various symbols).
I think the above holds globally, but since you specifically refer to the Czech documentation: We definitely do not need `NumForm=Alpha` in Czech because letters marking list items are now tagged `NOUN`. It may be debatable as well, but I suppose the rationale is that "a name of a letter" (which is how the token is pronounced) is a noun. See also this query: http://hdl.handle.net/11346/PMLTQ-3TUV
If they are names (of a letter), why not PROPN?
I agree with @dan-zeman that these are not numbers. There are languages that truly use letters as numbers, mostly ancient languages (Coptic, Biblical Hebrew, Ancient Greek etc.). In those cases I would see the case for a NumType like this, but not for English list item markers. In such languages you can say things like "x years" and it means "60 years" (since the numerical value of "x" is 60):
As has been pointed out, list markers can indicate hierarchy and contain things like "A.iii.)", which is also not the name of a letter. I don't think these things should be tokenized apart either - we can see those are not real parentheses, because they do not match (there is no opening "(" corresponding to the closing one). In sum, I think this actually shows some foresight from PTB in giving them a unique xpos tag "LS", and I suppose the closest UD equivalent is X.
If someone wants to argue that pure number cases should be tagged as NUM and given features, I could live with that, though I wonder if it's worth the complication. But calling "A.iii.)" a number or the name of a letter seems odd to me.
So we have the following forms as single tokens:
- `(a)`, `a)`, `a.`, `1.a)`, etc. (letter-like or mixed) as UPOS `X` and XPOS `LS`;
- `(ii)`, `ii)`, `ii.`, `iv.ii)`, etc. (roman numeral) as UPOS `NUM`, XPOS `LS`, and `NumType=Card|NumForm=Roman`;
- `(1)`, `1)`, `1.`, `1.2)`, etc. (number) as UPOS `NUM`, XPOS `LS`, and `NumType=Card|NumForm=Digit`?

OK, you've convinced me that "(a)" is not a true number and not the name of the letter. But I do think it functions as a metalinguistic name for a textual unit. After having established what "(a)" refers to, you can then use it as a nominal: "In (a), we see that..." or "Strike (a) and replace it with...". Moreover, you might group together a few letters in the sequence, triggering plural agreement: "items (a-c)", "(a-c) are redundant".
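The three-way split of list-marker tokens listed above (letter-like or mixed, roman numeral, digit) can be sketched as a small classifier. This is only an illustration of the proposal under discussion, not an agreed guideline: the function name `classify_list_marker` is hypothetical, and treating only the letters i, v and x as roman numerals is a simplifying assumption, since a marker such as "(i)" or "(x)" is genuinely ambiguous without the surrounding list.

```python
import re

# Assumption: only i, v, x count as roman numeral letters here, so that
# "(c)" or "(d)" are not misread as the roman numerals 100 and 500.
ROMAN = re.compile(r"[ivxIVX]+")
DIGIT = re.compile(r"[0-9]+")


def classify_list_marker(token):
    """Return (UPOS, XPOS, FEATS) for a single-token list marker.

    Hypothetical helper that mirrors the split proposed in the thread.
    """
    # Drop enclosing brackets and trailing dots, then split hierarchical
    # markers such as "1.a)" or "iv.ii)" into their components.
    core = token.strip("().")
    parts = [p for p in core.split(".") if p]
    if not parts:
        raise ValueError(f"not a list marker: {token!r}")
    if all(DIGIT.fullmatch(p) for p in parts):
        return ("NUM", "LS", "NumType=Card|NumForm=Digit")
    if all(ROMAN.fullmatch(p) for p in parts):
        return ("NUM", "LS", "NumType=Card|NumForm=Roman")
    # Letter-like or mixed markers fall through to X with no features.
    return ("X", "LS", "_")


if __name__ == "__main__":
    for marker in ["(a)", "1.a)", "iv.ii)", "1.2)", "A.iii.)"]:
        print(marker, classify_list_marker(marker))
```

Because a purely form-based rule cannot tell the letter "i" from the roman numeral "i", a real converter would probably need to look at neighbouring items in the same list before deciding.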
The `X` guidelines suggest it should be used for foreign or garbage words that cannot be assigned to an ordinary category. Could "(a)" ever be non-nominal? I can't think of how, and seeing as it doesn't normally take determiners and is a somewhat open-class way to refer to a unique entity, I think `PROPN` could make sense.
As for true numbers, because any (non-fractional) cardinal number can in principle be used to identify where something is in a sequence, I would say that what PTB calls LS is properly understood as a grammatical function fulfilled by a `NUM` or something like "(a)". (Of course PTB lacks full-fledged grammatical functions like we have in UD.) As to what that function should be called: GUM uses `dep`, and the query @dan-zeman posted above shows `nmod`. I personally think `dep` is too vague for something so well-attested/productive; maybe some flavor of `nmod`/`obl` is our best bet, or right-headed `appos`.
Well, we know what the head is, and we don't know what to call it... I thought that's more or less the definition of dep 😄
I suppose we could go with advmod if you really prefer that (like "firstly", "secondly" etc.), but it does muddy the meaning of the label a bit. I'm honestly not so unhappy with dep.
It seems like there should be a way for people to search crosslinguistically for list item markers since it's a distinctive sort of relation that I think most people are familiar with (and not super rare). Plain `dep` conflates many things, but I could imagine `dep:li` for example.
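To make the search use case concrete, here is a minimal sketch of how list item markers could be pulled out of CoNLL-U data if a dedicated subtype existed. Note the assumptions: `dep:li` is only the label floated in this comment, not an adopted UD relation; the sample sentence and its annotation are invented for illustration; and `find_list_markers` is a hypothetical helper.

```python
# Invented sample sentence; the dep:li label is hypothetical.
ROWS = [
    ["1", "a)", "a)", "X", "LS", "_", "2", "dep:li", "_", "_"],
    ["2", "Open", "open", "VERB", "VB", "_", "0", "root", "_", "_"],
    ["3", "the", "the", "DET", "DT", "_", "4", "det", "_", "_"],
    ["4", "file", "file", "NOUN", "NN", "_", "2", "obj", "_", "SpaceAfter=No"],
    ["5", ".", ".", "PUNCT", ".", "_", "2", "punct", "_", "_"],
]
SAMPLE = "# text = a) Open the file.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"


def find_list_markers(conllu, deprel="dep:li"):
    """Return (form, upos, head) for every word attached with `deprel`."""
    hits = []
    for line in conllu.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        if cols[7] == deprel:
            hits.append((cols[1], cols[3], int(cols[6])))
    return hits


if __name__ == "__main__":
    print(find_list_markers(SAMPLE))
```

With plain `dep`, the same query would also return every other unclassified dependent, which is the conflation the comment above is worried about.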
NumType deserves development. Will «to the n'th power» then involve NumType=AlphaOrd?
Would `dep:li` mean that for a sentence like "Chapter I. The Cyclone" the "Chapter I." part would be attached to the "The Cyclone" part via the `dep:li` relation?
> Would `dep:li` mean that for a sentence like "Chapter I. The Cyclone" the "Chapter I." part would be attached to the "The Cyclone" part via the `dep:li` relation?
Interesting question... the relation between a textual part identifier ("Chapter I") and the title of that part is not quite the same as a list marker. I'm not sure "Chapter I" is really a modifier. I wonder if `appos` would make more sense there.
> I'm not sure "Chapter I" is really a modifier. I wonder if appos would make more sense there.
I think it's the same as "page 5", and falls under the same category as what we've called `nmod:desc` in the mischievous nominals paper. It has the same coordination properties, e.g. pages 5-10, chapters 3-4 etc.
> Plain dep conflates many things, but I could imagine dep:li for example.
We can talk about it more, but it's getting late for introducing changes like this into the current release, so I would suggest putting a pin in it for now...
"Chapter I": sorry I meant the phrase as a whole is not a modifier. Yes, the relation between "Chapter" and "I" is what we call nmod:desc
.
oh, yes, I see
The current https://universaldependencies.org/cs/feat/NumForm.html does not have an option for labelling alphabetic lists such as in "a), b), c)". This is needed for list tokens in the English treebanks, but can also apply to other treebanks that use letters in the target language's alphabet.
These could be supported by a new `NumType=Ord|NumForm=Alpha` feature label, where `NumForm=Alpha` is defined as:
Alpha: alphabetic numeral
Examples: