Open IsraelLand opened 2 years ago
These examples are great, sorry for the slow response! I was actually talking about mathy stuff like this with @nschneid recently, and @lauren-lizzy-levine is writing a paper about UD treebanking math expressions, so here's a shout out to both of you in case you want to take a look!
Concretely about these cases:
Semantically it's most like an attribute (minus=negative?), but syntactically we can see it's not amod in Hebrew, since that should come after its head. For complex numbers, like "five thousand" etc., many TBs use compound, so maybe that's the most neutral way of describing it. I like 20 being the head and not using flat, since it keeps some analogy to "negative 20". If you want to use the Hebrew-specific :affix subtype that's a possibility, maybe similar to "תת"?
¼). It's not totally out of the question to just not tokenize it. But if you're worried you'll get things like "1/(a+b)" later, then yes, you would need some description of what to do with division. In Hebrew, "xelkei" is a noun in construct state, so compound would be appropriate (unlike English "divided by", which is an acl). If you want to read it as "arba'a revaim", then maybe nummod actually is correct, but in something like "four quarters" there is no /, so that would probably be better as SYM (maybe attached as dep) or PUNCT/punct (only in the case where you want to read "arba'a reva'im", so it's punctuation by virtue of being totally unpronounced). Overall I would prefer SYM to PUNCT. Any other opinions anyone? @lauren-lizzy-levine what are your thoughts on division for that paper?
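To make the tokenization question concrete, here is a minimal Python sketch of one possible policy: split a plain ASCII fraction such as "4/4" into three tokens, and leave a precomposed glyph such as "¼" as a single token. The helper name `tokenize_fraction` and the regular expression are assumptions for illustration only, not part of any UD tool, and the sketch deliberately does not handle expressions like "1/(a+b)".

```python
import re
from typing import List

# Hypothetical pattern: digits, a slash, digits, and nothing else.
# "\d" only matches decimal digits, so the single glyph "¼" does not match.
FRACTION = re.compile(r"^(\d+)(/)(\d+)$")


def tokenize_fraction(text: str) -> List[str]:
    """Split "4/4" into ["4", "/", "4"]; return anything else unchanged."""
    match = FRACTION.match(text)
    if match is None:
        # Precomposed fractions and complex expressions stay as one token.
        return [text]
    return list(match.groups())


print(tokenize_fraction("4/4"))
print(tokenize_fraction("¼"))
```

Whether the slash token then gets SYM or PUNCT is a separate annotation decision that this sketch does not make.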
Thanks!
I agree with the analogy to negative 20, this is exactly it. The thing is, shouldn't we see -20 as its own separate number (i.e. not a variation of 20)? I'm not sure I'd go as far as to not segment - from 20, but I can still see them as one unit, if we use flat (-, 20). "Mathematically" I think this is more correct, -20 being its own number separate from 20 (while still continuous...). "Linguistically" and intuitively, -20 is obviously some sort of a subset of 20, so I'm not sure what we should favor here. minus spelled out hints at the latter, but we wouldn't want 2 separate taggings, one for minus 20, another for -20...
Right. So you'd prefer nummod, with / tagged SYM and attached as dep?
Thank you
- I agree "minus" is probably best described as a NOUN in Hebrew, esp. when spelled out (so it's not SYM). You can stick a definite article on it in other contexts as well. Using SYM on a spelled out word like this looks odd to me.
The SYM guidelines are not completely clear. They say "Mathematical operators form another group of symbols.", but it is unclear whether that covers mathematical operators written as words. I would guess that it should not—SYM should be restricted to orthographic symbols, and the statement "A symbol is a word-like entity that differs from ordinary words by form, function, or both." should be modified accordingly.
Relatedly, SYM gives email addresses and URLs as examples. In practice these are tagged as X in EWT and PROPN in GUM. So we have some work to do.
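As a rough illustration of the "orthographic symbols only" reading of SYM, the following Python sketch checks whether a token consists purely of Unicode symbol or punctuation characters. The function name `is_orthographic_symbol` is hypothetical, and the Unicode-category test is an assumed heuristic, not something the UD guidelines define. It cannot separate SYM from PUNCT; it only flags spelled-out words as non-candidates.

```python
import unicodedata


def is_orthographic_symbol(token: str) -> bool:
    """True if every character is in a Unicode S* (symbol) or P* (punctuation) category."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in ("S", "P") for ch in token
    )


# Written operators pass the check; spelled-out operator words do not.
for tok in ["-", "+", "%", "minus", "מינוס"]:
    print(tok, is_orthographic_symbol(tok))
```

Under this heuristic a spelled-out "minus" would need some other tag, in line with the suggestion to restrict SYM to orthographic symbols.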
Is "מינוס 20%" an example of סמיכות? If you were discussing an incorrect figure, could you say that "מינוס ה20% לא נכון"? In English, unary "minus" has been argued to be a preposition. "A plus or a minus" would be a coerced NOUN. I don't know if these considerations apply to Hebrew though.
I think you need to first decide if you tokenize it at all
Tokenization of fractions is also an issue in English: UniversalDependencies/UD_English-EWT#337
> Is "מינוס 20%" an example of סמיכות? If you were discussing an incorrect figure, could you say that "מינוס ה20% לא נכון"?
Sounds awkward to me, opting for "המינוס 20% לא נכון" as one -20 unit. Perhaps in very forced, prescriptive language. I don't feel it's a smixut in the original "מינוס 20%" either, as opposed to "מינוס המינוסים" "minus of minuses", which definitely would be a smixut, similar to your example.
That is part of the reason I'm not sure how conjoined they are, or whether there's a head (if there is, it probably won't be "minus"): in this environment you'd otherwise think this is a regular smixut/prefix, but minus seems a different breed for many reasons.
Agreed on SYM. I also don't think "מינוס 20%" is smixut... And I think the head structure is the inverse of a standard smixut, since Hebrew is head-first. Also the proper head here is "%", so -20 is at most a complex numeral, in which case I think we can still use compound (in the UD sense), but without it being smixut. For other head-last compounds UD Hebrew has used compound:affix, so maybe that fits here too (at least it differentiates it from smixut).
Right, if you see it as a head structure anyway, meaning we don't "switch over" to some numerical representation, I think we should do compound:affix, the head being 20, - modifying it, in line with "twenty five" and so on, as well.
compound:affix is also pretty flexible, if we look at pluralizations, it encompasses both "תת אלוף" (the rank) - "תתי אלופים", as well as "פוסט טראומה" (PTSD) - "פוסט טראומות" - which if we had to pluralize -20, would be most similar to, "המינוס 20ים", never "מינוסי ה20*", so I don't see why not use it here.
I agree, and "המינוס 20ים" is a good illustration of the headedness. I think putting this into the compound:affix bin is the best decision.
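For reference, the analysis converged on here can be written out as CoNLL-U rows. The Python sketch below builds them for the fragment "מינוס 20 %", with 20 heading "מינוס" via compound:affix. Several details are assumptions not settled in this thread: the NOUN tag for "מינוס" follows the suggestion made earlier, the nummod link from 20 to "%" is a guess at how the numeral attaches to its head, and "%" is marked root only because the fragment stands alone.

```python
# Each tuple: (ID, FORM, UPOS, HEAD, DEPREL)
TOKENS = [
    (1, "מינוס", "NOUN", 2, "compound:affix"),  # minus modifies 20
    (2, "20", "NUM", 3, "nummod"),              # assumption: numeral attaches to "%"
    (3, "%", "SYM", 0, "root"),                 # root only because this is a bare fragment
]


def to_conllu(tokens):
    """Render tokens as 10-column CoNLL-U rows, filling unused columns with "_"."""
    lines = []
    for tok_id, form, upos, head, deprel in tokens:
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(tok_id), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines)


print(to_conllu(TOKENS))
```

Swapping the first row's HEAD and DEPREL is enough to compare this against the flat or plain compound alternatives discussed above.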
Hi @amir-zeldes
I'm sure this has been discussed in some form before, admittedly, Hebrew might complicate this further.
minus 20% - former HTB's sole instance of this weird phrasing is nmod between minus and 0, so that's no good imo. So either a compound (minus, 20) or compound:affix (20, minus), or if we assume a flat relation, flat (minus, 20) or goeswith.
Shira convinced me that's more of a flat relation, and I don't think goeswith is right here, so flat. As we have a slew of validation issues to solve regarding this, I'd like to know you also think that's the best one beforehand. Also, minus is NOUN?
Four on the floor: it seems the right reading "out loud" is "miktzav arba(a) revaim". I have no idea how to represent this. The \ is not uttered like "xelkei", so no SYM, but it's still a numerical representation inside a Hebrew sentence. So either compound (4, 4), a "simple" one with no Definite=Cons, or nummod (4, 4): if we assume "arbaa revaim", then the first 4 nummodifies the second. That said, the second 4 represents "revaim", not really the number 4, so I think it's not right.
Thank you