IAHLT / UD_Hebrew

Hebrew Universal Dependencies Treebank

(Non) definite structure involving (temporals,) number expressions #60

Open IsraelLand opened 2 years ago

IsraelLand commented 2 years ago

Hi @amir-zeldes ,

So @strasss found some sort of structure that's pretty common in colloquial speech, which we weren't sure what to do with. I think Omer eventually opted for compound, but since I keep stumbling upon it now that I've seen it, I'd like your input.

יום א' ניתן ליצור קשר בימים א'-ה'

This is obviously a SYM obl case, but how should we link yamim to aleph? Should yom/im be a noun/propn? bayamim aleph ad he is a no brainer, mostly, as we'd go for appos. beyamim aleph ad he or beyom aleph is different. It doesn't seem like nmod:desc - those have a hard time being pluralized similarly -

פגשתי אותו בימי שני-שישי \ בימים שני-שישי \ ב(ה)ימים שני-שישי - V גדודי 70 ו-72 נעו לאורך הציר \ גדודים 70 ו-72 נעו לאורך הציר ( \ גדודי 70 נעו לאורך הציר )

This is also because of the nature of gdud 70 being a specific entity, while there are many Thursdays, and the compositionality of [yom [hamishi]]. I think they are appositional (Thu-Fri are interchangeable with yamim) but this is somewhat problematic with the indefinite case, while compound is another option, which I find very general - this is no construct state, anyway, if the construct form ימי isn't used.

TL;DR - ב(ה)ימי(ם) א-ה cases are possible across the board, while other nmod:desc or actual smixut cases aren't. Should all yamim cases and beyamim 1-5 get tagged the same or differently, and should that be desc, appos, compound, or some other contraption?

Other words that are like that are shaot ( המשרד פתוח בשעות 6-8), gilim ( גיל(אי) 5-6 ) and months to some extent.

Thank you!

amir-zeldes commented 2 years ago

It doesn't seem like nmod:desc - those have a hard time being pluralized similarly

Actually if you look at the nmod:desc paper, we had similar cases with:

See section 3.4 here: https://arxiv.org/pdf/2108.12928.pdf

I agree that indefinites behave differently from appos and compound (no compound form, as you noted - if yes, then probably that's the right analysis). There are also some cases which defy pluralization, like "*Firefoxes 58.0 and 59.0", but on the whole I think nmod:desc is closest for all of them, and we've been using dep for these, in anticipation of the possible introduction of nmod:desc.

amir-zeldes commented 2 years ago

PS - to clarify, I read "hours" as definite, so appos, and if ages appears as "גילאי" then I would go with compound.

IsraelLand commented 2 years ago

Right, so compound for the "explicit smixut" forms (biymei 1-5), appos for the definite forms (bayamim 1-5) and nmod:desc for the indef. forms (beyamim 1-5)? That's assuming "yom aleph" in itself is nmod:desc? (which I don't think is what we've been doing) I hope I got that right, I got a bit confused there, thanks
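
The three-way split asked about here can be sketched as a small lookup. This is only an illustration: the category names (`construct`, `definite`, `indefinite`) and the helper `proposed_deprel` are hypothetical and not part of any UD tooling, and the mapping reflects the proposal in this comment rather than a settled guideline.

```python
# Hypothetical lookup mirroring the split proposed in this comment.
PROPOSED_DEPREL = {
    "construct": "compound",    # explicit smixut, e.g. biymei 1-5
    "definite": "appos",        # e.g. bayamim 1-5
    "indefinite": "nmod:desc",  # e.g. beyamim 1-5
}


def proposed_deprel(form_type: str) -> str:
    """Return the relation proposed for a day/number expression."""
    try:
        return PROPOSED_DEPREL[form_type]
    except KeyError:
        raise ValueError(f"unknown form type: {form_type!r}") from None
```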

strasss commented 2 years ago

Would nmod:desc work also with בְּימים שלישי עד חמישי?

amir-zeldes commented 2 years ago

Would nmod:desc work also with בְּימים שלישי עד חמישי?

Yes, that's the same situation described in the paper for "pages 10-20" IMO. It's basically like a part of their name, but it's not flat - we know the "days" part is the head, and the rest is a child syntagm with internal structure (nmod with "until")
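
A rough sketch of the structure described here, written as a transliterated token table. The `nmod:desc` and `nmod` edges follow this comment; the two `case` attachments and the tuple layout are assumptions made for illustration, and the relation of "yamim" to the rest of the sentence is left out.

```python
# (id, form, head, deprel); head 0 marks the phrase head, whose relation
# to the rest of the sentence is not modelled in this sketch.
tokens = [
    (1, "be", 2, "case"),            # assumed attachment of the preposition
    (2, "yamim", 0, "_"),            # "days" is the head
    (3, "shlishi", 2, "nmod:desc"),  # the range hangs off "days"
    (4, "ad", 5, "case"),            # "until" marks the end of the range
    (5, "xamishi", 3, "nmod"),       # internal structure of the modifier
]


def children(head_id: int) -> list[str]:
    """Return the forms of all tokens attached to the given head id."""
    return [form for _, form, head, _ in tokens if head == head_id]
```

Here `children(2)` lists the dependents of "yamim", and `children(3)` shows that the "until" phrase sits inside the modifier rather than directly under the head.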

IsraelLand commented 2 years ago

Thank you!

You guys can open this if something seems unclear\weird

NathanD38 commented 2 years ago

@amir-zeldes Do you consider גיל 5 an example of nmod:desc or simply compound? It appears that this form is an abbreviation of גיל 5 שנים, which is analyzed as compound(gil,shanim) and nummod(shanim,5).

The combination גיל 18 ו-3 חודשים is also an abbreviation of גיל 18 שנים ו-3 חודשים. But, whereas the former calls for nmod:desc/compound(gil, 18), conj(18, xodashim), cc(xodashim, ve), and nummod(xodashim, 3); the latter calls for compound(gil, shanim), nummod(shanim, 18), conj(shanim, xodashim), cc(xodashim,ve) and nummod(xodashim, 3).

That is, we choose a different deprel, despite having essentially the same sequence, albeit with an implicit/elided nominal (here shanim).
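
To make the contrast concrete, the two analyses can be written as edge sets and compared. This is a minimal sketch: the tuple layout is an assumption, and `nmod:desc` is picked for the first edge of the short form only for illustration, since the choice between nmod:desc and compound is exactly what is being asked.

```python
# Edges are (deprel, head, dependent), transliterated as in the text above.
short_form = {  # gil 18 ve-3 xodashim
    ("nmod:desc", "gil", "18"),
    ("conj", "18", "xodashim"),
    ("cc", "xodashim", "ve"),
    ("nummod", "xodashim", "3"),
}
full_form = {  # gil 18 shanim ve-3 xodashim
    ("compound", "gil", "shanim"),
    ("nummod", "shanim", "18"),
    ("conj", "shanim", "xodashim"),
    ("cc", "xodashim", "ve"),
    ("nummod", "xodashim", "3"),
}

shared = short_form & full_form      # edges identical in both analyses
only_short = short_form - full_form  # edges specific to the elided form
only_full = full_form - short_form   # edges specific to the full form
```

Only the edges below "xodashim" are shared; everything involving the numeral 18 changes once "shanim" is overt.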

What about גיל שנתיים/מאתיים/אלפיים? Are these forms unequivocally compound or can they also be nmod:desc?

@shirawigi argued on Slack that nmod:desc is reserved for phrases which include a quasi-ordinal number (e.g., floor 8 vs. the 8th floor). Whereas we can use nmod:desc for 8 in floor 8 because it's in a quasi-ordinal position, 3 in gil 3 (shanim) is actually a cardinal number, counting the number of years X has been living on this earth (or simply exists).

  1. Can nmod:desc be given to tokens other than numerals? (e.g., שנתיים)
  2. Should hours also get nmod:desc when they are non-definite? (e.g., בשעה 4 (non-definite be-sha'a))
  3. Should phrases with implicit counted nouns (e.g., שנים) be assigned compound (the numeral promoted for the elided noun)?

amir-zeldes commented 2 years ago

I think some of this depends on your reconstruction of certain ellipses, which are maybe not totally uncontroversial. Let's consider these cases separately:

  1. Can nmod:desc be given to tokens other than numerals? (e.g., שנתיים)

I would say yes, since other items can be co-opted into the same grammatical environments, and I think at least for letters we have discussed this before (so "Plan A"). It's about the morphosyntactic environment, and some non-numerals (strictly speaking) satisfy this too IMO.

  2. Should hours also get nmod:desc when they are non-definite? (e.g., בשעה 4 (non-definite be-sha'a))

I think so - the indefinite case is the 'canonical' desc case. If it were definite, it would be appos, no?

  3. Should phrases with implicit counted nouns (e.g., שנים) be assigned compound (the numeral promoted for the elided noun)?

This is the trickiest one, and ultimately I think yes, but not because we need the reconstructed ellipsis for that. Numerals are already substitutive indefinite NPs, a bit like indefinite pronouns. The important aspects of nmod:desc are that it is non-invertible (therefore not appos), not a monolithic name with opaque internal structure (therefore not flat), and we know what the head is (ages are a type of age, which controls their distribution, so the word "gil" is the head). Before introducing this label, we could only tag these things as dep, or abuse the construction (mis-represent what the head is, pretend it is a compound, or something else). Overall, I think even just a numeral has the same properties, and is not different from "kav 18" (a bus line), which is not flat (it's a sub-type of "kav"), and we can get compositional syntax on the modifier ("kavim 18 ve-21")
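
The three properties listed above can be read as a rough decision procedure. The sketch below is only one way to encode them: the function name `suggest_relation` and its boolean parameters are hypothetical, and the `dep` fallback simply mirrors the remark that such cases could previously only be tagged as dep.

```python
def suggest_relation(invertible: bool, opaque_name: bool, head_known: bool) -> str:
    """Suggest a relation for a noun + designator pair such as 'kav 18'."""
    if invertible:
        return "appos"      # the two parts can swap places
    if opaque_name:
        return "flat"       # monolithic name, no visible internal structure
    if head_known:
        return "nmod:desc"  # clear head with a non-invertible modifier
    return "dep"            # nothing else fits
```

On the reasoning in this comment, "kav 18" would be called as `suggest_relation(False, False, True)`: it is not invertible, it is a sub-type of "kav" rather than an opaque name, and the head is known.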

shirawigi commented 2 years ago

@amir-zeldes

@shirawigi argued on Slack that nmod:desc is reserved for phrases which include a quasi-ordinal number (e.g., floor 8 vs. the 8th floor). Whereas we can use nmod:desc for 8 in floor 8 because it's in a quasi-ordinal position, 3 in gil 3 (shanim) is actually a cardinal number, counting the number of years X has been living on this earth (or simply exists).

I would like to further explain this :)

The important aspects of nmod:desc are that it is non-invertible (therefore not appos), not a monolithic name with opaque internal structure (therefore not flat), and we know what the head is (ages are a type of age, which controls their distribution, so the word "gil" is the head).

I read parts of your paper introducing the nmod:desc again, and most (if not all) of the numbered entities examples there can be replaced with an ordinal structure (this satisfies the second aspect of nmod:desc – a clear internal structure), e.g.:

Firefox (version) 58.0 --> 58.0th version of Firefox
Richard III – well, that’s simply an orthographic convention (isn’t it?), but it is uttered as Richard the third.
World War II --> the second world war (and in Hebrew it is only milxemt ha-olam ha-Sniya)
Figure 4 --> the 4th figure
pp. 5–10 --> the 5th page to the 10th page
Symphony No. 5 --> the 5th Symphony

In contrast, I think "kav 18" and "yexida 8200" don't satisfy the second condition - the internal structure here is indeed opaque. I assume here that you would argue that the internal structure of "kav 18" is an ordinal one (please correct me if I’m wrong), but "kav 18" is not the 18th bus line and "yexida 8200" is not the 8200th yexida, 18 and 8200 are just random numbers, serving as names here. Maybe we can use nummod:name here, as suggested in your article?

As for the current issue – gil 3: I think this is also different from the examples above (Richard III, figure 4 etc…) since the internal relation between ‘gil’ and ‘3’ is not an ordinal one. ‘gil 3’ is NOT equivalent to ‘ha-gil ha-SliSi’. I think 3 here is in a cardinal position, where the object\units it counts are omitted. Hence, I don’t think it should be treated the same as ‘figure 4’ et al. I also think it is different from ‘kav 18’ and ‘yexida 8200’, since it is not a name. I’m not sure how we should treat such an ellipsis, but maybe a ‘compound’ would work here – compound(gil, 3) ?

Thanks, Shira

amir-zeldes commented 2 years ago

most (if not all) of the numbered entities examples there can be replaced with an ordinal structure

Sure, but semantic substitutability with an ordinal does not tell us what the syntactic structure is. English ordinals are premodifiers, and have certain morphological forms, which these cases don't have (Firefox 58 is syntactically not the same as the 58th Firefox).

the internal structure here is indeed opaque

I think you are saying that semantically, "kav 18" does not have to be the 18th bus line; but we are not saying it has to be. Some books start at p. 3 due to blank pages, "Symphony No. 5" might follow "Symphony No. 3", with no "4", it might have been composed before "3", who knows? And even if some cases line up nicely in ordinal semantics, I don't think that is the deciding factor for the syntactic structure. The point is that it is a naming construction which is not flat, in which we know the lexical item is the head, and the coordination tests show this for all of these:

Even if we decided to use nummod:name (which is mainly a question of label sparseness/how many we want to introduce), I think the same label should apply to all of these cases.