IAHLT / UD_Hebrew

Hebrew Universal Dependencies Treebank

Tokens with upos X and deprel advmod #54

Open NathanD38 opened 2 years ago

NathanD38 commented 2 years ago

@amir-zeldes

In our recent meeting regarding expressions with foreign syntax, your instruction was to tag each token as upos=X with deprel flat from the first token. This applies to expressions such as סטטוס קוו.

The following sentence contains the Latin expressions "de facto" and "de jure" written in Hebrew, each of which I tagged as upos=X with deprel flat from the first token.

ממנה ניתן לעלות לכנסייה העליונה (גם היא בבעלות הקופטים דה יורה אולם בידי האתיופים דה פקטו), כנסיית ארבע החיות.

But what deprel should attach these expressions to their heads? They seem to function as ADV here, so they should receive advmod: advmod(ba'alut, de) and advmod(yedey, de) [I think בידי here is not an ADP].

The UD standard rules, however, require that only ADV should receive advmod.

Another example from @Hilla-Merhav involves code switching to an expression that also functions as ADV:

תגיעי לפה as soon as you can!

What should we do in such cases? Should the UD standard rule enforcing advmod for only tokens with upos=ADV have a caveat for tokens with upos=X?

amir-zeldes commented 2 years ago

You are absolutely correct in identifying this problem. In English, what we have done so far to circumvent it is to use an xpos (FW) that indicates foreign words. It is normally paralleled by upos X, except when the deprel is advmod, in which case upos is automatically converted to ADV in order to satisfy the validator. I agree that this is not optimal, so let me ping @dan-zeman:

If we have a foreign expression like "de facto" in Hebrew, there is no way to analyze it using Hebrew syntax, and it is definitely not a native adverb. However, it is also definitely being used as advmod. Should the validator just be adapted to allow advmod on upos=X, or is there a different solution you would recommend?

dan-zeman commented 2 years ago

I would use fixed(de, facto). Then the whole can be attached as advmod regardless of the UPOS tag of the technical head of the fixed phrase.

amir-zeldes commented 2 years ago

I'm not sure whether or not "de facto" should be fixed - there are potentially limitless expressions we could borrow from Latin, and I like for fixed to be a closed list. Would it be fixed if I say "ceteris paribus et suppositis supponendis"? I think if we are not doing foreign language syntax, then UD best practices suggest using flat, right?

And we also have to consider the situation that there is a single word token that is clearly foreign, but used adverbially... A single token version of @Hilla-Merhav 's example would do it:

dan-zeman commented 2 years ago

A single token can be tagged ADV. In multi-word expressions it may be a problem because none of the words is an adverb in isolation. But if ASAP is used as an adverb in Hebrew, treat it as a loanword.

As for fixed: I recall that the guidelines say something like "use it for function words and small adverbials". I am not sure how "small" adverbials differ from "large" adverbials, but it sounds like it could be used here. For real code switching which is not integrated into the host language, flat(:foreign) is the default, unless you analyze the foreign segment according to the UD guidelines for the source language, which is also possible.

amir-zeldes commented 2 years ago

OK, so I guess my question is mainly for the code switched case: if we use flat(:foreign) then can we tag X and deprel advmod?

dan-zeman commented 2 years ago

OK, so I guess my question is mainly for the code switched case: if we use flat(:foreign) then can we tag X and deprel advmod?

No, that will not work, the validator only checks fixed because flat can be just about anything. The relation advmod is meant for adverbs, not for long phrases; if you know the thing is advmod, you probably know it is ADV rather than X (unknown). But if you have a longer phrase used adverbially, then you might use obl or advcl.
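The check as described in this comment can be sketched in a few lines. This is only an illustration of the rule as stated in the thread, not the actual validator code; the `Token` tuple, the `advmod_violations` function, and the head id 1 in the examples are hypothetical.

```python
from typing import NamedTuple

class Token(NamedTuple):
    id: int
    form: str
    upos: str
    head: int
    deprel: str

def advmod_violations(tokens):
    """Return ids of advmod tokens that are neither ADV nor the head of a fixed phrase."""
    # Tokens that have at least one `fixed` child are exempt from the UPOS check.
    fixed_heads = {t.head for t in tokens if t.deprel == "fixed"}
    return [
        t.id
        for t in tokens
        if t.deprel.split(":")[0] == "advmod"   # ignore subtypes such as advmod:emph
        and t.upos != "ADV"
        and t.id not in fixed_heads
    ]

# "de facto" attached as advmod to a hypothetical head with id 1.
with_flat = [Token(2, "de", "X", 1, "advmod"), Token(3, "facto", "X", 2, "flat")]
with_fixed = [Token(2, "de", "X", 1, "advmod"), Token(3, "facto", "X", 2, "fixed")]

print(advmod_violations(with_flat))   # [2]
print(advmod_violations(with_fixed))  # []
```

Under this reading, the flat analysis is rejected and the fixed analysis passes, which is the asymmetry the rest of the thread discusses.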

amir-zeldes commented 2 years ago

Hm, this seems a little contradictory. If something is foreign, I should tag it X and give it Foreign in FEATS. But if it happens to be adverbial, then I have no choice but to tag it ADV so I can assign it advmod (but then I can't use Foreign, which should have upos=X). And if it's multiword, I should deprel it fixed (even if it is not a fixed expression) so that I can attach it as advmod (for example "de facto"), even though normally foreign syntax would be "flat", and then I can tag it X again.

All of this seems to me like just artefacts of the validation process, rather than a linguistically meaningful choice. I would prefer for anything that annotators take to be foreign (not an integrated loan) to be tagged X and given Foreign in feats, regardless of the number of tokens. And if it has multiple words, but is not an established fixed expression from a closed list, then deprel should be flat(:foreign), and the external dependency should be determined by function - this is also what we do for phrasal compounds, where a sentence may have internal syntax, but if it's a compound modifier then the deprel is compound (like "devil may care attitude").
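The scheme proposed in this comment could be sketched as follows. This is a hypothetical helper, not part of any UD tooling: `annotate_foreign_span` and the token ids in the example are made up for illustration, and the current validator would still reject the X + advmod combination it produces.

```python
def annotate_foreign_span(forms, start_id, head_id, function_deprel):
    """Tag every token X with Foreign=Yes, chain the span with flat:foreign
    from its first token, and attach the first token by its function."""
    rows = []
    for offset, form in enumerate(forms):
        is_first = offset == 0
        rows.append({
            "id": start_id + offset,
            "form": form,
            "upos": "X",
            "feats": "Foreign=Yes",
            # The first token attaches outside the span; the rest attach to it.
            "head": head_id if is_first else start_id,
            "deprel": function_deprel if is_first else "flat:foreign",
        })
    return rows

# "de facto" used adverbially; ids 5-6 and head id 4 are illustrative only.
for row in annotate_foreign_span(["דה", "פקטו"], start_id=5, head_id=4,
                                 function_deprel="advmod"):
    print(row)
```

The same helper would cover a single foreign token (a span of length one) and a longer code-switched phrase, since the number of tokens does not change the tagging.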

NathanD38 commented 2 years ago

@amir-zeldes For now, I've assigned dep to the token דה (de) in דה פקטו (de facto) and דה יורה (de jure). I've added a token-level comment stating that the deprel should be advmod.

@dan-zeman This was one such example of a borrowed Latin phrase used adverbially. I don't think we should further inflate our fixed list with every phrase we would encounter down the line. I also don't think that choosing a non-linguistically motivated deprel (e.g., obl or advcl) in this case would benefit us, aside from satisfying the validator's demands.

So we are left with the problem at hand:

dan-zeman commented 2 years ago

I don't think Foreign=Yes implies X. I have been using it with foreign words that got their UPOS and features according to the foreign grammar, and Lang=xx in MISC indicated what foreign language it was.

NathanD38 commented 2 years ago

@dan-zeman So, in cases where we have a foreign sequence, e.g.,

Plutarch attributed the following phrase to Julius Caesar: "veni, vidi, vici".

each of the Latin verbs will be tagged VERB, with the features Foreign=Yes and Lang=la.

But do we give them the features used in Classical Latin? lemma=venio|video|vinco, Aspect=Perf, Mood=Ind, Number=Sing, Person=1, Tense=Past, VerbForm=Fin, Voice=Act
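For concreteness, the annotation being asked about might look like the rows built below. This is a sketch only: token ids are local to the quoted phrase, and HEAD, DEPREL, and XPOS are left as `_` because they are not settled at this point in the thread. The helper name `conllu_row` is hypothetical.

```python
# Features listed in the question, plus Foreign=Yes.
FEATS = {
    "Aspect": "Perf", "Foreign": "Yes", "Mood": "Ind", "Number": "Sing",
    "Person": "1", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act",
}

def conllu_row(idx, form, lemma):
    """Build one 10-column CoNLL-U line for a Latin verb inside the host sentence."""
    # CoNLL-U expects features sorted alphabetically, ignoring case.
    feats = "|".join(
        f"{k}={v}" for k, v in sorted(FEATS.items(), key=lambda kv: kv[0].lower())
    )
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([str(idx), form, lemma, "VERB", "_", feats, "_", "_", "_", "Lang=la"])

rows = [
    conllu_row(i, form, lemma)
    for i, (form, lemma) in enumerate(
        [("veni", "venio"), ("vidi", "video"), ("vici", "vinco")], start=1
    )
]
print("\n".join(rows))
```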

dan-zeman commented 2 years ago

But do we give them the features used in Classical Latin?

Yes. In general, there are the following options:

NathanD38 commented 2 years ago

@dan-zeman @amir-zeldes I'm not sure I understand what the correct treatment of such sequences is. It seems that the first option will cause the validator to think the verbs are foreign in Latin, not in English, simply by the addition of Foreign=Yes in FEATS and Lang=la in MISC. On the other hand, if you only specify Lang=la, will the validator accept that as a legitimate way to indicate that this is a foreign sequence?

[As a side note, what is the phrase-external deprel here? Do we consider this parataxis(attributed, veni)?]

The second option is great for things like de facto, de jure, but the validator won't accept the phrase-external deprel as advmod, because we use upos X and not ADV.

dan-zeman commented 2 years ago

I'm not sure I understand what the correct treatment is of such sequences. It seems that the first option will cause the validator to think the verbs are foreign in Latin, not in English, simply by the addition of Foreign=Yes in Feats and Lang=la in Misc. On the other hand, if you only specify Lang=la, will the validator accept that as a legitimate option to indicate this is a foreign sequence?

If I recall correctly, the validator does not need to see Foreign=Yes in order to decide how to test the word. It only looks for Lang= in MISC. If Foreign=Yes is present, the validator "thinks that the verbs are foreign in Latin" only in the sense that it checks whether Foreign=Yes is a feature-value pair approved in Latin for the given UPOS category. I have been using Foreign=Yes this way when fixing annotations, and until this thread it did not occur to me that I was actually saying that the word is foreign in the other language.

On one hand, it would seem useful to have a rule that Foreign=Yes is an exception which is always interpreted with respect to the host language, and which should always be present when Lang=... is present. It would allow us to quickly filter out foreign word forms when collecting lexical data from the corpus. On the other hand, such a rule would only make sense if there is a single host language, and we also have code-switching corpora where none of the languages is dominant enough. So perhaps the rule should say that Foreign=Yes is interpreted with respect to the host language, but is only used if there is a single dominant language in the corpus. (This could actually be validated, because we have only a few corpora that are declared as code-switching, and they use a special language code.)
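The language lookup recalled in this comment can be sketched as below. This reflects the behaviour as described here from memory, not the actual validator source; the function name `validation_language` is hypothetical.

```python
def validation_language(misc: str, treebank_lang: str) -> str:
    """Pick the language whose feature inventory a token is checked against.

    Only Lang= in MISC matters; Foreign=Yes in FEATS plays no part in the choice.
    """
    if misc and misc != "_":
        for item in misc.split("|"):
            key, _, value = item.partition("=")
            if key == "Lang" and value:
                return value
    return treebank_lang

print(validation_language("Lang=la", "he"))                # la
print(validation_language("SpaceAfter=No|Lang=la", "en"))  # la
print(validation_language("_", "he"))                      # he
```

Under this reading, a token with Foreign=Yes and Lang=la has its features, including Foreign=Yes itself, checked against what is approved for Latin.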

[As a side note, what is the phrase-external deprel here? Do we consider this parataxis(attributed, veni)?]

In my opinion, veni depends on phrase, despite the non-projectivity. I think I would label the relation appos, although parataxis and acl would also work.

The second option is great for things like de facto, de jure, but the validator won't accept the phrase-external deprel as advmod, because we use upos X and not ADV.

Yes. I still don't think that we have to open the validator for X+advmod (which would reduce the number of errors caught elsewhere). First, advmod is normally used for single-word adverbs. If there are multiple tokens but we want to say they are one syntactic word, we can use fixed (that is okay for function words and for adverbs). And if we do not want to use that, then we can use obl instead of advmod (that would IMHO match the Latin de facto), or advcl if the phrase contains a predicate.
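The options laid out in this comment amount to a small decision procedure, sketched here for illustration. The function `external_deprel` and its parameters are hypothetical and only restate the choices named above; the single-token branch assumes the earlier suggestion that such a token is retagged ADV.

```python
def external_deprel(n_tokens: int, uses_fixed: bool, has_predicate: bool) -> str:
    """Choose the relation that attaches an adverbially used foreign expression."""
    if n_tokens == 1:
        return "advmod"  # single token: tag it ADV and treat it as a loanword
    if uses_fixed:
        return "advmod"  # a fixed phrase counts as one syntactic word
    # Otherwise avoid advmod: advcl if there is a predicate, obl if not.
    return "advcl" if has_predicate else "obl"

print(external_deprel(2, uses_fixed=True, has_predicate=False))   # advmod
print(external_deprel(2, uses_fixed=False, has_predicate=False))  # obl
print(external_deprel(5, uses_fixed=False, has_predicate=True))   # advcl
```

The three calls correspond to "de facto" analysed with fixed, "de facto" without fixed, and a clause-like phrase such as "as soon as you can".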

amir-zeldes commented 2 years ago

As a side note, what is the phrase-external deprel here? Do we consider this parataxis(attributed, veni)?

I think parataxis sounds better because it's not quite an apposition: not reversible and not adjacent (... to Caesar)

we also have code-switching corpora where none of the languages is dominant

I think just Foreign does not imply code-switching; code-switching can be identified and classified using CSType in MISC, see section 4.7 here:

https://arxiv.org/pdf/2011.02063.pdf