Open NathanD38 opened 2 years ago
You are absolutely correct in identifying this problem. In English, what we have done so far to circumvent it is to use an xpos (FW) which indicates foreign words and is normally paralleled by upos X, except if the deprel is advmod, in which case upos is automatically converted to ADV in order to satisfy the validator. I agree that this is not optimal, so let me ping @dan-zeman :
If we have a foreign expression like "de facto" in Hebrew, there is no way to analyze it using Hebrew syntax and it is definitely not a native adverb. However, it is also definitely being used as advmod. Should the validator just be adapted to allow advmod on upos=X, or is there a different solution you would recommend?
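The English workaround described above can be sketched roughly as follows. This is only an illustration of the rule as stated in this comment, not the actual conversion code; the function name `upos_for_foreign` and its arguments are hypothetical.

```python
def upos_for_foreign(xpos: str, upos: str, deprel: str) -> str:
    """Return the UPOS to emit for a token under the workaround described above.

    Hypothetical helper: tokens with xpos FW normally get upos X, but are
    switched to ADV when attached as advmod so that the validator accepts them.
    """
    if xpos != "FW":
        return upos  # not marked as a foreign word: leave the tag untouched
    # Compare only the universal part of the relation (drop any subtype).
    if deprel.split(":")[0] == "advmod":
        return "ADV"
    return "X"
```

The sketch makes the objection visible: the UPOS of a foreign token ends up depending on its attachment rather than on the token itself.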
I would use fixed(de, facto). Then the whole can be attached as advmod regardless of the UPOS tag of the technical head of the fixed phrase.
I'm not sure whether or not "de facto" should be fixed - there are potentially limitless expressions we could borrow from Latin, and I like for fixed to be a closed list. Would it be fixed if I say "ceteris paribus et suppositis supponendis"? I think if we are not doing foreign language syntax, then UD best practices suggest using flat, right?
And we also have to consider the situation that there is a single word token that is clearly foreign, but used adverbially... A single token version of @Hilla-Merhav 's example would do it:
A single token can be tagged ADV. In multi-word expressions it may be a problem because none of the words is an adverb in isolation. But if ASAP is used as an adverb in Hebrew, treat it as a loanword.
As for fixed: I recall that the guidelines say something like use it for function words and small adverbials. Not sure how "small" adverbials differ from "large" adverbials but it sounds like it could be used here. For real code switching which is not integrated into the host language, flat(:foreign) is the default, unless you analyze the foreign segment according to the UD guidelines for the source language, which is also possible.
OK, so I guess my question is mainly for the code switched case: if we use flat(:foreign) then can we tag X and deprel advmod?
No, that will not work, the validator only checks fixed because flat can be just about anything. The relation advmod is meant for adverbs, not for long phrases; if you know the thing is advmod, you probably know it is ADV rather than X (unknown). But if you have a longer phrase used adverbially, then you might use obl or advcl.
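To make the constraint concrete, here is a minimal sketch of the check as it is described in this thread: advmod is accepted on a token tagged ADV, or on the head of a fixed expression, and rejected otherwise. It is not the validator's actual code, and the real rule may have further exceptions that are not modelled here. The token dictionaries, the names `base` and `advmod_violations`, and the three-token fragment are all hypothetical.

```python
from typing import Dict, List, Union

Token = Dict[str, Union[int, str]]  # hypothetical minimal token: id, head, upos, deprel


def base(deprel: str) -> str:
    """Strip any subtype, e.g. 'flat:foreign' -> 'flat'."""
    return deprel.split(":")[0]


def advmod_violations(tokens: List[Token]) -> List[int]:
    """Return ids of tokens attached as advmod that are neither ADV
    nor the head of a fixed expression."""
    fixed_heads = {t["head"] for t in tokens if base(str(t["deprel"])) == "fixed"}
    return [
        int(t["id"])
        for t in tokens
        if base(str(t["deprel"])) == "advmod"
        and t["upos"] != "ADV"
        and t["id"] not in fixed_heads
    ]


# Hypothetical fragment: "de facto" (tokens 1-2) attached to a verb (token 3).
with_flat = [
    {"id": 1, "head": 3, "upos": "X", "deprel": "advmod"},
    {"id": 2, "head": 1, "upos": "X", "deprel": "flat:foreign"},
    {"id": 3, "head": 0, "upos": "VERB", "deprel": "root"},
]
# Same fragment, but with the phrase-internal relation changed to fixed.
with_fixed = [dict(t) for t in with_flat]
with_fixed[1]["deprel"] = "fixed"
```

Under these assumptions, `advmod_violations(with_flat)` reports token 1, while `advmod_violations(with_fixed)` reports nothing, which is why fixed is being suggested as the way to keep advmod.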
Hm, this seems a little contradictory: if something is foreign, I should tag it X and give it Foreign in FEATs; but if it happens to be adverbial, then I have no choice but to tag it ADV so I can assign it advmod (but then I can't use Foreign, which should have upos=X); but if it's multiword, I should deprel it fixed (even if it is not a fixed expression) so that I can attach it as advmod (for example "de facto"), even though normally foreign syntax would be "flat", and then I can tag it X again.
All of this seems to me like just artefacts of the validation process, rather than a linguistically meaningful choice. I would prefer for anything that annotators take to be foreign (not an integrated loan) to be tagged X and given Foreign in feats, regardless of the number of tokens. And if it has multiple words, but is not an established fixed expression from a closed list, then deprel should be flat(:foreign), and the external dependency should be determined by function - this is also what we do for phrasal compounds, where a sentence may have internal syntax, but if it's a compound modifier then the deprel is compound (like "devil may care attitude").
@amir-zeldes For now, I've assigned dep to the token דה (de) in דה פקטו (de facto) and דה יורה (de jure). I've added a token-level comment stating that the deprel should be advmod.
@dan-zeman This was one such example of a borrowed Latin phrase used adverbially. I don't think we should further inflate our fixed list with every phrase we would encounter down the line. I also don't think that choosing a non-linguistically motivated deprel (e.g., obl or advcl) in this case would benefit us, aside from satisfying the validator's demands.
So we are left with the problem at hand:
[...] X with flat syntax, and receive Foreign=Yes.
[...] advmod to anything other than tokens tagged ADV (or fixed MWEs).
[...] advmod as its external deprel.

I don't think Foreign=Yes implies X. I have been using it with foreign words that got their UPOS and features according to the foreign grammar, and Lang=xx in MISC indicated what foreign language it was.
@dan-zeman So, in cases where we have a foreign sequence, e.g.,
Plutarch attributed the following phrase to Julius Caesar: "veni, vidi, vici".
each of the Latin verbs will be tagged VERB, with the features Foreign=Yes and Lang=la.
But do we give them the features used in Classical Latin?
lemma=venio|video|vinco, Aspect=Perf, Mood=Ind, Number=Sing, Person=1, Tense=Past, VerbForm=Fin, Voice=Act
But do we give them the features used in Classical Latin?
Yes. In general, there are the following options:
[...] Lang=la to MISC, which will cause the validator to use Latin validation rules (e.g., what is or is not auxiliary, which features are allowed with which UPOS etc.). Technically it should be possible to use Foreign=Yes in FEATS as well (that feature should be allowed with any UPOS in any language), although it now occurs to me that it is not clear what exactly the rule should be here: if we say that the features are interpreted as Latin, then one might say that Foreign=Yes means it is foreign in Latin, not in the surrounding English... I'm not sure what's better here. Phrase-internal deprels are supposedly Latin as well, but the validator currently does not switch to Latin-specific subtypes.
[...] X, Foreign=Yes as the only feature, and flat:foreign as the phrase-internal relation.
@dan-zeman @amir-zeldes
I'm not sure I understand what the correct treatment is of such sequences. It seems that the first option will cause the validator to think the verbs are foreign in Latin, not in English, simply by the addition of Foreign=Yes in Feats and Lang=la in Misc. On the other hand, if you only specify Lang=la, will the validator accept that as a legitimate option to indicate this is a foreign sequence?
[As a side note, what is the phrase-external deprel here? Do we consider this parataxis(attributed, veni)?]
The second option is great for things like de facto, de jure, but the validator won't accept the phrase-external deprel as advmod, because we use upos X and not ADV.
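As a rough side-by-side of the two options for a token such as "veni", the sketch below builds the FEATS string for each. It assumes the usual CoNLL-U convention that features are joined with "|" and sorted alphabetically by name, and it uses the Latin feature values listed earlier in the thread. Whether Foreign=Yes belongs in FEATS under the first option is exactly the open question here, so its inclusion is an assumption; `format_feats`, `option_1` and `option_2` are hypothetical names.

```python
from typing import Dict


def format_feats(feats: Dict[str, str]) -> str:
    """Serialize features CoNLL-U style: Name=Value pairs joined by '|',
    sorted case-insensitively by feature name; '_' when empty."""
    if not feats:
        return "_"
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)


# Latin features for "veni", copied from the list given earlier in the thread.
LATIN_FEATS = {
    "Aspect": "Perf", "Mood": "Ind", "Number": "Sing", "Person": "1",
    "Tense": "Past", "VerbForm": "Fin", "Voice": "Act",
}

# Option 1: analyze according to the source language, flag the language in MISC.
# (Adding Foreign=Yes here is an assumption, not a settled rule.)
option_1 = {
    "upos": "VERB",
    "feats": format_feats({**LATIN_FEATS, "Foreign": "Yes"}),
    "misc": "Lang=la",
}

# Option 2: no source-language analysis at all.
option_2 = {
    "upos": "X",
    "feats": format_feats({"Foreign": "Yes"}),
    "internal_deprel": "flat:foreign",
}
```

Only the second option runs into the advmod restriction, because only there is the head of the phrase tagged X.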
I'm not sure I understand what the correct treatment is of such sequences. It seems that the first option will cause the validator to think the verbs are foreign in Latin, not in English, simply by the addition of Foreign=Yes in Feats and Lang=la in Misc. On the other hand, if you only specify Lang=la, will the validator accept that as a legitimate option to indicate this is a foreign sequence?
If I recall it correctly, the validator does not need to see Foreign=Yes in order to decide how to test the word. It only looks for Lang= in MISC. If Foreign=Yes is present, the validator "thinks that the verbs are foreign in Latin" only in the sense that it checks whether Foreign=Yes is a feature-value pair approved in Latin for the given UPOS category. I have been using Foreign=Yes this way when fixing annotations, and it did not occur to me that I was actually saying that the word is foreign in the other language, until this thread. On one hand, it would seem useful to have a rule that Foreign=Yes is an exception which is always interpreted with respect to the host language, and which should always be present when Lang=... is present. It would allow us to quickly filter out foreign word forms when collecting lexical data from the corpus. On the other hand, such a rule would only make sense if there is a single host language. But we also have code-switching corpora where none of the languages is dominant enough. So perhaps the rule should say that Foreign=Yes is interpreted with respect to the host language, but is only used if there is a single dominant language in the corpus. (This could be actually validated because we have only a few corpora that are declared as code-switching, and they use a special language code.)
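The tentative rule floated above could look something like the following. This is a sketch of a proposal, not of anything the validator currently does; the token layout (dicts with `feats` and `misc` mappings), the `code_switching_corpus` flag, and both function names are hypothetical.

```python
from typing import Any, Dict, List

Token = Dict[str, Any]  # hypothetical: id, form, feats (dict), misc (dict)


def missing_foreign(tokens: List[Token], code_switching_corpus: bool = False) -> List[int]:
    """Ids of tokens that carry Lang=... in MISC but lack Foreign=Yes in FEATS.

    The check is skipped for corpora declared as code-switching, where no
    single host language exists.
    """
    if code_switching_corpus:
        return []
    return [
        t["id"]
        for t in tokens
        if "Lang" in t["misc"] and t["feats"].get("Foreign") != "Yes"
    ]


def native_forms(tokens: List[Token]) -> List[str]:
    """Word forms not marked Foreign=Yes, e.g. for collecting lexical data."""
    return [t["form"] for t in tokens if t["feats"].get("Foreign") != "Yes"]


# Hypothetical tokens: the third one has Lang=la but no Foreign=Yes.
sample = [
    {"id": 1, "form": "phrase", "feats": {"Number": "Sing"}, "misc": {}},
    {"id": 2, "form": "veni", "feats": {"Foreign": "Yes"}, "misc": {"Lang": "la"}},
    {"id": 3, "form": "vidi", "feats": {}, "misc": {"Lang": "la"}},
]
```

With this sample, `missing_foreign(sample)` flags token 3, and `native_forms(sample)` still lets "vidi" through, which illustrates why the filter is only reliable if Foreign=Yes consistently accompanies Lang=.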
[As a side note, what is the phrase-external deprel here? Do we consider this parataxis(attributed, veni)?]
In my opinion, veni depends on "phrase", despite the non-projectivity. I think I would label the relation appos, although parataxis and acl would also work.
The second option is great for things like de facto, de jure, but the validator won't accept the phrase-external deprel as advmod, because we use upos X and not ADV.
Yes. I still don't think that we have to open the validator for X+advmod (which would reduce the number of errors caught elsewhere). First, advmod is normally used for single-word adverbs. If there are multiple tokens but we want to say they are one syntactic word, we can use fixed (that is okay for function words and for adverbs). And if we do not want to use that, then we can use obl instead of advmod (that would IMHO match the Latin de facto), or advcl if the phrase contains a predicate.
As a side note, what is the phrase-external deprel here? Do we consider this parataxis(attributed, veni)?
I think parataxis sounds better because it's not quite an apposition: not reversible and not adjacent (... to Caesar)
we also have code-switching corpora where none of the languages is dominant
I think just Foreign does not imply code-switching; code-switching can be identified and classified using CSType in MISC, see section 4.7 here:
@amir-zeldes
In our recent meeting regarding expressions with foreign syntax, your instruction was to tag each token as upos=X with deprel flat from the first token. This applies to expressions such as סטטוס קוו.
The following sentence contains the Latin expressions "de facto" and "de jure" written in Hebrew, each of which I tagged as upos=X with deprel flat from the first token.
But what is the deprel to those expressions? It seems they function as ADV here, so they should receive advmod (ba'alut, de); (yedey, de) [I think בידי here is not an ADP].
The UD standard rules, however, require that only ADV should receive advmod.
Another example from @Hilla-Merhav involves code switching to an expression that also functions as ADV:
What should we do in such cases? Should the UD standard rule enforcing advmod for only tokens with upos=ADV have a caveat for tokens with upos=X?