Document policy for foreign expressions and code-switching #1001

Closed (nschneid closed this issue 11 months ago)

nschneid commented 11 months ago

We have some documentation of the Foreign feature, a mention of foreign words in the X tag, and foreign expressions as an example of flat. But I can't find an overarching discussion of how to deal with foreign expressions.

Would the morphology overview be a good place for this?

Here is a crack at some text, including clarifications that were decided by the core group:

Foreign expressions and code-switching

A text that is primarily in one language may contain material that originated in another language. UD offers a few options for annotating such material, which we term cross-lingual content.

Cross-lingual and cross-orthography metadata—translations, glosses, transliterations—may also be provided.

When it comes to the morphosyntactic annotation of an expression originating in another language, there are a few options:

Option 1: Code-switched analysis

A treebank may opt to fully analyze the cross-lingual content as if it were in a treebank for the source language. This simulates a speaker with knowledge of the morphosyntax of both of the intermixed languages. The language of any content analyzed in this manner should be specified on individual tokens with the MISC feature Lang=CODE, as described here: this makes it clear which annotation guidelines are being followed for the cross-lingual content so that the annotations can be properly validated. The Foreign feature does not apply here.

That would be a/DET coup/NOUN d'/ADP état/NOUN .

Treebanks have wide latitude to decide what counts as a different language/code and whether to analyze its structure or not. However, this strategy is generally not recommended for names mentioned in isolation.

Option 2: Borrowed analysis

Another option is to analyze the cross-lingual content as if it is part of the vocabulary of the main language of the text. Tokenization principles of the main language, not the donor language, would be expected to apply. Borrowed words are not marked with Foreign=Yes because they are taken to be incorporated into the target language. However, the donor language may be made explicit with the OrigLang feature in MISC.

For multiword expressions, the UPOS and morphological features of the expression as a whole are copied to all the individual words, which are connected to the first word in a flat structure. (For names, the subtyped relation flat:name may optionally be used.)

Nominals—including concept terms, personal names, and book titles—are frequently borrowed and would typically be analyzed in this way. Other vocabulary may be considered borrowed as well.

Yeah , I think that would be kosher/ADJ .

That would be a/DET coup/NOUN d'état/NOUN .

We saw it on Al/PROPN Jazeera/PROPN .

Option 3: Foreign analysis

The third option is to treat the cross-lingual content as wholly unanalyzable foreign material. Words should receive the feature Foreign=Yes in FEATS and be tagged as X. Sequences of multiple foreign words are joined together by flat (optionally subtyped as flat:foreign). In contrast to Option 2, this is best suited to phrasal idioms, quoted utterances, and metalinguistic mentions. The foreign language, if known, is best made explicit with the OrigLang feature in MISC.

Well , c'est/X la/X vie/X .

" Lehitraot/X " means " see you later " in Hebrew .

" Monsieur/X , " she chided me , " nous/X ne/X parlons/X pas/X anglais/X dans/X cette/X classe/X !/PUNCT "

rueter commented 11 months ago

Thank you @nschneid for this simplified version. It looks like it will get much more complex in the long run. In example number three, the vocative "Monsieur" is likely to be recognized as a title, and therefore it would be tagged as NOUN. In a similar way, "c'est la vie" would be recognized as a kind of formulaic expression, i.e., we would get an INTJ-type phrase head tied together with flat deps, so it might seem that the relation would be discourse.

dan-zeman commented 11 months ago

Would the morphology overview be a good place for this?

No. It is not just about morphology. Code switching has implications for syntax, too.

dan-zeman commented 11 months ago

Option 1: Code-switched analysis

A treebank may opt to fully analyze the cross-lingual content as if it were in a treebank for the source language. ... The Foreign feature does not apply here.

The Foreign=Yes feature does apply here and when I annotate, I use both Foreign=Yes in FEATS and Lang=xx in MISC. It is useful because I can still collect paradigm tables from the corpus (form + lemma + UPOS + features) and easily spot and skip foreign words (and typos). I don't have to check two places at once (especially because not every foreign word has the Lang attribute). The validator should be able to handle this (essentially assuming that if Lang=xx is given, all features are seen as features of language xx, except for the Foreign=Yes feature, which is interpreted from the perspective of the host language).

But the feature would not apply when the whole corpus is declared as code-switching, i.e., none of the (typically two) languages is considered domestic. (And assuming that the word in question belongs to one of the code-switching languages and not to a third one.)
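
The paradigm-table use case above can be sketched roughly as follows. This is a minimal illustration, not an existing UD tool: the function names `parse_feats` and `collect_paradigms` and the three-token fragment are made up for the example, and the only thing it relies on is the standard 10-column CoNLL-U layout.

```python
from collections import defaultdict


def parse_feats(column):
    """Turn a FEATS column such as "Case=Dat|Number=Sing" (or "_") into a dict."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|"))


def collect_paradigms(conllu_text):
    """Map (lemma, UPOS) to the set of (form, FEATS) pairs, skipping Foreign=Yes words."""
    paradigms = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only regular word lines: 10 columns and a plain integer ID
        # (skips comments, blank lines, ranges like 1-2 and empty nodes like 1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
        if parse_feats(feats).get("Foreign") == "Yes":
            continue  # one check in FEATS is enough; MISC is never consulted
        paradigms[(lemma, upos)].add((form, feats))
    return dict(paradigms)


# Invented fragment of "That would be a coup d'état ." with French words
# annotated as code-switched (Foreign=Yes in FEATS, Lang=fr in MISC).
rows = [
    ["3", "be", "be", "AUX", "_", "VerbForm=Inf", "5", "cop", "_", "_"],
    ["5", "coup", "coup", "NOUN", "_", "Foreign=Yes|Gender=Masc|Number=Sing",
     "0", "root", "_", "Lang=fr"],
    ["7", "état", "état", "NOUN", "_", "Foreign=Yes|Gender=Masc|Number=Sing",
     "5", "nmod", "_", "Lang=fr"],
]
sample = "\n".join("\t".join(row) for row in rows)
paradigms = collect_paradigms(sample)
print(paradigms)
```

With this fragment only the entry for "be" survives, because both French nouns carry Foreign=Yes.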

dan-zeman commented 11 months ago

Option 2: Borrowed analysis

Another option is to analyze the cross-lingual content as if it is part of the vocabulary of the main language of the text.

I would add that the borrowed analysis is preferable (if not the only possible) when the borrowed word has acquired morphology of the host language, different from the morphology of the source language. For example,

[cs] Jeďte po dálnici až k exitu 36. "Follow the highway until exit 36."

Here, exit is borrowed from English (pure Czech would be k výjezdu 36) but it has a form that does not exist in English and it should receive the Czech features Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing|Polarity=Pos.

Similarly, domesticated spelling is a signal of borrowing. For example, in Czech you can encounter

[cs] O tomhle nemá ánunk. "He has no idea about this."

where ánunk comes from German and its original spelling is Ahnung "idea".

A gray area arises when the original language uses a different writing system. The word or phrase will probably appear transcribed in the host text but this does not necessarily make it a borrowing. On the other hand, it does not follow the original spelling, which makes it difficult to use the code-switching analysis and please the validator. For example,

[cs] Rus se zvedl a řekl: Vsjo búdět v parjádke. "The Russian got up and said: Vsyo budet v poryadke."

Here, the Russian phrase is transcribed from Все будет в порядке. It is certainly not a borrowing. But if we want the code-switching analysis, we must acknowledge that búdět is AUX. Then this transcribed form must still get the Russian lemma быть so that the validator can find it on the list of Russian auxiliaries.
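
To make the lemma point concrete, here is a small sketch of a lemma-based auxiliary check keyed on the token's Lang value. It is not the actual validator code: `AUX_LEMMAS`, `token_lang` and `aux_errors` are hypothetical names, and the two lemma lists are deliberately incomplete stand-ins for the real per-language lists.

```python
# Hypothetical, incomplete auxiliary lemma lists; the real ones are maintained by UD.
AUX_LEMMAS = {"cs": {"být"}, "ru": {"быть"}}


def token_lang(misc, treebank_lang):
    """Return the Lang=xx value from MISC, or the treebank language if absent."""
    for item in misc.split("|"):
        if item.startswith("Lang="):
            return item[len("Lang="):]
    return treebank_lang


def aux_errors(tokens, treebank_lang):
    """List AUX tokens whose lemma is not a known auxiliary of the token's language."""
    errors = []
    for form, lemma, upos, misc in tokens:
        if upos != "AUX":
            continue
        lang = token_lang(misc, treebank_lang)
        if lemma not in AUX_LEMMAS.get(lang, set()):
            errors.append(f"{form}: lemma '{lemma}' is not a known auxiliary in [{lang}]")
    return errors


# The transcribed form búdět inside a Czech treebank, tagged AUX with Lang=ru.
transcribed_lemma = [("búdět", "búdět", "AUX", "Lang=ru")]
russian_lemma = [("búdět", "быть", "AUX", "Lang=ru")]
print(aux_errors(transcribed_lemma, "cs"))  # reports the unknown lemma
print(aux_errors(russian_lemma, "cs"))      # empty list
```

Under these assumptions the transcribed form passes only when it keeps the Cyrillic lemma, which is the situation described above.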

Finally, I would say that modification of the foreign word by a non-foreign word is also a sign of borrowing:

[cs] To, co jste předvedli, bylo velké faux-pas. "What you performed was a major faux-pas."

nschneid commented 11 months ago

when the whole corpus is declared as code-switching

Is this done in the metadata somewhere?

Can a language code be provided at the level of a document or sentence, or does Lang have to be specified for every word in the sentence even if a large passage is not in the main treebank language?

dan-zeman commented 11 months ago

when the whole corpus is declared as code-switching

Is this done in the metadata somewhere?

Yes. Such treebanks are assigned to an artificial "language" that in fact represents two languages (which may or may not exist in UD separately). It has a private-use ISO code and its "family" is "Code switching". At present we have 5 such languages in the system:

qfn Frisian Dutch
qhe Hindi English
qaf Maghrebi Arabic French
qee Spanish English
qtd Turkish German

For example, Turkish-German uses the Lang attribute on every non-punctuation token (at least it looks so; I did not verify it). Most words are either Turkish (Lang=tr) or German (Lang=de). Some words are Lang=qtd because they belong to both languages at once: e.g., Prüfungum is a German noun with a Turkish possessive suffix. In addition, they have a CSID attribute in MISC (code switching ID?) which seems to be one of TR, DE, MIXED, OTHER. There is obviously a large overlap with Lang but they had this annotation first and want to preserve it. They use Foreign=Yes for tokens that are neither Turkish nor German, for example for the English preposition of (but also for names of English-speaking people, which I would not do).

Besides the five code switching languages above, some other treebanks may contain a significant amount of code switching even though they are assigned to one language. Sometimes it means that code switching has become part of the language because its speakers live under heavy influence of a majority language. I believe this is the case of Komi Zyrian IKDP (code switching with Russian). In this case the metadata will not directly reveal it (except that you can search for the Lang attributes and count them).
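
Searching for the Lang attributes and counting them could look roughly like this. It is a sketch that assumes plain 10-column CoNLL-U input; `count_lang` is a made-up helper, and the sample rows use placeholder forms (w1, w2, w3) rather than real treebank data.

```python
from collections import Counter


def count_lang(conllu_text):
    """Count Lang=xx values in MISC; words without Lang are counted under None."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip comments, blank lines, multiword ranges, empty nodes
        lang = None
        for item in cols[9].split("|"):
            if item.startswith("Lang="):
                lang = item[len("Lang="):]
        counts[lang] += 1
    return counts


# Placeholder rows: only the ID and MISC columns matter for the count.
rows = [
    ["1", "w1", "_", "NOUN", "_", "_", "_", "_", "_", "Lang=ru"],
    ["2", "w2", "_", "VERB", "_", "_", "_", "_", "_", "_"],
    ["3", "w3", "_", "NOUN", "_", "_", "_", "_", "_", "Lang=ru|SpaceAfter=No"],
    ["4", ".", "_", "PUNCT", "_", "_", "_", "_", "_", "_"],
]
counts = count_lang("\n".join("\t".join(row) for row in rows))
print(counts)
```

Dividing `counts["ru"]` by the total gives a rough measure of how much code switching a nominally monolingual treebank contains.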

Can a language code be provided at the level of a document or sentence, or does Lang have to be specified for every word in the sentence even if a large passage is not in the main treebank language?

As far as the validator is concerned, it must be on each token individually. It is much easier to process (not just for the validator but for any tool that is interested in Lang).

nschneid commented 11 months ago

OK thanks. Here is a second draft:

Foreign expressions and code-switching

A corpus may contain material from multiple languages. There are a few scenarios for how this is annotated, depending on the prevalence of multiple languages in the corpus and the extent to which expressions have been sufficiently integrated into a new language that they can be considered borrowings.

Inherently code-switched corpora

Every UD corpus is listed under an ISO language code. Most UD corpora have a single primary language. A few corpora, however, feature extensive code-switching between multiple (usually two) languages, and are listed under a custom code for the code-switched language variety. For example, the Turkish German variety bears the qtd code at the corpus level.

In inherently code-switched corpora, every word must have a Lang feature in the MISC column to indicate which language it belongs to. Most often, it will be one of the languages comprising the multilingual variety (for Turkish German, either Turkish Lang=tr or German Lang=de). Occasionally, a word will be specific to the multilingual variety (Lang=qtd). None of these are considered foreign in the context of the corpus.

It is, of course, possible for an inherently code-switched corpus to contain expressions from "third party" languages. These are annotated as cross-lingual content as described below.
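
As a rough illustration of the rule above, the following sketch flags words that lack Lang and lists words whose Lang falls outside the corpus languages. `lang_problems` is a hypothetical helper, the token list is invented, and skipping PUNCT is an assumption taken from the observation earlier in this thread that Turkish German puts Lang on every non-punctuation token.

```python
def lang_problems(tokens, corpus_code, component_langs, skip_upos=("PUNCT",)):
    """Return (words missing Lang, words whose Lang is a third-party language)."""
    inherent = set(component_langs) | {corpus_code}
    missing, third_party = [], []
    for form, upos, misc in tokens:
        if upos in skip_upos:
            continue  # assumption: punctuation does not need Lang
        langs = [m[len("Lang="):] for m in misc.split("|") if m.startswith("Lang=")]
        if not langs:
            missing.append(form)
        elif langs[0] not in inherent:
            third_party.append(form)  # to be treated as cross-lingual content
    return missing, third_party


# Invented (form, UPOS, MISC) triples for a corpus listed under qtd.
tokens = [
    ("Prüfungum", "NOUN", "Lang=qtd"),
    ("of", "ADP", "Lang=en"),
    ("und", "CCONJ", "_"),
    (".", "PUNCT", "_"),
]
missing, third_party = lang_problems(tokens, "qtd", ["tr", "de"])
print(missing, third_party)
```

Here "und" is reported as missing its Lang value and "of" as third-party material.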

Cross-lingual content

When a corpus contains material from a language other than its declared language(s), UD offers a few options for annotating such material, which we term cross-lingual content. It may be analyzed as either foreign or borrowed.

(Here we distinguish content—the words of a sentence in the main annotation—from metadata. Cross-lingual and cross-orthography metadata—translations, glosses, transliterations—may also be provided.)

For morphosyntactic annotation of an expression originating in another language, there are a few options:

Option 1: Code-switched analysis

A treebank may opt to fully analyze the cross-lingual content as if it were in a treebank for the source language. This simulates a speaker with knowledge of the morphosyntax of both of the intermixed languages. The language of any content analyzed in this manner should be specified on individual tokens with the MISC feature Lang=CODE, as described here: this makes it clear which annotation guidelines are being followed for the cross-lingual content so that the annotations can be properly validated. Unless the language is inherently associated with the corpus-level language code (see Inherently code-switched corpora above), the cross-lingual portion is considered foreign material and should be annotated with Foreign=Yes in FEATS.

That would be a/DET coup/NOUN d'/ADP état/NOUN .

Treebanks have wide latitude to decide what counts as a different language/code and whether to analyze its structure or not. However, this strategy is generally not recommended for names mentioned in isolation.
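
A minimal sketch of what marking a token for Option 1 involves, assuming the 10-column CoNLL-U layout (FEATS is column 6, MISC is column 10). The helpers `add_feature` and `mark_code_switched` are invented for illustration. Sorting is applied to both columns for simplicity, although only FEATS is expected to be in alphabetical order.

```python
def add_feature(column, key, value):
    """Add key=value to a FEATS or MISC column ("_" when empty), sorted case-insensitively."""
    pairs = {} if column == "_" else dict(item.split("=", 1) for item in column.split("|"))
    pairs[key] = value
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)


def mark_code_switched(cols, lang, inherent_langs=()):
    """Copy a token, adding Lang=lang, plus Foreign=Yes unless lang is inherent to the corpus."""
    cols = list(cols)
    cols[9] = add_feature(cols[9], "Lang", lang)
    if lang not in inherent_langs:
        cols[5] = add_feature(cols[5], "Foreign", "Yes")
    return cols


# "coup" analyzed with French features inside an English sentence.
coup = ["5", "coup", "coup", "NOUN", "_", "Gender=Masc|Number=Sing", "0", "root", "_", "_"]
marked = mark_code_switched(coup, "fr")
print("\t".join(marked))
```

Passing `inherent_langs=("fr",)` would leave FEATS untouched, which corresponds to the case of an inherently code-switched corpus.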

Option 2: Borrowed analysis

Another option is to analyze the cross-lingual content as if it is part of the vocabulary of the main language of the text. Tokenization principles of the main language, not the donor language, would be expected to apply. Borrowed words are not marked with Foreign=Yes because they are taken to be incorporated into the target language. However, the donor language may be made explicit with the OrigLang feature in MISC.

For multiword expressions, the UPOS and morphological features of the expression as a whole are copied to all the individual words, which are connected to the first word in a flat structure. (For names, the subtyped relation flat:name may optionally be used.)

Nominals—including concept terms, personal names, and book titles—are frequently borrowed and would typically be analyzed in this way. Other vocabulary may be considered borrowed as well.

Yeah , I think that would be kosher/ADJ .

That would be a/DET coup/NOUN d'état/NOUN .

We saw it on Al/PROPN Jazeera/PROPN .
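
The multiword rule above (copy the UPOS of the whole expression to each word, attach every later word to the first one with flat) can be sketched as follows. `flat_expression` is a hypothetical helper, using the form as the lemma is a simplification, and OrigLang=ar in the example is an illustrative assumption about how the donor language might be recorded.

```python
def flat_expression(forms, upos, first_id, head, deprel,
                    flat_rel="flat", feats="_", misc="_"):
    """Build 10-column CoNLL-U rows for a multiword expression headed by its first word."""
    rows = []
    for offset, form in enumerate(forms):
        word_id = first_id + offset
        if offset == 0:
            word_head, word_rel = head, deprel        # first word attaches to the outside
        else:
            word_head, word_rel = first_id, flat_rel  # later words attach to the first
        # Lemma = form is a simplifying assumption for this sketch.
        rows.append([str(word_id), form, form, upos, "_", feats,
                     str(word_head), word_rel, "_", misc])
    return rows


# "We saw it on Al Jazeera ." : Al is word 5, attached to "saw" (word 2) as obl.
rows = flat_expression(["Al", "Jazeera"], "PROPN", 5, 2, "obl",
                       flat_rel="flat:name", misc="OrigLang=ar")
for row in rows:
    print("\t".join(row))
```

The same helper would cover unanalyzed foreign sequences by passing `upos="X"`, `feats="Foreign=Yes"` and `flat_rel="flat:foreign"`.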

If a word from another language has target-language inflectional morphology, this should be treated as borrowed so the morphology can be properly encoded in features. Take this Czech example:

Jeďte po dálnici až k exitu/NOUN 36. "Follow the highway until exit 36."

The form "exitu" does not exist in English and must therefore receive Czech morphological features. A borrowed expression may also bear target-language modifiers, for example.

Option 3: Foreign analysis

The third option is to treat the cross-lingual content as wholly unanalyzable foreign material. Words should receive the feature Foreign=Yes in FEATS and be tagged as X. Sequences of multiple foreign words are joined together by flat (optionally subtyped as flat:foreign). In contrast to Option 2, this is best suited to phrasal idioms, quoted utterances, and metalinguistic mentions. The foreign language, if known, is best made explicit with the OrigLang feature in MISC.

Well , c'est/X la/X vie/X .

" Lehitraot/X " means " see you later " in Hebrew .

" Dans/X cette/X classe/X , " she chided me , " nous/X ne/X parlons/X pas/X anglais/X ! "

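One way to read the three options in this draft is as a decision over the annotation already present on a token. The sketch below encodes that reading. It is an interpretation of the draft, not an official rule, and `analysis_type` and its labels are made up. OrigLang=yi for "kosher" is likewise only an illustrative choice of donor language.

```python
def parse_column(column):
    """Parse a FEATS or MISC column into a dict, ignoring items without '='."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def analysis_type(upos, feats, misc):
    """Guess which of the three analyses a token's annotation reflects."""
    feats, misc = parse_column(feats), parse_column(misc)
    if "Lang" in misc:
        return "code-switched"  # Option 1: analyzed by the source language's guidelines
    if feats.get("Foreign") == "Yes" and upos == "X":
        return "foreign"        # Option 3: unanalyzed foreign material
    if "OrigLang" in misc:
        return "borrowed"       # Option 2 with the donor language recorded
    return "unmarked"           # native word, or borrowing without OrigLang


print(analysis_type("NOUN", "Foreign=Yes|Gender=Masc|Number=Sing", "Lang=fr"))
print(analysis_type("X", "Foreign=Yes", "OrigLang=fr"))
print(analysis_type("ADJ", "Degree=Pos", "OrigLang=yi"))
print(analysis_type("NOUN", "Number=Sing", "_"))
```

Note that a borrowing without OrigLang is indistinguishable from native vocabulary under this scheme, which matches the idea that borrowed words are treated as part of the main language.
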
nschneid commented 11 months ago

Here, the Russian phrase is transcribed from Все будет в порядке. It is certainly not a borrowing. But if we want the code-switching analysis, we must acknowledge that búdět is AUX. Then this transcribed form must still get the Russian lemma быть so that the validator can find it on the list of Russian auxiliaries.

It seems like it would make sense to have a feature indicating an alternate orthography/script. This would be useful for a text containing transliterations or phonetic transcriptions, and for languages where multiple orthographies are used (e.g. Arabic + Arabizi). Perhaps spelling variation as well. Have any treebanks been using such a feature?

dan-zeman commented 11 months ago

Here, the Russian phrase is transcribed from Все будет в порядке. It is certainly not a borrowing. But if we want the code-switching analysis, we must acknowledge that búdět is AUX. Then this transcribed form must still get the Russian lemma быть so that the validator can find it on the list of Russian auxiliaries.

It seems like it would make sense to have a feature indicating an alternate orthography/script. This would be useful for a text containing transliterations or phonetic transcriptions, and for languages where multiple orthographies are used (e.g. Arabic + Arabizi). Perhaps spelling variation as well. Have any treebanks been using such a feature?

We used to have additional values of Foreign but it created confusion and it was used very rarely. If anything like that is done at the feature level, it should be a new feature, so that Foreign stays boolean (Yes or empty).

But it is a wide area and variation can occur at different levels. The above example pertains to one phrase and it is most likely to occur as a citation within another language. Sometimes you have the whole corpus in a specific spelling (for example, Serbian uses Cyrillic or Latin script, and the single Serbian treebank in UD uses Latin; Sanskrit has been written in several different scripts depending on time and location, and in UD we have one treebank in Devanagari and another in Latin-based transcription). And sometimes there are competing orthography standards within one language, and they may be mixed in one treebank. I guess you have this problem with British/American/whatever English, but I suspect it occurs to a much greater extent in minority languages such as Nahuatl or Low Saxon.

nschneid commented 11 months ago

Implemented at https://universaldependencies.org/foreign.html. Do the trees look OK? @dan-zeman, feel free to add features to the Czech borrowing.

jnivre commented 11 months ago

Looks good to me.

Joakim

dan-zeman commented 11 months ago

Looks good to me too. And I have completed the annotation of the Czech example.

Stormur commented 11 months ago

I have some comments about the revised definition.

Cross-lingual content

A treebank may opt to fully analyze the cross-lingual content as if it were in a treebank for the source language. This simulates a speaker with knowledge of the morphosyntax of both of the intermixed languages.

This passage is not clear to me and I fear it is misleading for (new) annotators. In general, the annotation endeavour of UD never simulates a speaker: the addition of linguistic features is actually very much top-down and the typological tags often do not correspond with the "intuition" or the "traditional grammars" of even linguistically aware speakers. So, this is the first reason I would like to see this passage rephrased. Let's not link the annotation to some "intuition-based", "inherent knowledge" framework.

The second reason is that the morphosyntactic knowledge of the speaker is rather irrelevant from the point of view of annotation when dealing with cross-lingual content: I think we just have to distinguish between intact and adapted (see below) material. Then if the material is intact, it simply needs the features of its original language.

This is useful for retrieving interesting information, such as how foreign words are used in a given language, and for detecting trends. For example:

it. Ho partecipato a un safari. / Ho partecipato a molti safari. 'I took part in a safari. / I have taken part in many safaris.'
- safari is a Swahili word which is left unchanged in the singular and in the plural: it is just made to agree with a "standard masculine gender" (un, molti vs fem. una, molte), but that's it. Simply the "basic" form of the original language has been taken in.
it. Hai creato un murales. / Hai dipinto molti murales. 'You have created a wall painting. / You have painted many wall paintings'
- murales comes from Spanish and it is the plural form of ADJ murale 'wall [as nmod]'. For some reason the plural form has been taken in, and is the prevailing one even though some dictionaries still prescribe the form murale (which also exists independently in Italian) as the singular one. This is interesting because we see a trend of taking in some words in the plural: this also happens with silos in Italian, but also as a wider trend in many other languages. We need this information annotated even if the speaker just uses murales etc. as a singular form.

Treebanks have wide latitude to decide what counts as a different language/code and whether to analyze its structure or not.

More generally, in the documentation I would actually like to see stressed that as far as possible this cross-lingual annotation has to be the favoured one, at least in the long term. It simply is the most informative and meaningful one and well, what goes more towards "universality"? Criteria can be defined with a good level of precision. And if a foreign word is not adapted, its belonging to that other language's system is always active, even if quiescent (e.g. cultured Italian speakers occasionally using Länder as a plural of ger. Land, where the singular form would actually be the prescriptive one).

However, this strategy is generally not recommended for proper names mentioned in isolation, such as names of people or places used in the target language.

This passage does not make sense to me and I would suggest removing it. For one, it promotes the presupposed exceptionality of proper names while there is hardly any evidence for it: names of any kind have been kept intact or adapted between any two languages in the world, and so it is desirable to annotate this fact for pairs such as fr. Hortense / it. Ortensia. It simply is a factual distinction. Then, isolation is not a criterion in annotation of cross-lingual content, and might actually represent the prototypical case.

Option 2: Borrowed analysis

However, the donor language may be made explicit with the OrigLang feature in MISC.

Good to finally know the difference between Lang and OrigLang.

Symmetrically to the previous point, I would like to see stressed that this analysis only makes sense for clear cases like the mentioned Czech k exitu, while if the material is left intact as in coup d'état (even keeping the original orthography!) the cross-lingual analysis should be favoured.

Another case that comes to my mind is eng. capish/capeesh/capiche, from Italian capisci /kaˈpiːʃi/ '(do) you understand'. This is totally adapted, as shown even in the orthography, and has become something other than the paradigm-belonging Italian form.

A borrowed expression may also bear target-language modifiers, for example.

I do not see how this line helps, so I would suggest simply removing it. It reads as tautological at least, in the sense that if a word has been borrowed, then of course it can be modified. Conversely, modification can readily happen for non-adapted words, too.

Option 3: Foreign analysis

In contrast to Option 2, this is best suited to phrasal idioms, quoted utterances, and metalinguistic mentions.

I think this can also be misleading. Previously in the documentation page, it is said that "wide latitude" is given to treebanks on how to treat foreign words, then this comes, but I do not see how phrasal idioms etc. can be different from isolated material (as it seems implied now by the first two points).

Again, I would like to see stressed that this 3rd option is simply (at least in the long run) an ad interim solution in absence of a meaningful cross-lingual analysis, and this is independent from the length of the foreign passage.

amir-zeldes commented 11 months ago

In general, the annotation endeavour of UD never simulates a speaker

I would tend to agree with that, maybe this is not a good formulation.

it is desirable to annotate this fact for couples such as fr. Hortense / it. Ortensia

I think proper names ARE exceptional, in that if someone's last name is Takahashi, then their name is not suddenly transformed into "Highbridge" in English (the literal meaning) - it stays the same. In that respect, Hortense is not the same as Ortensia (despite etymology), and I think if my name were Hortense I would say my name in Italian is also Hortense, not Ortensia. I think this is also how Italy would issue me a visa or passport if that was my birth name.

this cross-lingual annotation has to be the favoured one, at least in the long term. It simply is the most informative and meaningful one ... may also bear target-language modifiers ... if the material is left intact as in coup d'état (even keeping the original orthography!) the cross-lingual analysis should be favoured

Not necessarily. The phrase "coup d'état" happens to be nominal in both French and English, but if I say something "has a certain je ne sais quoi", many English speakers would use that in speech as an unanalyzable nominal. Technically the multilingual analysis would regard this as a clause and might be tempted to give it such a deprel, but the modifier "certain" is a good indication that this is not the status of this borrowed item in English. I think the best annotation there is to treat "certain" as amod, "a" as det and "je" as obj of "has" with flat dependents, not ccomp(has,sais) and then det(sais,a).
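
The attachment proposed here can be written out as a small worked example. The sentence "It has a certain je ne sais quoi" and the `dependents` helper are illustrative. Only the relations named above (det, amod, obj, flat) come from the comment, while nsubj and root for the remaining words are filled in as the obvious completion.

```python
# (id, form, head, deprel) for "It has a certain je ne sais quoi"
tree = [
    (1, "It", 2, "nsubj"),
    (2, "has", 0, "root"),
    (3, "a", 5, "det"),
    (4, "certain", 5, "amod"),
    (5, "je", 2, "obj"),     # first word of the borrowed expression heads it
    (6, "ne", 5, "flat"),
    (7, "sais", 5, "flat"),
    (8, "quoi", 5, "flat"),
]


def dependents(tree, head_id):
    """Return (form, deprel) pairs for all words attached to head_id."""
    return [(form, rel) for _, form, head, rel in tree if head == head_id]


print(dependents(tree, 2))  # what "has" governs
print(dependents(tree, 5))  # what "je" governs
```

In this structure "sais" governs nothing, so there is no ccomp(has, sais) and the determiner and adjective attach to the expression as a nominal.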

nschneid commented 11 months ago

To my knowledge, UD doesn't take a position on exactly whose linguistic knowledge is being modeled with trees—the speaker's? hearer's? some average over a speech community? There may be specific treebanks that do seek to model the knowledge of specific individuals (learners, for instance). But I can rephrase the "simulates a speaker" part to clarify that this is just an analogy, not a theoretical claim.

It sounds like you're advocating for treebanks to adopt the code-switching analysis. That may not be practical for all treebanks, though: it may be hard to find annotators familiar with the quoted languages, let alone prepared to apply the annotation guidelines for those languages (which may involve language-specific subtypes etc.). We don't want to encourage low-quality annotation of foreign language material by those who lack the qualifications, polluting the collected UD data in that language. I think the neutral position—that it's up to treebanks to decide—is the right one.

Regarding morphological adaptation: Are you arguing that there should be features indicating a loan word was plural in the source language but singular in the target language? If so that may motivate OrigNumber (alongside OrigLang). But I'm not aware of current treebanks having done this.

Regarding phrasal idioms: This is simply to suggest that e.g. "C'est la vie" has no internal syntax as an idiom of English. I'm not sure it would make sense to pretend it consists of several VERBs, for example.

Stormur commented 11 months ago

I think proper names ARE exceptional, in that if someone's last name is Takahashi, then their name is not suddenly transformed into "Highbridge" in English (the literal meaning) - it stays the same. In that respect, Hortense is not the same as Ortensia (despite etymology), and I think if my name were Hortense I would say my name in Italian is also Hortense, not Ortensia. I think this is also how Italy would issue me a visa or passport if that was my birth name.

Yes, the fashion now is to leave the name as it is, so keep it as a possible foreign word: Takahashi is a Japanese (Lang=ja) name that we use, while Hortense is the French (Lang=fr) version of a name that also has an Italian (Spanish, Rumanian, etc.) form. Some decades ago, it would have been adapted. This is still done here and there, both at official and unofficial level. Every such form keeps its "mark" as a foreign word.

The exceptionality might be at (social) levels of iconicity, saliency, extravagance... but not morphosyntax.

Not necessarily. The phrase "coup d'état" happens to be nominal in both French and English, but if I say something "has a certain je ne sais quoi", many English speakers would use that in speech as an unanalyzable nominal. Technically the multilingual analysis would regard this as a clause and might be tempted to give it such a deprel, but the modifier "certain" is a good indication that this is not the status of this borrowed item in English. I think the best annotation there is to treat "certain" as amod, "a" as det and "je" as obj of "has" with flat dependents, not ccomp(has,sais) and then det(sais,a).

It is no different from the exact equivalent in, say, Italian: ha un certo non so che: non so che 'I do not know what' is a PART-VERB-PRON phrase, it has a predicate, but this does not prevent it from being used as an argument itself, and the meaningful analysis is to make it depend as obj while keeping its internal dependencies. Then, it does not matter if one prefers saying ha un certo je ne sais quoi or ha un certo ich weiß nicht was, etc. I mean, it is in fact treated as a nominal block even in French.

This is way different from other phenomena like the Hungarian muszáj 'must', which comes from ger. muss sein 'it has to be', but has been completely morphosyntactically incorporated into the language.

nschneid commented 11 months ago

Updated the page. There were inconsistent signals regarding titles. Personally I wouldn't mind saying that "Le festin de Babette" is borrowed as a PROPN. But I guess others in the discussion think of titles as more compositional than typical names.

nschneid commented 11 months ago

If, as a speaker of English, I call my Italian friend Marco instead of translating his name to Mark, am I speaking Italian? In some narrow sense, yes. But it doesn't entail that I have any morphosyntactic knowledge of Italian—how names may or may not inflect for case and so on. As a practical matter, a speaker or annotator may not know the name's language of origin or even how to draw a sharp line between "English" and "non-English" names. Also true of place names: do we want to say that "Massachusetts" has a language code for the Massachusett language? I don't think this is remotely practical to implement at scale, so it is simpler to treat such names as borrowings, but if treebank developers have the resources to conduct etymological inquiries, they are welcome to add OrigLang.

Stormur commented 11 months ago

To my knowledge, UD doesn't take a position on exactly whose linguistic knowledge is being modeled with trees—the speaker's? hearer's? some average over a speech community? There may be specific treebanks that do seek to model the knowledge of specific individuals (learners, for instance). But I can rephrase the "simulates a speaker" part to clarify that this is just an analogy, not a theoretical claim.

Well, the very fact that it strives towards a typological approach removes UD's point of view from that of a speaker or a hearer, and in general from the spontaneous use of a natural language. All the discussions taking place in these issues are indirect proof of this...

It sounds like you're advocating for treebanks to adopt the code-switching analysis. That may not be practical for all treebanks, though: it may be hard to find annotators familiar with the quoted languages, let alone prepared to apply the annotation guidelines for those languages (which may involve language-specific subtypes etc.). We don't want to encourage low-quality annotation of foreign language material by those who lack the qualifications, polluting the collected UD data in that language. I think the neutral position—that it's up to treebanks to decide—is the right one.

Yes I do, but at the same time I formulated it as a long-term goal: I think that we should by all means favour this kind of annotation, when it can be done in a sensible way, and present it as the one to aim at. Then, if for many good practical reasons this is not (easily) possible, we still contemplate the "agnostic", "flat X" annotation, as detailed in the guidelines, and as mostly done until now.

Regarding morphological adaptation: Are you arguing that there should be features indicating a loan word was plural in the source language but singular in the target language? If so that may motivate OrigNumber (alongside OrigLang). But I'm not aware of current treebanks having done this.

No. If it is a foreign non-adapted word, only the original language's morphology matters (but see below). To assign a Number=Sing to murales in Italian would be an erroneous backprojection of a syntactic fact inside the foreign morphology of this word (in Italian it would be murali, by the way). This would be a case of contextual annotation. For the same reason I would not assign any Number to safari, unless its form in Swahili clearly encodes one.

I was wondering, though, about possible cases like I like pizzes: here we would observe at the same time an Italian inflection (pl. pizze vs sg. pizza) and an English one (the pl. -s). To handle such cases, I would propose to stick to the original language annotation, all the while adding a layered feature, e.g. Number=Plur|Number[en]=Plur.
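To make the proposal concrete, here is a minimal Python sketch of how a tool could read such a FEATS value and keep the source-language feature apart from the layered one. It assumes the usual CoNLL-U conventions (features separated by `|`, an optional layer in square brackets). The function name `parse_feats` is made up for this illustration, and the `[en]` layer is only the hypothetical suggested above, not an established UD layer.

```python
import re

# One FEATS item: Name=Value or Name[layer]=Value (layer is optional).
FEAT_RE = re.compile(
    r"^(?P<name>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?=(?P<value>[A-Za-z0-9,]+)$"
)


def parse_feats(feats):
    """Return a dict mapping (name, layer) to value.

    `layer` is None for an ordinary, unlayered feature.
    An empty FEATS column ("_") yields an empty dict.
    """
    if feats == "_":
        return {}
    parsed = {}
    for item in feats.split("|"):
        match = FEAT_RE.match(item)
        if match is None:
            raise ValueError(f"malformed feature: {item!r}")
        parsed[(match["name"], match["layer"])] = match["value"]
    return parsed


# The hypothetical annotation of "pizzes" proposed above.
feats = parse_feats("Number=Plur|Number[en]=Plur")
source_number = feats[("Number", None)]   # number in the original language
english_number = feats[("Number", "en")]  # number in the hypothetical [en] layer
```

Keying the dictionary on the (name, layer) pair means the two Number values never overwrite each other, which is the point of using a layer here.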

Regarding phrasal idioms: This is simply to suggest that e.g. "C'est la vie" has no internal syntax as an idiom of English. I'm not sure it would make sense to pretend it consists of several VERBs, for example.

OK, possibly, but I see no reason not to encourage a similar annotation (which in my opinion is the more meaningful one, though admittedly more difficult to achieve). However isolated these idioms might be, they do possess their own (foreign) internal syntax.

Stormur commented 11 months ago

If, as a speaker of English, I call my Italian friend Marco instead of translating his name to Mark, am I speaking Italian? In some narrow sense, yes. But it doesn't entail that I have any morphosyntactic knowledge of Italian—how names may or may not inflect for case and so on. As a practical matter, a speaker or annotator may not know the name's language of origin or even how to draw a sharp line between "English" and "non-English" names.

You would be using an Italian name, and this might indeed be interesting to annotate. Nothing prevents anyone from calling your friend Mark, or maybe, in Latinate fashion, Marcus (and we do in fact observe this happening daily), and so from code-switching with regard to his name, be it for style, as a joke, for conviviality... Again, this is interesting to annotate, if it can be done.

No claims about speaking any one language or being aware of its workings.

Also true of place names: do we want to say that "Massachusetts" has a language code for the Massachusett language? I don't think this is remotely practical to implement at scale, so it is simpler to treat such names as borrowings, but if treebank developers have the resources to conduct etymological inquiries, they are welcome to add OrigLang.

We are focusing on person and place names here, but morphosyntactically they really are not different from any other NOUN, though it is true that they are a particular semantic subcategory that crosslinguistically shows more conservative patterns in how they are adapted in the various languages - and again, this is interesting to annotate!!!

In this specific case I would agree on OrigLang, since this is an adaptation into English. You do not write it Massachuseuck, nor do you pronounce it anything like /məhs at͡ʃəw iːs iː ak/ (source: Wikipedia), not to mention that I think only a few people know it means "at the great hill". So there seems to be no reason to annotate it as a foreign word, and I would not argue for that.