UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

features to be added? #735

Closed eduarddrenth closed 3 years ago

eduarddrenth commented 4 years ago

Dear all,

For our corpora, dictionaries and lexicons we use universaldependencies for terminology.

We define them in a tei odd, see https://bitbucket.org/fryske-akademy/tei-encoding/src/master/reusables/customization/v2_0/corpora_linguistics.odd

We use the following features but could not find them in UD; perhaps some of them are of use in UD.

Overview:

feature suffix: Boolean feature. Is this a suffix word in a compound that usually cannot stand on its own?

feature oblique: case other than nominative or vocative

feature aux:

value rest:
Rest group auxiliary.

value tense:
Tense auxiliary.

value mod:
Modal verbs that count as auxiliaries.

feature pronouns:

value drop:
pronoun drop, omission of pronouns because they can be inferred

value clitic:
pronoun clitic, most personal pronouns have a clitic form, which is the result of either vowel deletion, vowel reduction, monophthongization or schwa deletion, while there are also cases of suppletion.

feature diminutive:

value dim:
diminutive

feature inflections:

value infl:
inflected

value uninf:
uninflected

feature valencys:

value mtran:
a monotransitive verb takes two arguments (of which one object)

value tran:
a transitive verb requires one or more objects

value intran:
an intransitive verb takes one argument (no object)

value ditran:
a ditransitive verb takes three arguments (of which a direct and an indirect object)

feature construction:

value attr:
attributive

feature convertedfroms:

value adj:
adjective used as another category

value adv:
adverb used as another category

value ver:
verb used as another category

value num:
numeral used as another category

value pro:
pronoun used as another category

value part:
verbform part used as another category

feature predicate:

value pred:
statement about the subject

value pos.unknown: POS unknown or irrelevant

ftyers commented 4 years ago

Hi, thanks for your contribution! :)

For POS Unknown, the canonical tag would be X.

These are not in the list, but the following features are used for two of the things you bring up in more than one language:

For Valency, Valency=1, Valency=2 are used in Erzya, Chukchi and Bambara.

For Clitics, Clitic=Yes is used in Italian, Galician and Warlpiri.

For Diminutive, Afrikaans (probably closest to Frisian) uses Degree=Dim, while Erzya uses Derivation=Dimin.

For some of the others it would be useful to have examples.

eduarddrenth commented 4 years ago

The X is good to know, overlooked it in the past or perhaps it wasn't there yet.

So, in general, what is the proposed way to work with UD?

1) stick as close as possible to what's under pos/index.html and feat/index.html
2) for all that is not there, use your own?

That's what I do, promoting exchange and enabling comparison.

Then, back to what I brought in: exactly which of the features may be of interest to general (not Frisian-specific) UD? Then we can compose some more docs and examples.

ftyers commented 4 years ago

Usually I search for examples in related languages and, failing that, in other languages, and try to reuse something that has already been used if it looks like the same thing. If it isn't, then I make an issue on GitHub like you have done. Usually I would make one issue per feature unless they were related in some way.

However, it is important to include examples as it is difficult to tell which features may be both missing and in scope without looking at these. E.g. "convertedfroms" looks like derivation, most of which typically isn't covered by UD, but it is hard to say without examples.

dan-zeman commented 3 years ago

There is an automatically generated (twice a year, at release time) list of feature-value pairs used in UD treebanks. It contains both universally-defined and language-specific features.
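To illustrate how such a list could be compiled, here is a minimal sketch that counts feature-value pairs in CoNLL-U text. It assumes the standard ten tab-separated columns with FEATS in the sixth column. The function name `count_feature_values`, the sample sentence and its dependency analysis are made up for this example; this is not the script UD itself uses.

```python
from collections import Counter


def count_feature_values(conllu_text):
    """Count Feature=Value pairs found in the FEATS column (6th field)."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or cols[5] == "_":
            continue
        for pair in cols[5].split("|"):
            counts[pair] += 1
    return counts


# Hypothetical sample; the tags and relations are illustrative only.
SAMPLE = "\n".join([
    "# text = Kinst komme?",
    "1\tKinst\tkinne\tVERB\t_\tNumber=Sing|Person=2\t0\troot\t_\t_",
    "2\tkomme\tkomme\tVERB\t_\tVerbForm=Inf\t1\txcomp\t_\tSpaceAfter=No",
    "3\t?\t?\tPUNCT\t_\t_\t1\tpunct\t_\t_",
])

pair_counts = count_feature_values(SAMPLE)
```

Running the same function over every treebank file and merging the counters would give a rough per-language inventory of the features in use.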

I don't understand what the feature Suffix would be supposed to encode.

For the 'oblique case', the guidelines actually suggest using Case=Acc. (Although it may sound a bit confusing when confronted with the core-oblique terminology for nominals in UD, where accusatives are typically not oblique.)

eduarddrenth commented 3 years ago

suffix: -earje, -earre, -en
prefix: in-, op-, anti-

oblique is not that important for us, perhaps we'll drop it some day

Most of the annotations mentioned are figured out here: https://web2.fa.knaw.nl/corpus-frontend/frysk/search. Use extended search and filter on language category MidFrysk.

dan-zeman commented 3 years ago

Many languages use prefixes and suffixes. If UD wanted to explicitly say that a word form contains a prefix, it would apply to most UD languages. But UD does not do that. (Instead, UD encodes the grammatical features that the affix contributes to the word.) Though if the affix is inflectional, it may be indirectly observable when the word form is compared with its lemma.

dan-zeman commented 3 years ago

As for the auxiliaries, some treebanks use a language-specific feature VerbType to mark modal verbs.

eduarddrenth commented 3 years ago

I know, but my aux is a feature that distinguishes forms with pos=aux

On Sat, 14 Nov 2020 at 17:44, Dan Zeman notifications@github.com wrote:

As for the auxiliaries, some treebanks use a language-specific feature VerbType https://universaldependencies.org/u/feat/VerbType.html to mark modal verbs.


dan-zeman commented 3 years ago

If the auxiliaries are auxiliary verbs, then verbal features are appropriate for them as well.

eduarddrenth commented 3 years ago

Yes, perhaps it is better if I extend VerbType with "tense", drop my "rest" group auxiliary (doesn't add any information). Then I can use verbform.mod and verbform.tense for pos.aux.

dan-zeman commented 3 years ago

Note that for a feature to be valid (even if language-specific), the value must start with a capital English letter.

eduarddrenth commented 3 years ago

Which tool is going to validate that? Or is it an agreement or standard? In practice I found it easier and sufficient to do everything in lower-case, so pos.verb, number.sing, etc. I define those values as enumeration in TEI odd/rng/xsd, together with a documenting text, I use these enumerations for both validation and presenting dropdowns with documentation to users.

dan-zeman commented 3 years ago

If you are doing that for yourself/your own project, you are absolutely free to use TEI or any other format you see fit. But if your goal is to make it at some point a part of the UD project, you will have to convert it so that it complies with the UD guidelines. The restrictions on feature names and values are documented in format, and if the data does not comply with that, it will not pass the online validation.
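As a rough illustration of the kind of check involved, the sketch below tests a FEATS string against two regular expressions. The patterns are an assumption modelled on the capital-letter rule mentioned above and on my reading of the CoNLL-U format description (a value may also start with a digit, as in Person=2). The official validator is the authority, and `feats_problems` is a hypothetical helper name.

```python
import re

# Assumed patterns; check the UD format documentation before relying on them.
FEATURE_NAME = re.compile(r"[A-Z][A-Za-z0-9]*(\[[a-z0-9]+\])?")
FEATURE_VALUE = re.compile(r"[A-Z0-9][A-Za-z0-9]*")


def feats_problems(feats):
    """Return a list of human-readable problems found in a FEATS string."""
    if feats == "_":
        return []
    problems = []
    for pair in feats.split("|"):
        name, sep, values = pair.partition("=")
        if not sep:
            problems.append(f"missing '=' in {pair!r}")
            continue
        if not FEATURE_NAME.fullmatch(name):
            problems.append(f"bad feature name {name!r}")
        # Several values of one feature are separated by commas.
        for value in values.split(","):
            if not FEATURE_VALUE.fullmatch(value):
                problems.append(f"bad feature value {value!r}")
    return problems
```

Under these assumed patterns, `feats_problems("Number=Sing|Person=2")` returns an empty list, while an all-lower-case pair such as `number=sing` is reported for both its name and its value.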

eduarddrenth commented 3 years ago

At some point we want to contribute Frisian to UD: documentation and a treebank. In my opinion a project is required to take that further.

In Groningen some uncoordinated initiatives for a Frisian treebank are on the way. There is already a UDPipe-based Frisian POS tagger online and an improved command-line version on the way. We have a UD-based Frisian lexicon and TEI/UD-based Frisian dictionaries. All of these solutions have JSON services promoting interchange.

Benefit (but also work) for me now is that I can create some issues to stick closer to UD. If I want to turn this issue into an official contribution this will also require more effort, perhaps from linguists.

My original idea was to just provide our documented terminology in the hope UD people would cherry-pick. Apparently more effort is required. I don't know if I can organize that.

heeringa0 commented 2 years ago

In addition to the questions and comments above, Frisian has pro-drop, e.g.

Can you come?

can be translated as:

Kinst do komme?
Kinst komme?
Kinsto komme?

When 'do' is left out, it would be nice to have a feature for the VERB 'kinst' or 'kinsto' that indicates 'pro-drop'. Pro-drop is also found in Italian and Spanish, but it looks as if pro-drop is not indicated in the respective UD corpora of Italian and Spanish.

What do you recommend for Frisian?

To go even one step further .... the Frisian sentence

Witst do datst do grut bist?

which literally translated to English is:

Know you that you tall are?

can be shortened to:

Witst datst grut bist?

In this case both 'Witst' (VERB) and 'datst' (SCONJ) should get a feature like pro-drop=yes. How can this be implemented using the UD tags/features/attributes/codings?

Stormur commented 2 years ago

Hi!

I don't think this should be annotated at all, just as it is not in Italian, Spanish, or any other language: do not annotate what is not there! The "pro drop" will simply result from the fact that there is no word depending as nsubj/csubj. If the verb form (as in Italian, etc.) bears the person (as seems to be the case for forms like Frisian witst, which might be a reason for the alleged "drop" in the first place), i.e. is annotated for Person, then good. Otherwise, the intended person will be contextual, but is not annotated. There simply is nothing to annotate! Further, I think the problem with a possible ProDrop feature is that it would be a clausal one, rather than tied to a single form.

PS: by the way, the form kinsto does seem to have a person mark, and I wonder if it can be considered "morphological", or expressed as a multiword token.

eduarddrenth commented 2 years ago

Both forms are typical Frisian constructs we would like to know of. When these constructs are not annotated in some way, they cannot be recognized or queried for in, for example, corpus linguistics.

nschneid commented 2 years ago

Things that are not part of "surface syntax", including omitted/elided elements, could be referenced in the Enhanced Dependencies and/or MISC features. Treebanks are free to innovate on the MISC features, so you could put ProDrop=Yes there. I don't know if other languages with pro-drop are doing something like this.
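A minimal sketch of what adding such a MISC annotation could look like when post-processing a CoNLL-U file. `ProDrop=Yes` is the key suggested above, not an established UD convention. The helper `add_misc`, the sample token line and the choice to sort the MISC entries are assumptions made for this example.

```python
def add_misc(line, key, value):
    """Return a CoNLL-U token line with key=value set in MISC (10th field)."""
    cols = line.split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated columns")
    pairs = [] if cols[9] == "_" else cols[9].split("|")
    # Drop an earlier value for the same key so the call is idempotent.
    pairs = [p for p in pairs if not p.startswith(key + "=")]
    pairs.append(f"{key}={value}")
    cols[9] = "|".join(sorted(pairs))
    return "\t".join(cols)


# Hypothetical token line for 'Kinst' in 'Kinst komme?'.
token = "1\tKinst\tkinne\tVERB\t_\tNumber=Sing|Person=2\t0\troot\t_\t_"
tagged = add_misc(token, "ProDrop", "Yes")
```

Because the annotation lives in MISC, the FEATS column stays within the validated feature inventory while the construction remains searchable.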

heeringa0 commented 2 years ago

Is it possible to put this under XPOS as well? I understood XPOS can be used for language-specific features.

Stormur commented 2 years ago

• kinst can be pro-drop depending on context
• kinsto is pro-clitic

Both forms are typical Frisian constructs we would like to know of. When these constructs are not annotated in some way, they cannot be recognized or queried for in, for example, corpus linguistics.

I admit I do not know exactly what pro-clitic means. However, kinsto (which to my eyes resembles the vernacular German form kannste for kannst du 'can you', and also the regular cliticization of þú 'you' in Icelandic, e.g. likar þú > likarðu), compared with kinst, seems to incorporate a pronoun. Is that it?

So I could envision something along the lines of:

1    kinst    kinne    VERB    Number=Sing|Person=2

vs.

1-2    kinsto    _    _
1    kinst    kinne    VERB    Number=Sing|Person=2
2    o    do    PRON    Number=Sing|Person=2|PronType=Prs

that is, the multiword treatment recognizes that there is a clitic which gives the person in addition to the marking on the verb, and the two constructions are different and retrievable.

Stormur commented 2 years ago

Is it possible to put this under XPOS as well? I understood XPOS can be used for language-specific features.

XPOS is for external annotations, coming for example from a treebank that has been converted into UD, to keep it as a sort of record. But if you want to incorporate this annotation inside UD, then either the FEATS or the MISC fields are the way.

eduarddrenth commented 1 year ago

• kinst can be pro-drop depending on context
• kinsto is pro-clitic

Both forms are typical Frisian constructs we would like to know of. When these constructs are not annotated in some way, they cannot be recognized or queried for in, for example, corpus linguistics.

I admit I do not know exactly what pro-clitic means. However, kinsto (which to my eyes resembles the vernacular German form kannste for kannst du 'can you', and also the regular cliticization of þú 'you' in Icelandic, e.g. likar þú > likarðu), compared with kinst, seems to incorporate a pronoun. Is that it?

So I could envision something along the lines of:

1    kinst    kinne    VERB    Number=Sing|Person=2

vs.

1-2    kinsto    _    _
1    kinst    kinne    VERB    Number=Sing|Person=2
2    o    do    PRON    Number=Sing|Person=2|PronType=Prs

that is, the multiword treatment recognizes that there is a clitic which gives the person in addition to the marking on the verb, and the two constructions are different and retrievable.

That looks good, and then perhaps:

1-2    kinst    _    _
1    kinst    kinne    VERB    Number=Sing|Person=2
2        do    PRON    Number=Sing|Person=2|PronType=Prs

dan-zeman commented 1 year ago

So I could envision something along the lines of:

1    kinst    kinne    VERB    Number=Sing|Person=2

vs.

1-2    kinsto    _    _
1    kinst    kinne    VERB    Number=Sing|Person=2
2    o    do    PRON    Number=Sing|Person=2|PronType=Prs

that is, the multiword treatment recognizes that there is a clitic which gives the person in addition to the marking on the verb, and the two constructions are different and retrievable.

That looks good, and then perhaps:

1-2    kinst    _    _
1    kinst    kinne    VERB    Number=Sing|Person=2
2        do    PRON    Number=Sing|Person=2|PronType=Prs

“Do not annotate things that are not there.”

You must provide a non-empty form for every word in a multi-word token. And you should not attempt to circumvent it by substituting the empty string with an underscore or something.

The form of the multi-word token is not required to be a concatenation of the forms of the individual words. Quite the opposite: the original idea of multi-word tokens was that the individual words will show forms that they would have if they occurred as independent words (thus German zum expands to zu + dem, not to zu + m).

No automatically verifiable rule prevents you from expanding kinst into kinst + do when the do does not occur overtly in the sentence, but I don't think it's a good idea. Besides being misguided with respect to UD principles, it also makes it much harder for parsers to learn that kinst is sometimes regarded as a multi-word token and sometimes not.
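The non-empty-form requirement can be checked mechanically. The sketch below is an illustration only, not the official validator: it flags words inside a multi-word token range whose FORM is empty or a bare underscore. Treating "_" as suspicious follows the remark above and is an assumption of this example, as are the function name `empty_forms_in_mwt` and the ten-column sample lines with their head and relation values.

```python
def empty_forms_in_mwt(lines):
    """Return IDs of words in a multi-word token range with an empty or '_' FORM."""
    covered = set()
    flagged = []
    for line in lines:
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Range line such as '1-2': remember which word IDs it spans.
            start, end = (int(part) for part in tok_id.split("-"))
            covered.update(range(start, end + 1))
        elif tok_id.isdigit() and int(tok_id) in covered:
            if form in ("", "_"):
                flagged.append(int(tok_id))
    return flagged


# The overt clitic proposal from this thread, padded to ten columns.
overt = [
    "1-2\tkinsto\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tkinst\tkinne\tVERB\t_\tNumber=Sing|Person=2\t0\troot\t_\t_",
    "2\to\tdo\tPRON\t_\tNumber=Sing|Person=2|PronType=Prs\t1\tnsubj\t_\t_",
]
# The variant with an empty form for the unexpressed pronoun.
covert = [
    "1-2\tkinst\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tkinst\tkinne\tVERB\t_\tNumber=Sing|Person=2\t0\troot\t_\t_",
    "2\t\tdo\tPRON\t_\tNumber=Sing|Person=2|PronType=Prs\t1\tnsubj\t_\t_",
]
```

Here `empty_forms_in_mwt(overt)` finds nothing, while `empty_forms_in_mwt(covert)` flags word 2.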

eduarddrenth commented 1 year ago

From UD perspective I can understand. But this means that linguists will not be able to search for pro-drop, which is really a pity. If I understand correctly, the rule "don't annotate what isn't there" conflicts with the wish to search for pro-drop and other things that aren't there. Am I missing something? Is there a solution?

amir-zeldes commented 1 year ago

As pointed out by @nschneid and @Stormur you can use a MISC field annotation. That's what I would recommend as well.

dan-zeman commented 1 year ago

From UD perspective I can understand.

This is a UD issue tracker, so I guess the UD perspective is what matters here :-)

UD is not a framework that accommodates anything and everything a linguist ever wanted to annotate. However, I think searching for pro-drop is still possible and quite simple: ask for verbal nodes that have no nsubj or csubj child, like in this query.
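The search described here can be sketched in a few lines over plain CoNLL-U, without a query engine. This is a hedged illustration, not the linked query: the function name `verbs_without_subject` and the two sample sentences, including their dependency analysis, are assumptions made for the example.

```python
SUBJECT_RELATIONS = ("nsubj", "csubj")


def verbs_without_subject(sentence_lines):
    """Return FORMs of VERB nodes that have no nsubj/csubj dependent."""
    words = []
    for line in sentence_lines:
        cols = line.split("\t")
        # Keep regular word lines only (no comments, ranges or empty nodes).
        if len(cols) == 10 and cols[0].isdigit():
            words.append(cols)
    has_subject = set()
    for cols in words:
        # Strip subtypes such as nsubj:pass before comparing.
        if cols[7].split(":")[0] in SUBJECT_RELATIONS:
            has_subject.add(cols[6])
    return [cols[1] for cols in words
            if cols[3] == "VERB" and cols[0] not in has_subject]


WITH_PRONOUN = [
    "# text = Kinst do komme?",
    "1\tKinst\tkinne\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tdo\tdo\tPRON\t_\t_\t1\tnsubj\t_\t_",
    "3\tkomme\tkomme\tVERB\t_\t_\t1\txcomp\t_\tSpaceAfter=No",
    "4\t?\t?\tPUNCT\t_\t_\t1\tpunct\t_\t_",
]
WITHOUT_PRONOUN = [
    "# text = Kinst komme?",
    "1\tKinst\tkinne\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tkomme\tkomme\tVERB\t_\t_\t1\txcomp\t_\tSpaceAfter=No",
    "3\t?\t?\tPUNCT\t_\t_\t1\tpunct\t_\t_",
]
```

In the first sentence only komme is returned; in the second both Kinst and komme are. The infinitive matches in both cases, so a search based on dependency relations alone over-generates.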

eduarddrenth commented 1 year ago

thanks for your patience and pointers

amir-zeldes commented 1 year ago

However, I think searching for pro-drop is still possible and quite simple: ask for verbal nodes that have no nsubj or csubj child

To be fair, that can find things other than pro-drop (e.g. you may not be searching for imperatives, or infinitival roots, or other things). Given enough morphological annotations you could probably formulate the right search, but there's nothing wrong IMO with having a MISC annotation for pro-drop if it's useful for someone.

dan-zeman commented 1 year ago

To be fair, that can find things other than pro-drop (e.g. you may not be searching for imperatives

That is true but the morphological features actually give you much more fine-grained options to specify what you are looking for (if they are present; unfortunately, the Frisian-Dutch treebank in UD does not have them).
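A sketch of how morphological features could narrow such a search, assuming the treebank annotates VerbForm and Mood. The filter keeps only finite, non-imperative verbs that have no subject dependent. The function names and the sample sentence are hypothetical, and a verb with no VerbForm annotation is silently skipped.

```python
def parse_feats(feats):
    """Turn 'A=x|B=y' into {'A': 'x', 'B': 'y'}; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def finite_verbs_without_subject(sentence_lines):
    """Return FORMs of finite, non-imperative VERBs lacking nsubj/csubj."""
    words = [cols for cols in (line.split("\t") for line in sentence_lines)
             if len(cols) == 10 and cols[0].isdigit()]
    # IDs of heads that do have a subject dependent.
    with_subject = {cols[6] for cols in words
                    if cols[7].split(":")[0] in ("nsubj", "csubj")}
    hits = []
    for cols in words:
        feats = parse_feats(cols[5])
        if (cols[3] == "VERB"
                and cols[0] not in with_subject
                and feats.get("VerbForm") == "Fin"
                and feats.get("Mood") != "Imp"):
            hits.append(cols[1])
    return hits


# Hypothetical annotation of 'Kinst komme?' with verbal features.
SENTENCE = [
    "1\tKinst\tkinne\tVERB\t_\tNumber=Sing|Person=2|VerbForm=Fin\t0\troot\t_\t_",
    "2\tkomme\tkomme\tVERB\t_\tVerbForm=Inf\t1\txcomp\t_\tSpaceAfter=No",
    "3\t?\t?\tPUNCT\t_\t_\t1\tpunct\t_\t_",
]
```

With these features present, `finite_verbs_without_subject(SENTENCE)` returns only Kinst; the infinitive komme is filtered out.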

But I am definitely not against adding some information on pro-drop in MISC.

eduarddrenth commented 1 year ago

Work on a Frisian treebank is in progress, perhaps good to know.