Closed eduarddrenth closed 3 years ago
Hi, thanks for your contribution! :)
For POS Unknown, the canonical tag would be X.
These are not in the list, but the following features are used in more than one language for two of the things you bring up:
For Valency, Valency=1 and Valency=2 are used in Erzya, Chukchi and Bambara.
For Clitics, Clitic=Yes is used in Italian, Galician and Warlpiri.
For Diminutive, Afrikaans (probably closest to Frisian) uses Degree=Dim, while Erzya uses Derivation=Dimin.
For some of the others it would be useful to have examples.
The X is good to know, overlooked it in the past or perhaps it wasn't there yet.
So, in general, what is the proposed way to work with ud?
1) stick as close as possible to what's under pos/index.html and feat/index.html, and 2) for everything that is not there, use your own?
That's what I do, promoting exchange and enabling comparison.
Then, back to what I brought in: exactly which of the features may be of interest to general (not Frisian) UD? Then we can compose some more docs and examples.
Usually what I do is search for examples in the related languages, and if there are none, in other languages, and try to reuse something that has already been used if it looks like the same thing. If it isn't, then I make an issue on GitHub like you have done. Usually I would make one issue per feature unless they were related in some way.
However, it is important to include examples as it is difficult to tell which features may be both missing and in scope without looking at these. E.g. "convertedfroms" looks like derivation, most of which typically isn't covered by UD, but it is hard to say without examples.
There is an automatically generated (twice a year, at release time) list of feature-value pairs used in UD treebanks. It contains both universally-defined and language-specific features.
I don't understand what the feature Suffix would be supposed to encode.
For the 'oblique case', the guidelines actually suggest using Case=Acc. (Although it may sound a bit confusing when confronted with the core-oblique terminology for nominals in UD, where accusatives are typically not oblique.)
suffix: -earje, -earre, -en prefix: in-, op-, anti-
oblique is not that important for us, perhaps we'll drop it some day
Most of the annotations mentioned are figured out here: https://web2.fa.knaw.nl/corpus-frontend/frysk/search. Use extended search and filter on language category MidFrysk.
Many languages use prefixes and suffixes. If UD wanted to explicitly say that a word form contains a prefix, it would apply to most UD languages. But UD does not do that. (Instead, UD encodes the grammatical features that the affix contributes to the word.) Though if the affix is inflectional, it may be indirectly observable when the word form is compared with its lemma.
As for the auxiliaries, some treebanks use a language-specific feature VerbType to mark modal verbs.
I know, but my aux is a feature that distinguishes forms with pos=aux
If the auxiliaries are auxiliary verbs, then verbal features are appropriate for them as well.
Yes, perhaps it is better if I extend VerbType with "tense", drop my "rest" group auxiliary (doesn't add any information). Then I can use verbform.mod and verbform.tense for pos.aux.
Note that for a feature to be valid (even if language-specific), the value must start with a capital English letter.
Which tool is going to validate that? Or is it an agreement or standard? In practice I found it easier and sufficient to do everything in lower-case, so pos.verb, number.sing, etc. I define those values as enumeration in TEI odd/rng/xsd, together with a documenting text, I use these enumerations for both validation and presenting dropdowns with documentation to users.
If you are doing that for yourself/your own project, you are absolutely free to use TEI or any other format you see fit. But if your goal is to make it at some point a part of the UD project, you will have to convert it so that it complies with the UD guidelines. The restrictions on feature names and values are documented in format, and if the data does not comply with that, it will not pass the online validation.
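The naming restriction mentioned above can be approximated with a small local check before data is submitted. This is only a sketch, not the official validator: the two regular expressions are an assumption modelled on the rule that names and values start with a capital letter (values such as Person=2 suggest digits are accepted as well), and `check_feats` is a hypothetical helper name.

```python
import re
from typing import List

# Assumed patterns approximating the UD format rules; NOT the official validator.
FEATURE_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*(\[[a-z0-9]+\])?$")
FEATURE_VALUE = re.compile(r"^[A-Z0-9][A-Za-z0-9]*$")


def check_feats(feats: str) -> List[str]:
    """Return the problems found in a FEATS string such as 'Number=Sing|Person=2'."""
    problems = []
    if feats == "_":  # an underscore means "no features"
        return problems
    for pair in feats.split("|"):
        name, sep, values = pair.partition("=")
        if not sep:
            problems.append(f"missing '=' in {pair!r}")
            continue
        if not FEATURE_NAME.match(name):
            problems.append(f"invalid feature name {name!r}")
        for value in values.split(","):  # several values may be joined by commas
            if not FEATURE_VALUE.match(value):
                problems.append(f"invalid feature value {value!r}")
    return problems


print(check_feats("Number=Sing|Person=2"))  # no problems reported
print(check_feats("number=sing"))           # lower-case name and value are both reported
```

A lower-case scheme such as number.sing would therefore need a conversion step before it could pass a check of this kind.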
At some point we want to contribute Frisian to UD, with documentation and a treebank. IMO a project is required to take that further.
In Groningen some uncoordinated initiatives for a Frisian treebank are under way. There is already a UDPipe-based Frisian POS tagger online and an improved command-line version on the way. We have a UD-based Frisian lexicon and TEI/UD-based Frisian dictionaries. All of these solutions have JSON services promoting interchange.
Benefit (but also work) for me now is that I can create some issues to stick closer to UD. If I want to turn this issue into an official contribution this will also require more effort, perhaps from linguists.
My original idea was to just provide our documented terminology in the hope UD people would cherry-pick. Apparently more effort is required. I don't know if I can organize that.
In addition to the questions and comments above, Frisian has pro-drop, e.g.
Can you come?
can be translated as:
Kinst do komme? Kinst komme? Kinsto komme?
When 'do' is left out, it would be nice to have a feature for the VERB 'kinst' or 'kinsto' that indicates 'pro-drop'. Pro-drop is also found in Italian and Spanish, but it looks as if pro-drop is not indicated in the respective UD corpora of Italian and Spanish.
What do you recommend for Frisian?
To go even one step further .... the Frisian sentence
Witst do datst do grut bist?
which literally translated to English is:
Know you that you tall are?
can be shortened to:
Witst datst grut bist?
In this case both 'Witst' (VERB) and 'datst' (SCONJ) should get a feature like pro-drop=yes. How can this be implemented using the UD tags/features/attributes/codings?
Hi!
I don't think this should be annotated at all, as it is not in Italian, Spanish, and any other: do not annotate what is not there! The "pro drop" will simply result from the fact that there is no word depending as nsubj/csubj. If the verb form (as in Italian, etc.) bears the person (as it seems to be the case for forms like Frisian witst, which might be a reason for the alleged "drop" in the first place), i.e. is annotated for Person, then good. Else, the intended person will be contextual, but is not annotated. There simply is nothing to annotate! Further, I think the problem in a possible ProDrop feature is that it would be a clausal one, rather than tied to a single form.
PS: by the way, the form kinsto does seem to have a person mark, and I wonder if it can be considered "morphological", or expressed as a multiword token.
Both forms are typical Frisian constructs we would like to know of. When these constructs are not annotated in some way, they cannot be recognized or queried for, for example in corpus linguistics.
Things that are not part of "surface syntax", including omitted/elided elements, could be referenced in the Enhanced Dependencies and/or MISC features. Treebanks are free to innovate on the MISC features, so you could put ProDrop=Yes there. I don't know if other languages with pro-drop are doing something like this.
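As a rough illustration of the MISC suggestion, the sketch below appends a key=value pair to the tenth (MISC) column of a tab-separated CoNLL-U word line. The helper name `add_misc` is hypothetical, and the HEAD and DEPREL values in the sample line are placeholders chosen only to make the line complete, not a proposed analysis.

```python
def add_misc(conllu_line: str, key: str, value: str) -> str:
    """Add key=value to the MISC column (10th field) of a CoNLL-U word line."""
    fields = conllu_line.split("\t")
    if len(fields) != 10:
        raise ValueError("expected 10 tab-separated fields")
    # An underscore marks an empty MISC column; otherwise items are joined by '|'.
    items = [] if fields[9] == "_" else fields[9].split("|")
    items.append(f"{key}={value}")
    fields[9] = "|".join(items)
    return "\t".join(fields)


# Illustrative line for 'Kinst' in 'Kinst komme?' (HEAD/DEPREL are placeholders).
line = "\t".join(
    ["1", "Kinst", "kinne", "VERB", "_", "Number=Sing|Person=2", "0", "root", "_", "_"]
)
print(add_misc(line, "ProDrop", "Yes"))
```

Because the information lives in MISC, it stays queryable for corpus work without adding a word or feature to the syntactic annotation itself.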
Is it possible to put this under XPOS as well? I understood XPOS can be used for language-specific features.
kinst can be pro-drop depending on context
- kinsto is pro-clitic
I admit I do not know exactly what pro-clitic means. However, kinsto (which to my eyes resembles the vernacular German form kannste for kannst du 'can you', and also the regular cliticization of þú 'you' in Icelandic, e.g. likar þú > likarðu) seems, compared with kinst, to incorporate a pronoun. Is that it?
So I could envision something along the lines of:
1 kinst kinne VERB Number=Sing|Person=2
vs.
1-2 kinsto _ _
1 kinst kinne VERB Number=Sing|Person=2
2 o do PRON Number=Sing|Person=2|PronType=Prs
that is, the multiword treatment recognizes that there is a clitic which gives the person in addition to the marking on the verb, and the two constructions are different and retrievable.
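To show how the two constructions stay retrievable, here is a minimal sketch that reads abbreviated, whitespace-separated lines like the ones above and lists each multiword token range with the words it covers. The function name `multiword_tokens` is hypothetical, and real CoNLL-U has ten tab-separated columns rather than this shortened layout.

```python
def multiword_tokens(lines):
    """Return (surface_form, [word_forms]) for each range line such as '1-2'."""
    words = {}
    ranges = []
    for line in lines:
        cols = line.split()
        if "-" in cols[0]:  # a range ID marks a multiword token
            start, end = (int(n) for n in cols[0].split("-"))
            ranges.append((start, end, cols[1]))
        else:               # an ordinary syntactic word
            words[int(cols[0])] = cols[1]
    return [
        (form, [words[i] for i in range(start, end + 1)])
        for start, end, form in ranges
    ]


plain = ["1 kinst kinne VERB Number=Sing|Person=2"]
clitic = [
    "1-2 kinsto _ _",
    "1 kinst kinne VERB Number=Sing|Person=2",
    "2 o do PRON Number=Sing|Person=2|PronType=Prs",
]
print(multiword_tokens(plain))   # []
print(multiword_tokens(clitic))  # [('kinsto', ['kinst', 'o'])]
```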
Is it possible to put this under XPOS as well? I understood XPOS can be used for language-specific features.
XPOS is for external annotations, coming for example from a treebank that has been converted into UD, to keep it as a sort of record. But if you want to incorporate this annotation inside UD, then either the FEATS or the MISC fields are the way.
That looks good, and then perhaps:
1-2 kinst _ _
1 kinst kinne VERB Number=Sing|Person=2
2 do PRON Number=Sing|Person=2|PronType=Prs
“Do not annotate things that are not there.”
You must provide a non-empty form for every word in a multi-word token. And you should not attempt to circumvent it by substituting the empty string with an underscore or something.
The form of the multi-word token is not required to be a concatenation of the forms of the individual words. Quite the opposite: the original idea of multi-word tokens was that the individual words will show forms that they would have if they occurred as independent words (thus German zum expands to zu + dem, not to zu + m).
No automatically verifiable rule prevents you from expanding kinst into kinst + do when the do does not occur overtly in the sentence, but I don't think it's a good idea. Besides being ill-guided with respect to UD principles, it also makes it much harder for parsers to learn that kinst is sometimes regarded as a multi-word token and sometimes not.
From the UD perspective I can understand. But this means that linguists will not be able to search for pro-drop, which is really a pity. If I understand correctly, the rule "don't annotate what isn't there" conflicts with the wish to search for pro-drop and other things that aren't there. Am I missing something? Is there a solution?
As pointed out by @nschneid and @Stormur you can use a MISC field annotation. That's what I would recommend as well.
From UD perspective I can understand.
This is a UD issue tracker, so I guess the UD perspective is what matters here :-)
UD is not a framework that accommodates anything and everything a linguist ever wanted to annotate. However, I think searching for pro-drop is still possible and quite simple: ask for verbal nodes that have no nsubj or csubj child, like in this query.
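The idea of such a search can be sketched in a few lines, independent of any particular query tool. Everything here is an assumption for illustration: the dictionary layout, the helper name `subjectless_verbs`, and the toy analyses of 'Kinst do komme?' and 'Kinst komme?' (including VerbForm=Inf on komme). The optional `skip_feats` argument shows how morphological features could narrow the result.

```python
SUBJECT_RELATIONS = {"nsubj", "csubj"}


def subjectless_verbs(sentence, skip_feats=()):
    """Return the forms of VERB nodes that have no nsubj/csubj child.

    sentence: list of dicts with keys id, form, upos, feats, head, deprel.
    skip_feats: 'Feature=Value' strings; verbs carrying any of them are ignored.
    """
    # IDs of all nodes that govern a subject (relation subtypes like nsubj:pass count).
    has_subject = {
        tok["head"]
        for tok in sentence
        if tok["deprel"].split(":")[0] in SUBJECT_RELATIONS
    }
    result = []
    for tok in sentence:
        if tok["upos"] != "VERB" or tok["id"] in has_subject:
            continue
        if set(tok["feats"].split("|")) & set(skip_feats):
            continue
        result.append(tok["form"])
    return result


def word(id_, form, upos, feats, head, deprel):
    return {"id": id_, "form": form, "upos": upos,
            "feats": feats, "head": head, "deprel": deprel}


# Toy trees (illustrative assumptions, not treebank data).
with_pronoun = [
    word(1, "Kinst", "VERB", "Number=Sing|Person=2", 0, "root"),
    word(2, "do", "PRON", "Number=Sing|Person=2|PronType=Prs", 1, "nsubj"),
    word(3, "komme", "VERB", "VerbForm=Inf", 1, "xcomp"),
]
without_pronoun = [
    word(1, "Kinst", "VERB", "Number=Sing|Person=2", 0, "root"),
    word(2, "komme", "VERB", "VerbForm=Inf", 1, "xcomp"),
]

print(subjectless_verbs(with_pronoun))                      # ['komme']
print(subjectless_verbs(without_pronoun))                   # ['Kinst', 'komme']
print(subjectless_verbs(without_pronoun, ["VerbForm=Inf"])) # ['Kinst']
```

Without the feature filter the infinitive is returned as well, so the quality of such a search depends on which morphological features the treebank provides.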
thanks for your patience and pointers
However, I think searching for pro-drop is still possible and quite simple: ask for verbal nodes that have no nsubj or csubj child
To be fair, that can find things other than pro-drop (e.g. you may not be searching for imperatives, or infinitival roots, or other things). Given enough morphological annotations you could probably formulate the right search, but there's nothing wrong IMO with having a MISC annotation for pro-drop if it's useful for someone.
To be fair, that can find things other than pro-drop (e.g. you may not be searching for imperatives
That is true but the morphological features actually give you much more fine-grained options to specify what you are looking for (if they are present; unfortunately, the Frisian-Dutch treebank in UD does not have them).
But I am definitely not against adding some information on pro-drop in MISC.
Work on a Frisian treebank is in progress, which is perhaps good to know.
Dear all,
For our corpora, dictionaries and lexicons we use Universal Dependencies for terminology.
We define them in a TEI ODD, see https://bitbucket.org/fryske-akademy/tei-encoding/src/master/reusables/customization/v2_0/corpora_linguistics.odd
We use the following features but could not find them in UD; perhaps some of these are of use in UD.
Overview:
feature suffix: Boolean feature, Is this a suffix word in a compound, that usually cannot stand on its own?
feature oblique: case other than nominative or vocative
feature aux:
feature pronouns:
feature diminutive:
feature inflections:
feature valencys:
feature construction:
feature convertedfroms:
feature predicate:
value pos.unknown: POS unknown or irrelevant