dan-zeman / interset

Interset is an interlingua for morphosyntactic tag sets, needed in many tasks in natural language processing.

Add features from Universal Dependencies #1

Open dan-zeman opened 8 years ago

dan-zeman commented 8 years ago

The first set of Universal Features, as defined in the Universal Dependencies 1.0 standard, comes from Interset. But there are also language- and treebank-specific extensions: new features and new values of existing features. Some of these were still taken from Interset; others are new and unknown to Interset.

We should define them so that we can read UD files without losing information.
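A rough way to see what would be lost is to scan the FEATS column of a CoNLL-U file and report every feature-value pair that is missing from a known inventory. The sketch below is in Python rather than Interset's own code; `parse_feats`, `unknown_pairs` and the `known` set are hypothetical names used only for illustration, and the real inventory would have to come from Interset itself.

```python
def parse_feats(feats):
    """Split a CoNLL-U FEATS string such as 'Case=Nom|Number=Sing'
    into a list of (feature, value) pairs. '_' means no features."""
    if not feats or feats == "_":
        return []
    pairs = []
    for item in feats.split("|"):
        name, _, values = item.partition("=")
        # A feature may carry several comma-separated values.
        for value in values.split(","):
            pairs.append((name, value))
    return pairs


def unknown_pairs(conllu_lines, known):
    """Return the set of (feature, value) pairs found in the FEATS column
    (6th column of CoNLL-U) that are not in the `known` inventory.
    `known` is a hypothetical stand-in for what Interset can represent."""
    unknown = set()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a regular token line
        for pair in parse_feats(cols[5]):
            if pair not in known:
                unknown.add(pair)
    return unknown
```

Running `unknown_pairs(open(path, encoding="utf-8"), known)` over each treebank would give a per-language list of pairs that still need a definition.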

dan-zeman commented 8 years ago

Language-specific features in UD 1.1 that are currently unknown to Interset (these lists are taken from the validation tool; AFAIK the tool uses lists that Filip collected from the data):

bg: Number=Ptan Number=Count ... what is this? It is not a mass noun, is it?

da: PartType=Inf Foreign=Yes

fi: Clitic=Han Clitic=Ka Clitic=Kaan Clitic=Kin Clitic=Ko Clitic=Pa Clitic=S Connegative=Yes Derivation=Inen Derivation=Ja Derivation=Lainen Derivation=Llinen Derivation=Minen Derivation=Sti Derivation=Tar Derivation=Ton Derivation=Ttaa Derivation=Ttain Derivation=U Derivation=Vs InfForm=1 InfForm=2 InfForm=3 PartForm=Agt PartForm=Neg PartForm=Past PartForm=Pres

he: HebBinyan=HIFIL HebBinyan=HITPAEL HebBinyan=HUFAL HebBinyan=NIFAL HebBinyan=PAAL HebBinyan=PIEL HebBinyan=PUAL HebExistential=True HebSource=ConvUncertainHead HebSource=ConvUncertainLabel Negtive=Neg Prefix=Yes VerbType=Cop VerbType=Mod Xtra=Junk

hu: Number[psed]=None Number[psed]=Sing Number[psor]=None Number[psor]=Plur Number[psor]=Sing Person[psor]=1 Person[psor]=3 Person[psor]=None

sl: Variant=Bound

ud (Universal Dependencies; even here there are values that have not been approved yet, perhaps some experiments?): Animacy=Anim Animacy=Inan Aspect=Freq Aspect=FreqMod Aspect=Imp Aspect=Mod Aspect=None Aspect=Perf Case=Abe Case=Abl Case=Abs Case=Acc Case=Ade Case=All Case=Cau Case=Com Case=Dat Case=Del Case=Dis Case=Ela Case=Ess Case=Gen Case=Ill Case=Ine Case=Ins Case=Loc Case=Lat Case=Nom Case=Par Case=Sub Case=Sup Case=Tem Case=Ter Case=Tra Case=Voc Definite=2 Definite=Def Definite=Red Definite=Ind Degree=Cmp Degree=Comp Degree=None Degree=Pos Degree=Sup Degree=Abs Gender=Com Gender=Fem Gender=Masc Gender=Neut Mood=Cnd Mood=Imp Mood=Ind Mood=N Mood=Pot Mood=Sub Mood=Opt Negative=Neg Negative=Pos Negative=Yes Number=Com Number=Dual Number=None Number=Plur Number=Sing NumType=Card NumType=Dist NumType=Frac NumType=Gen NumType=Mult NumType=None NumType=Ord NumType=Sets Person=1 Person=2 Person=3 Person=None Poss=Yes PronType=AdvPart PronType=Art PronType=Default PronType=Dem PronType=Ind PronType=Int PronType=Neg PronType=Prs PronType=Rcp PronType=Rel PronType=Tot PronType=Clit Reflex=Yes Tense=Fut Tense=Imp Tense=Past Tense=Pres VerbForm=Fin VerbForm=Ger VerbForm=Inf VerbForm=None VerbForm=Part VerbForm=PartFut VerbForm=PartPast VerbForm=PartPres VerbForm=Sup VerbForm=Trans Voice=Act Voice=Cau Voice=Pass
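To review a flat list like this one feature at a time, it may help to group the values by feature name. The following is a minimal Python sketch, not part of Interset or the validation tool; `group_values` and the `PAIR` pattern are hypothetical helpers, and the pattern assumes feature names consist of letters with an optional bracketed layer such as `[psor]`.

```python
import re
from collections import defaultdict

# Feature name (letters, optionally followed by a layer like [psor]),
# then '=', then the value. Tokens without '=' (e.g. 'hu:') are ignored.
PAIR = re.compile(r"^([A-Za-z]+(?:\[[a-z]+\])?)=(\S+)$")


def group_values(text):
    """Collect whitespace-separated Feature=Value tokens into a dict
    mapping each feature name to a sorted list of its values."""
    inventory = defaultdict(set)
    for token in text.split():
        match = PAIR.match(token)
        if match:
            inventory[match.group(1)].add(match.group(2))
    return {name: sorted(values) for name, values in inventory.items()}
```

Calling `group_values` on the text of one language's line returns, for example, all `Derivation` values for Finnish in a single entry, which is easier to compare against an existing feature definition.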

dan-zeman commented 8 years ago

https://github.com/UniversalDependencies/docs/issues/252#issuecomment-185285306 will lead to a change in Turkish from tense=nar to a new evidentiality feature. Also note that Turkish data have been available since UD 1.3, and we should look for the features there: https://github.com/UniversalDependencies/UD_Turkish
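On the UD side, such a change amounts to rewriting one feature-value pair in the FEATS string. The Python sketch below shows the idea under loud assumptions: the spellings `Tense=Nar` and `Evident=Nfh` are only defaults chosen for illustration (the linked discussion decides the actual names), `remap_feats` is a hypothetical helper, and the input is assumed to be a well-formed FEATS string.

```python
def remap_feats(feats, old=("Tense", "Nar"), new=("Evident", "Nfh")):
    """Replace the pair `old` with the pair `new` in a CoNLL-U FEATS string.

    Both defaults are assumptions for illustration, not confirmed names.
    The result is re-sorted by feature name (case-insensitively), which is
    the order in which CoNLL-U lists features.
    """
    if feats == "_":
        return feats  # no features on this token
    pairs = [tuple(item.split("=", 1)) for item in feats.split("|")]
    pairs = [new if pair == old else pair for pair in pairs]
    pairs.sort(key=lambda pair: pair[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)
```

For instance, `remap_feats("Aspect=Perf|Person=3|Tense=Nar")` yields `Aspect=Perf|Evident=Nfh|Person=3`, while strings without the old pair come back unchanged.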