UniversalDependencies / UD_Italian-TWITTIRO

Mis-lemmatization of gli #7

Closed AngledLuffa closed 2 years ago

AngledLuffa commented 2 years ago

There are several instances where gli as a PRON is lemmatized as lo, which I believe to be incorrect. The standard used in most of this dataset, along with VIT and ISDT, is to lemmatize it as gli.

For example:

# sent_id = test_49
# text = Mario Monti per la manovra ha chiesto tempo al Tempo, ma il Tempo gli ha risposto Non ne ho più. E lui si è incazzato #supermario
# twittiro = EXPLICIT   EX:CONTEXT SHIFT
16      gli     gli     PRON    PC      Clitic=Yes|Gender=Masc|Number=Sing|Person=3|PronType=Prs        18      iobj    _       _

vs

# sent_id = test_132
# text = Gli amici di Rouhani, per il suo compleanno, gli fanno la sorpresa della ragazza che entra nella torta. [@user]
# twittiro = IMPLICIT   OTHER
11      gli     lo      PRON    PC      Clitic=Yes|Gender=Masc|Number=Sing|Person=3|PronType=Prs        12      iobj    _       _

There are three other instances of gli/lo in the train section and one of gli/il:

# sent_id = train_691
# text = Ma tutti voi che twittate pubblicizzando #labuonascuola ci fate o ci siete? O vi paga Renzi per fargli propaganda?
# twittiro = EXPLICIT   RHETORICAL QUESTION
20      gli     il      PRON    PC      Clitic=Yes|PronType=Prs 19      iobj    _       _

AngledLuffa commented 2 years ago

There are also a few instances of la/lo or la/il, which again are not standard in this dataset or others. Most of the time, la is lemmatized as la.

le/lo occurs a few times instead of le/le

li/lo instead of li/li

lo/il occurs twice; usually it is lo/lo

se/si occurs twice instead of se/se

I can provide a PR for these items if that will help
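
For reference, here is a minimal sketch of how form/lemma pairs like the ones above could be surfaced from a CoNLL-U file. It is not the script actually used for this report; the function names `count_form_lemma_pairs` and `minority_lemmas` are hypothetical, and the sample rows are simplified versions of the test_49 and test_132 tokens with most columns replaced by `_`. It assumes the standard 10-column tab-separated CoNLL-U layout and treats the most frequent lemma of each form as the expected one.

```python
from collections import Counter


def count_form_lemma_pairs(conllu_text, upos="PRON"):
    """Count (lowercased form, lemma) pairs for tokens with the given UPOS."""
    pairs = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines such as "# sent_id = ..."
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, lemma, tok_upos = cols[0], cols[1], cols[2], cols[3]
        # Skip multiword token ranges (e.g. "19-20") and empty nodes (e.g. "5.1")
        if "-" in tok_id or "." in tok_id:
            continue
        if tok_upos == upos:
            pairs[(form.lower(), lemma)] += 1
    return pairs


def minority_lemmas(pairs):
    """For each form with more than one lemma, return the lemmas other than
    the most frequent one (ties are broken by first occurrence)."""
    by_form = {}
    for (form, lemma), n in pairs.items():
        by_form.setdefault(form, Counter())[lemma] = n
    flagged = {}
    for form, lemmas in by_form.items():
        if len(lemmas) > 1:
            top_lemma, _ = lemmas.most_common(1)[0]
            flagged[form] = {l: n for l, n in lemmas.items() if l != top_lemma}
    return flagged


# Simplified sample rows (assumption: only ID, FORM, LEMMA, UPOS are filled in)
SAMPLE = "\n".join([
    "# sent_id = test_49",
    "\t".join(["16", "gli", "gli", "PRON", "PC", "_", "18", "iobj", "_", "_"]),
    "",
    "# sent_id = test_132",
    "\t".join(["11", "gli", "lo", "PRON", "PC", "_", "12", "iobj", "_", "_"]),
    "",
    "# sent_id = extra_toy_row",
    "\t".join(["3", "gli", "gli", "PRON", "PC", "_", "4", "iobj", "_", "_"]),
])

if __name__ == "__main__":
    for form, lemmas in minority_lemmas(count_form_lemma_pairs(SAMPLE)).items():
        print(form, dict(lemmas))
```

Frequency is only a heuristic here: a minority lemma is a candidate for review, not necessarily an error, so each flagged pair would still need to be checked against the convention used in VIT and ISDT.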

AngledLuffa commented 2 years ago

Excellent, thanks!