UniversalDependencies / UD_Portuguese-GSD

Brazilian Portuguese data from the Google Universal Dependency Treebanks 2.0.

adding lemmas #8

Open arademaker opened 6 years ago

arademaker commented 6 years ago

I would like to add lemmas to this corpus. Let us use this issue to discuss possible alternatives for that.

dan-zeman commented 6 years ago

Definitely! (Missing lemmas are another reason why I do not want to see the treebank in the shared task. We are evaluating lemmatization this year.)

I think you used a freely available morphological analyzer to re-tag (and lemmatize?) Bosque, right? Could you apply the same pipeline to GSD? Would disambiguation involve a lot of manual work?

marcospln commented 6 years ago

I could help with this task (not in the short-term, however).

ceramisch commented 6 years ago

I generated automatic LEMMAS + morphological FEATS for this GSD treebank by running UDPipe with the 2.0 models trained on Bosque, because the treebank will be released as part of the MWE-annotated corpora in the PARSEME shared task. But I'm not sure this would be a good starting point...
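For illustration, here is a minimal sketch of how automatically predicted LEMMA and FEATS columns could be copied into the existing annotation while leaving the manual columns untouched. It is not the pipeline used above. It assumes two CoNLL-U files that are aligned line by line (same tokenization, same comments), and the function name `merge_lemma_feats` is hypothetical.

```python
def merge_lemma_feats(gold_lines, pred_lines):
    """Copy LEMMA (column 3) and FEATS (column 6) from pred into gold.

    Assumption: both inputs are lists of CoNLL-U lines without trailing
    newlines, aligned one to one.
    """
    if len(gold_lines) != len(pred_lines):
        raise ValueError("files are not line-aligned")
    merged = []
    for gold, pred in zip(gold_lines, pred_lines):
        g = gold.split("\t")
        # Keep comments, blank lines, multiword ranges (1-2) and
        # empty nodes (1.1) exactly as they are in the gold file.
        if len(g) != 10 or not g[0].isdigit():
            merged.append(gold)
            continue
        p = pred.split("\t")
        if len(p) != 10 or p[1] != g[1]:
            raise ValueError(f"token mismatch: {gold!r} vs {pred!r}")
        g[2], g[5] = p[2], p[5]  # LEMMA, FEATS
        merged.append("\t".join(g))
    return merged
```

Raising on any token mismatch is deliberate: a silent misalignment would attach lemmas to the wrong words.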

marcospln commented 6 years ago

Nice! I would say it's a good starting point. Maybe we could use an external dictionary (DELAF_PB?) to mark as ambiguous those entries (token+TAG) that have more than one lemma. That way we can focus on reviewing only the ambiguous ones.
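A rough sketch of that filtering step, assuming the dictionary has already been converted to (form, UPOS, lemma) triples. The function names and the toy entries are made up for illustration and do not come from DELAF_PB.

```python
from collections import defaultdict


def build_lexicon(entries):
    """Map (lowercased form, UPOS) to the set of lemmas the dictionary lists."""
    lexicon = defaultdict(set)
    for form, upos, lemma in entries:
        lexicon[(form.lower(), upos)].add(lemma)
    return lexicon


def flag_ambiguous(conllu_lines, lexicon):
    """Return (id, form, upos, candidate lemmas) for tokens with more than one lemma."""
    flagged = []
    for line in conllu_lines:
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        lemmas = lexicon.get((cols[1].lower(), cols[3]), set())
        if len(lemmas) > 1:
            flagged.append((cols[0], cols[1], cols[3], sorted(lemmas)))
    return flagged


# Toy dictionary entries (hypothetical, for illustration only).
toy_entries = [
    ("foi", "VERB", "ir"),
    ("foi", "VERB", "ser"),
    ("casa", "NOUN", "casa"),
]
```

Tokens that are missing from the dictionary are not flagged here; they would need a separate pass.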

arademaker commented 6 years ago

@ceramisch I know that UDPipe makes many mistakes in the morphology. It seems to be driven by suffixes for all unknown words. But @marcospln's idea could be useful. You said that you generated the lemmas and feats; did you push your changes to the dev branch?

ceramisch commented 6 years ago

I like @marcospln's idea too.

I didn't push the changes yet; I'm currently working on the PARSEME corpora release.

I will probably create another branch and push the changes by the end of the week, OK?

marcospln commented 6 years ago

I wrote a script to convert DELAF_PB to EAGLES/FreeLing format, and another one to convert from EAGLES/FreeLing to UD (POS + feats). I guess it might need some review concerning the EAGLES tagset and the UD morphological features (I used it for the Galician-TreeGal treebank, but maybe there are some differences regarding the features used in UD).
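To give an idea of what the second conversion involves, here is a tiny sketch for noun tags only. It is not the script mentioned above. It assumes FreeLing-style positional tags such as `NCMS000` (category, type, gender, number), and the function name and mapping tables are illustrative assumptions that would need checking against the actual tagset.

```python
# Assumed positional layout of a noun tag: N, type, gender, number, ...
NOUN_TYPE_TO_UPOS = {"C": "NOUN", "P": "PROPN"}
GENDER_TO_UD = {"M": "Gender=Masc", "F": "Gender=Fem"}
NUMBER_TO_UD = {"S": "Number=Sing", "P": "Number=Plur"}


def eagles_noun_to_ud(tag):
    """Convert a noun tag like 'NCMS000' to a (UPOS, FEATS) pair.

    Underspecified values (e.g. '0', common gender 'C', invariable
    number 'N') are left out of FEATS rather than guessed.
    """
    if len(tag) < 4 or tag[0] != "N":
        raise ValueError(f"not a noun tag: {tag!r}")
    upos = NOUN_TYPE_TO_UPOS.get(tag[1], "NOUN")
    feats = [
        value
        for value in (GENDER_TO_UD.get(tag[2]), NUMBER_TO_UD.get(tag[3]))
        if value is not None
    ]
    # CoNLL-U expects features sorted alphabetically, or '_' when empty.
    return upos, "|".join(sorted(feats)) or "_"
```

Other categories (verbs, adjectives, determiners) have their own positional layouts and would each need a table like this.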

With DELAF_PB (UD version) we could (a) use it to mark ambiguous entries or (b) train a UDPipe model using the dictionary (I hope this approach involves fewer mistakes than the generic model).

arademaker commented 6 years ago

Features must be revised.

arademaker commented 4 years ago

We may use https://github.com/LFG-PTBR/MorphoBr