Open arademaker opened 6 years ago
Definitely! (Missing lemmas are another reason why I do not want to see the treebank in the shared task. We are evaluating lemmatization this year.)
I think you used a freely available morphological analyzer to re-tag (and lemmatize?) Bosque, right? Could you apply the same pipeline to GSD? Would disambiguation involve a lot of manual work?
I could help with this task (not in the short term, however).
I generated automatic LEMMAs and morphological FEATS with UDPipe, using 2.0 models trained on Bosque and applied to this GSD treebank, because it will be released as part of the MWE-annotated corpora in the PARSEME shared task. But I'm not sure this would be a good starting point...
Nice! I would say it's a good starting point. Maybe we could use an external dictionary (DELAF_PB?) to mark as ambiguous those entries (token+TAG) with more than one lemma. Thus we can focus on reviewing only the ambiguous ones.
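To make the idea concrete, here is a minimal sketch of the ambiguity check, assuming the dictionary has already been reduced to (form, lemma, UPOS) triples. The function names and the toy entries are hypothetical, and the real DELAF_PB format is not parsed here.

```python
from collections import defaultdict


def build_lemma_index(entries):
    """Map (lowercased form, UPOS) to the set of lemmas the dictionary lists."""
    index = defaultdict(set)
    for form, lemma, upos in entries:
        index[(form.lower(), upos)].add(lemma)
    return index


def flag_ambiguous(tokens, index):
    """Label each (form, UPOS) pair as 'ambiguous', 'unique' or 'unknown'.

    Returns a list of (form, upos, status, sorted candidate lemmas).
    """
    flagged = []
    for form, upos in tokens:
        lemmas = sorted(index.get((form.lower(), upos), ()))
        if len(lemmas) > 1:
            status = "ambiguous"   # needs manual review
        elif len(lemmas) == 1:
            status = "unique"      # dictionary lemma can be trusted as-is
        else:
            status = "unknown"     # not covered by the dictionary
        flagged.append((form, upos, status, lemmas))
    return flagged


# Toy entries for illustration only (not taken from DELAF_PB).
TOY_ENTRIES = [
    ("vimos", "ver", "VERB"),   # "we saw"
    ("vimos", "vir", "VERB"),   # "we come"
    ("casas", "casa", "NOUN"),
]

if __name__ == "__main__":
    index = build_lemma_index(TOY_ENTRIES)
    for row in flag_ambiguous([("Vimos", "VERB"), ("casas", "NOUN")], index):
        print(row)
```

Only the tokens labelled "ambiguous" (and possibly "unknown") would then need a manual pass.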
@ceramisch I know that UDPipe makes many mistakes in the morphology. It seems to be driven by the suffixes for all unknown words. But @marcospln's idea could be useful. You said that you generated the lemmas and feats; did you push your changes to the dev branch?
I like @marcospln's idea too.
I didn't push the changes, I'm currently working on the PARSEME corpora release.
I will probably create another branch and push the changes by the end of the week, OK?
I wrote a script to convert DELAF_PB to EAGLES/FreeLing format, and another one to convert EAGLES/FreeLing to UD (POS + feats). I guess it might need some review concerning the EAGLES tagset and the UD morphological features (I used it for the Galician-TreeGal treebank, but there may be some differences in the features used in UD).
With the DELAF_PB (UD version) we could (a) use it to mark ambiguous entries or (b) train a UDPipe model using the dictionary (I hope this approach involves fewer mistakes than the generic model).
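As a rough illustration of the EAGLES/FreeLing to UD step, here is a small sketch covering noun tags only. It is not the script mentioned above: the function name is hypothetical, and it assumes the usual positional noun layout (category, type, gender, number), leaving out any feature that the tag marks as common, invariable or unspecified.

```python
# Assumed positional values for FreeLing-style noun tags such as "NCMS000".
GENDER = {"M": "Masc", "F": "Fem"}
NUMBER = {"S": "Sing", "P": "Plur"}


def eagles_noun_to_ud(tag):
    """Convert an EAGLES/FreeLing noun tag to a (UPOS, FEATS) pair.

    Hypothetical helper: handles nouns only and raises ValueError otherwise.
    """
    if len(tag) < 4 or tag[0] != "N":
        raise ValueError(f"not a noun tag: {tag!r}")
    upos = "PROPN" if tag[1] == "P" else "NOUN"
    feats = []
    if tag[2] in GENDER:
        feats.append("Gender=" + GENDER[tag[2]])
    if tag[3] in NUMBER:
        feats.append("Number=" + NUMBER[tag[3]])
    # UD lists features alphabetically, joined by "|"; "_" means no features.
    return upos, ("|".join(feats) if feats else "_")


if __name__ == "__main__":
    for tag in ("NCMS000", "NCFP000", "NP00000"):
        print(tag, eagles_noun_to_ud(tag))
```

The feature inventory would still need to be checked against what the Portuguese UD treebanks actually use.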
Features must be revised.
We may use https://github.com/LFG-PTBR/MorphoBr
I would like to add lemmas to this corpus. Let us use this issue to discuss possible alternatives for that.