Include scripts for 10-fold cross-validation

leoalenc commented 2 months ago

[x] include scripts for 10-fold cross-validation of parsing with gold tokenization and gold tags
[x] improve the scripts
[ ] include scripts for 10-fold cross-validation of raw-text parsing

Scripts for replicating the experiments in this paper:

ALENCAR, Leonel Figueiredo de. A Universal Dependencies Treebank for Nheengatu. In: GAMALLO, Pablo; CLARO, Daniela; TEIXEIRA, António J. S.; REAL, Livy; GARCÍA, Marcos; OLIVEIRA, Hugo Gonçalo; AMARO, Raquel (Eds.). Proceedings of the 16th International Conference on Computational Processing of Portuguese, PROPOR 2024, Santiago de Compostela, Galicia/Spain, 12-15 March, 2024. Stroudsburg, PA, USA: Association for Computational Linguistics, 2024. v. 2, p. 37-54. Available at: https://aclanthology.org/2024.propor-2.8.

@inproceedings{DeAlencar2024a,
  author = "de Alencar, Leonel Figueiredo",
  editor  = {Pablo Gamallo and
            Daniela Claro and
            Ant{\'{o}}nio J. S. Teixeira and
            Livy Real and
            Marcos Garc{\'{\i}}a and
            Hugo Gon{\c{c}}alo Oliveira and
            Raquel Amaro},
  title = "A {U}niversal {D}ependencies Treebank for {N}heengatu",
  booktitle = {Proceedings of the 16th International Conference on Computational Processing of Portuguese, {PROPOR} 2024, Santiago de Compostela, Galicia/Spain, 12-15 March, 2024},
  pages = "37--54",
  volume = {2},
  publisher = {Association for Computational Linguistics},
  year = {2024},
  month = {3},
  url = "https://aclanthology.org/2024.propor-2.8",
  address = {Stroudsburg, PA, USA},
  abstract="We present UD_Nheengatu-CompLin, the inaugural treebank for Nheengatu, an endangered Indigenous language of Brazil with limited digital resources. This treebank stands as the largest among Indigenous American languages in version 2.13 of the Universal Dependencies collection. The developmental version comprises 1,336 trees, encompassing 13,246 tokens and 13,374 words. In a 10-fold cross-validation experiment using UDPipe 1.2, parsing with gold tokenization and gold tags achieved a labeled attachment score (LAS) of 81.17 ± 1.02, outperforming Yauti, the rule-based analyzer employed for sentence annotation.",
  isbn = {979-8-89176-062-2},
  doi = "10.5281/zenodo.11372209"
}
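
The abstract above reports LAS as a mean with a "±" spread over the ten folds. A minimal sketch of how per-fold scores could be summarized that way, assuming the spread is the sample standard deviation (the paper may use a different measure); the function names and the example scores are hypothetical and are not results from the paper:

```python
from statistics import mean, stdev


def summarize_folds(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    if len(scores) < 2:
        raise ValueError("need at least two fold scores")
    return mean(scores), stdev(scores)


def format_score(scores):
    """Format per-fold scores as 'mean ± spread' with two decimals."""
    m, s = summarize_folds(scores)
    return f"{m:.2f} ± {s:.2f}"


# Hypothetical per-fold LAS values, for illustration only.
example_scores = [80.0, 82.0]
print(format_score(example_scores))
```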

leoalenc commented 2 months ago

@dominickmaia, the commit now includes the scripts for 10-fold cross-validation of parsing with gold tokenization and gold tags.
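
For readers unfamiliar with the setup, here is a rough sketch of how a treebank in CoNLL-U format could be partitioned into ten train/test folds. It is not the code in the commit: the function names, the seeded shuffle, and the round-robin fold assignment are all assumptions made for illustration.

```python
import random
import re


def read_conllu_sentences(text):
    """Split CoNLL-U text into sentence blocks separated by blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [b.strip() for b in blocks if b.strip()]


def k_fold_splits(sentences, k=10, seed=42):
    """Yield (train, test) sentence lists for each of k folds.

    Sentences are shuffled once with a fixed seed, then dealt
    round-robin into k folds; each fold serves as test set once.
    """
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for fold in folds:
        test_ids = set(fold)
        train = [sentences[j] for j in indices if j not in test_ids]
        test = [sentences[j] for j in fold]
        yield train, test
```

Each `(train, test)` pair would then be written to files and passed to the parser's training and evaluation steps, which are not shown here.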

dominickmaia commented 2 months ago

Thank you, @leoalenc

leoalenc commented 2 months ago

On the evaluation of dependency parsing (UAS, LAS, and related metrics):

https://web.stanford.edu/~jurafsky/slp3/old_oct19/15.pdf

@inproceedings{nivre-fang-2017-universal,
    title = "{U}niversal {D}ependency Evaluation",
    author = "Nivre, Joakim  and
      Fang, Chiao-Ting",
    editor = "de Marneffe, Marie-Catherine  and
      Nivre, Joakim  and
      Schuster, Sebastian",
    booktitle = "Proceedings of the {N}o{D}a{L}i{D}a 2017 Workshop on Universal Dependencies ({UDW} 2017)",
    month = may,
    year = "2017",
    address = "Gothenburg, Sweden",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W17-0411",
    pages = "86--95",
}
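
As a quick illustration of the two metrics mentioned above: UAS is the percentage of words assigned the correct head, and LAS is the percentage assigned both the correct head and the correct relation label. The sketch below assumes gold tokenization, so gold and predicted files contain the same words in the same order, and it compares full relation strings, subtypes included; evaluation tools may differ on that last point. The function names are hypothetical.

```python
def read_arcs(conllu_text):
    """Return a list of (head, deprel) pairs, one per syntactic word.

    Comment lines, multiword-token ranges (IDs like 1-2) and
    empty nodes (IDs like 1.1) are skipped.
    """
    arcs = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        arcs.append((cols[6], cols[7]))  # HEAD and DEPREL columns
    return arcs


def attachment_scores(gold_text, pred_text):
    """Return (UAS, LAS) as percentages, assuming identical tokenization."""
    gold = read_arcs(gold_text)
    pred = read_arcs(pred_text)
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and predicted word counts must match and be non-zero")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_arcs = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100 * correct_heads / n, 100 * correct_arcs / n
```

For parsing raw text, where predicted tokenization can differ from gold, words first have to be aligned before scoring, which this sketch does not attempt.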
