UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Citations for individual treebanks #772

Open robvanderg opened 3 years ago

robvanderg commented 3 years ago

Recently, I wanted to evaluate a parser on as many UD datasets as possible, so I started requesting the datasets that are hosted without words in them (UD_English-ESL, UD_French-FTB, UD_Hindi_English-HIENCS, UD_Japanese-BCCWJ, UD_Arabic-NYUAD, UD_Mbya_Guarani-Dooley). However, for some of them I signed a contract stating that I have to cite their individual papers. I found this unfair to the other ~180 treebanks, so I started collecting citations for all treebanks. Would it make sense to include these somehow in the official repos?

The bibs can be found here: https://github.com/machamp-nlp/machamp/blob/master/docs/cites.tar.gz

Treebanks without a paper have a default bib entry with a link to the repo and the treebank creators as authors (if we could find them).
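To illustrate what such a default entry could look like, here is a minimal sketch in Python. It is not the script used to build the archive above: the function name `default_bib_entry`, the `@misc` entry type, the placeholder author, and the assumption that every treebank repo lives under `https://github.com/UniversalDependencies/` are illustrative choices only.

```python
from typing import List


def default_bib_entry(treebank: str, authors: List[str], year: int) -> str:
    """Build a fallback BibTeX entry for a treebank that has no paper.

    Hypothetical helper: the field layout is an assumption, not an
    agreed UD convention.
    """
    # Citation key derived from the repo name, e.g. UD_Example-Demo -> ud_example_demo
    key = treebank.lower().replace("-", "_")
    # Assumed repo location (one GitHub repo per treebank)
    url = "https://github.com/UniversalDependencies/" + treebank
    # Placeholder used when the creators could not be found
    author_field = " and ".join(authors) if authors else "{Universal Dependencies contributors}"
    # Underscores must be escaped in ordinary BibTeX fields
    title = treebank.replace("_", "\\_")
    lines = [
        "@misc{" + key + ",",
        "  author = {" + author_field + "},",
        "  title = {" + title + "},",
        "  howpublished = {\\url{" + url + "}},",
        "  year = {" + str(year) + "},",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up treebank and author names, for illustration only
    print(default_bib_entry("UD_Example-Demo", ["Doe, Jane", "Roe, Richard"], 2021))
```

Joining authors with " and " follows the standard BibTeX convention for author lists, so the creators are split correctly when the entry is rendered.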

PS: This was an exhausting task, and most likely there are some minor mistakes somewhere. Furthermore, we only included one citation per dataset, to make it fit in a table (and this decision was sometimes somewhat arbitrary).

ftyers commented 3 years ago

Thanks for doing that!

For Kazakh there is also:

@inproceedings{tyers_tl2015,
  author = {Tyers, Francis M. and Washington, Jonathan N.},
  title = {Towards a Free/Open-source Universal-dependency Treebank for Kazakh},
  booktitle = {3rd International Conference on Turkic Languages Processing (TurkLang 2015)},
  pages = {276--289},
  year = {2015},
}

For Bambara:

@inproceedings{aplonova_2018,
  author = {Aplonova, K. and Tyers, F. M.},
  title = {Towards a dependency treebank for Bambara},
  booktitle = {Proceedings of the 16th Conference on Treebanks and Linguistic Theories},
  pages = {138--146},
  year = {2018},
}

robvanderg commented 3 years ago

Ah, thanks for the additions. For now we've kept the number of citations to one per treebank, simply because our use case was to put them in a table: https://www.aclweb.org/anthology/2021.eacl-demos.22.pdf. Would you say the one you posted is preferred over the other? (If so, I'll make sure to replace it in the next version of the paper.)

Furthermore, if this is to be integrated into the official UD repos, multiple citations could of course be included per treebank.

ftyers commented 3 years ago

As for the Kazakh one, both are equally important. I looked at the table, and it seems you could just use two lines for those treebanks that have two papers. Essentially, the story is that we both worked independently on the Kazakh treebank, then found out about each other and decided to join forces. The current treebank is a joint effort between the two groups.

Stormur commented 3 years ago

Hi! Great work! For Latin LLCT, the correct main citation is:

Cecchini, F. M., Korkiakangas, T. and Passarotti, M. (2020). A New Latin Treebank for Universal Dependencies: Charters between Ancient Latin and Romance Languages. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC2020), Marseille, France, May. European Language Resources Association (ELRA).

jowagner commented 3 years ago

It would also be great if the preferred citation(s) were always in the treebank's readme. Of course, it would be problematic to push that info in there without approval. Maybe the readme template can be updated and treebank contributors contacted to add a section "Citation". Some treebanks use a section "Reference" in this way, others treat "References" like in a research paper.
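If such a section existed, tools could pull the preferred citation straight from the readme. Below is a minimal sketch in Python, assuming the readme uses Markdown `#` headings and that the section is titled "Citation", "Reference", or "References". The function name `extract_citation_section` and the heading names are assumptions for illustration, not an existing UD convention.

```python
import re
from typing import Optional, Sequence


def extract_citation_section(
    readme_text: str,
    names: Sequence[str] = ("Citation", "Reference", "References"),
) -> Optional[str]:
    """Return the body of the first matching section, or None if absent.

    Hypothetical helper: assumes Markdown-style '#' headings.
    """
    wanted = {name.lower() for name in names}
    heading = re.compile(r"^(#+)\s*(.+?)\s*$")
    lines = readme_text.splitlines()
    start = None  # index of the first line after the matching heading
    level = 0     # number of '#' characters in the matching heading
    for i, line in enumerate(lines):
        match = heading.match(line)
        if not match:
            continue
        if start is None:
            if match.group(2).lower() in wanted:
                start, level = i + 1, len(match.group(1))
        elif len(match.group(1)) <= level:
            # The next heading of the same or a higher level ends the section
            return "\n".join(lines[start:i]).strip()
    if start is None:
        return None
    # The section runs to the end of the file
    return "\n".join(lines[start:]).strip()
```

A treebank whose readme has no such section yields `None`, which makes it easy to list the treebanks whose contributors would still need to be contacted.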

robvanderg commented 3 years ago

Yes, I think it would be good to have 2 types of (groups of) citations in there:

If these were documented in the README in a standardized way, that would make it much simpler. And by default, at least the first one can be filled with a standard tex.