Here's the paper: https://www.aclweb.org/anthology/2020.emnlp-main.340.pdf
The data are available here: http://www.cs.jhu.edu/~shuosun/clirmatrix/.
It looks like the format should be easy to parse: documents are in TSV and queries/qrels/scoreddocs are in jsonl.
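As a rough sketch of what that parsing could look like, using only the standard library. The column layout (ID first, text second) and the helper names `iter_tsv_docs` and `iter_jsonl` are assumptions for illustration, not something checked against the actual CLIRMatrix files:

```python
import csv
import json


def iter_tsv_docs(fileobj):
    """Yield (doc_id, text) pairs from a tab-separated file.

    ASSUMPTION: the first column is the document ID and the second is the text.
    """
    # QUOTE_NONE so that stray quote characters in the text are kept verbatim.
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]


def iter_jsonl(fileobj):
    """Yield one decoded JSON object per non-empty line."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Both helpers take an already-open text file object, so decompression and encoding choices stay with the caller.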
The big design decision here is how to deal with the large number of language combinations (18k+). Some options:
1. A separate dataset ID per language pair, e.g. "clirmatrix/bi139/en/es" (where the first language code is either the query or document language). But this will greatly inflate the namespace.
2. A single dataset ID, "clirmatrix/bi139", and provide the language codes via properties of the namedtuples, but there may be a lot of overhead doing this.
3. Language codes passed as arguments: ir_datasets.load("clirmatrix/bi139", doc_lang="en", query_lang="es").

I think I prefer option (3) above, but I still need to figure out what a good mechanism is for handling #13.
It's appealing to always be able to specify a specific dataset via a string so it can be passed around easily. So maybe option 3(b) could take a cue from the URL convention, like this: ir_datasets.load("clirmatrix/bi139?doc_lang=en&query_lang=es"). This would be a shortcut for option 3 above (parameters would first be parsed like URL parameters), but would allow everything to be specified as a string. ping #13.
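A minimal sketch of how the URL-style ID could be split into a base name plus keyword arguments. The helper name `parse_dataset_id` is hypothetical, and passing the resulting parameters on to `ir_datasets.load` is only the proposal discussed here, not existing behavior:

```python
from urllib.parse import parse_qsl


def parse_dataset_id(dataset_id):
    """Split 'name?key=value&...' into (name, params).

    Hypothetical helper: the part before '?' is the base dataset ID,
    and the part after it is decoded like a URL query string.
    """
    name, _, query = dataset_id.partition("?")
    # parse_qsl returns a list of (key, value) pairs; an empty query gives [].
    params = dict(parse_qsl(query, keep_blank_values=True))
    return name, params


name, params = parse_dataset_id("clirmatrix/bi139?doc_lang=en&query_lang=es")
# Under the proposal, this would then be equivalent to option 3:
#   ir_datasets.load(name, **params)   # keyword arguments are NOT supported today
```

One thing this would need to settle is a canonical parameter order, so that the same dataset always maps to the same string.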
Thanks for the great contribution @ssun32!
For cross-lingual IR.
CLIRMatrix: A massively large collection of bilingual & multilingual datasets for Cross-Lingual Information Retrieval. Shuo Sun and Kevin Duh.
Appearing at EMNLP'20. Doesn't seem like full text or data are available yet.