CLIRMatrix - Githubissues

seanmacavaney commented 4 years ago

For cross-lingual IR.

CLIRMatrix: A massively large collection of bilingual & multilingual datasets for Cross-Lingual Information Retrieval. Shuo Sun and Kevin Duh.

Appearing at EMNLP'20. Doesn't seem like full text or data are available yet.

seanmacavaney commented 4 years ago

Here's the paper: https://www.aclweb.org/anthology/2020.emnlp-main.340.pdf

seanmacavaney commented 4 years ago

The data are available here: http://www.cs.jhu.edu/~shuosun/clirmatrix/.

It looks like the format should be easy to parse: documents are in TSV and queries/qrels/scoreddocs are in jsonl.
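As a rough sketch of what parsing could look like, assuming each document line is "doc_id<TAB>text" and each jsonl line is one JSON object. The two-column layout, the helper names, and the "id"/"text" field names in the sample are assumptions for illustration, not taken from the CLIRMatrix release:

```python
import io
import json

def read_tsv_docs(fileobj):
    """Yield (doc_id, text) pairs from tab-separated lines.

    Assumes two columns per line; lines without a tab are skipped.
    """
    for line in fileobj:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue
        doc_id, text = line.split("\t", 1)
        yield doc_id, text

def read_jsonl(fileobj):
    """Yield one parsed JSON object per non-empty line."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical sample data, only to show the calls.
docs = io.StringIO("d1\tfirst document text\nd2\tsecond document text\n")
queries = io.StringIO('{"id": "q1", "text": "example query"}\n')

for doc_id, text in read_tsv_docs(docs):
    print(doc_id, text)
for record in read_jsonl(queries):
    print(record)
```

Both readers are generators, so large files could be streamed without loading everything into memory.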

The big design decision here is how to deal with the large number of language combinations (18k+).

(1) Could do it as something like "clirmatrix/bi139/en/es" (where the first language code is either the query or document language). But this will greatly inflate the namespace.
(2) An alternative is to provide all combinations via "clirmatrix/bi139" and provide the language codes via properties of the namedtuples, but there may be a lot of overhead doing this.
(3) Or use #13 to specify the language pairs when loading the dataset, like ir_datasets.load("clirmatrix/bi139", doc_lang="en", query_lang="es").

I think I prefer option (3) above, but I still need to figure out what a good mechanism is for handling #13.

seanmacavaney commented 3 years ago

It's appealing to always be able to identify a dataset by a single string so it can be passed around easily. So maybe an option 3(b) could take a cue from the URL convention, like this: ir_datasets.load("clirmatrix/bi139?doc_lang=en&query_lang=es"). This would be a shortcut for option 3 above (the parameters would first be parsed like URL query parameters), but would allow everything to be specified as a string. ping #13.
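A minimal sketch of how the string form could be split into a base ID plus keyword arguments, using the standard library's URL query parsing. The helper name parse_dataset_id is hypothetical and this is not how ir_datasets actually handles it:

```python
from urllib.parse import parse_qsl

def parse_dataset_id(dataset_id):
    """Split 'name?key=value&...' into (name, kwargs).

    Hypothetical helper: everything before the first '?' is the base
    dataset ID, the rest is parsed like a URL query string.
    """
    base, _, query = dataset_id.partition("?")
    # strict_parsing raises ValueError on malformed pairs such as "doc_lang";
    # an empty query is handled separately because strict parsing rejects it.
    kwargs = dict(parse_qsl(query, strict_parsing=True)) if query else {}
    return base, kwargs

base, kwargs = parse_dataset_id("clirmatrix/bi139?doc_lang=en&query_lang=es")
print(base)    # clirmatrix/bi139
print(kwargs)  # {'doc_lang': 'en', 'query_lang': 'es'}
```

The result could then be forwarded as something like load(base, **kwargs), so the string form and the keyword form in option 3 would stay equivalent.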

seanmacavaney commented 3 years ago

Thanks for the great contribution @ssun32!

allenai / ir_datasets

CLIRMatrix #4