DaReCzech - Githubissues

Dataset Information:

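A large Czech-language dataset for web search relevance ranking (about 1.6 million query-document pairs with manually assigned relevance labels, per the paper).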

Links to Resources:

Repo: https://github.com/Seznam/DaReCzech
Paper: https://arxiv.org/pdf/2112.01810.pdf

Dataset ID(s) & supported entities:

dareczech (docs)
dareczech/train (docs, queries, qrels)
dareczech/train/small (docs, queries, qrels)
dareczech/dev (docs, queries, qrels)
dareczech/test (docs, queries, qrels)

It appears to be a re-ranking dataset, so scoreddocs will likely also be provided.
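To make the proposed layout concrete, here is a minimal sketch of the IDs and entity types listed above as a plain Python mapping. The names `PROPOSED_ENTITIES`, `ids_providing`, and `top_id` are hypothetical helpers for illustration and are not part of ir_datasets. scoreddocs is left out because its availability is still only a guess at this point.

```python
# Proposed dataset IDs and the entity types listed in this issue.
# scoreddocs is intentionally omitted: the issue only says it will
# "likely" be provided, so it is not confirmed.
PROPOSED_ENTITIES = {
    "dareczech": ("docs",),
    "dareczech/train": ("docs", "queries", "qrels"),
    "dareczech/train/small": ("docs", "queries", "qrels"),
    "dareczech/dev": ("docs", "queries", "qrels"),
    "dareczech/test": ("docs", "queries", "qrels"),
}


def ids_providing(entity):
    """Return the proposed dataset IDs that list the given entity type."""
    return [
        ds_id
        for ds_id, entities in PROPOSED_ENTITIES.items()
        if entity in entities
    ]


def top_id(dataset_id):
    """Return the top-level ID (the part before the first slash)."""
    return dataset_id.split("/", 1)[0]


if __name__ == "__main__":
    for ds_id, entities in PROPOSED_ENTITIES.items():
        print(f"{ds_id}: {', '.join(entities)}")
```

Once the dataset is actually registered, each of these IDs would presumably be loadable with `ir_datasets.load(...)` and iterated with the usual `docs_iter()`, `queries_iter()`, and `qrels_iter()` methods. That is not possible until the dataset definition exists.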

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

[ ] Dataset definition (in ir_datasets/datasets/[topid].py)
[ ] Tests (in tests/integration/[topid].py)
[ ] Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
[ ] Documentation (in ir_datasets/etc/[topid].yaml)
- [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
[ ] Downloadable content (in ir_datasets/etc/downloads.json)
- [ ] Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
- [ ] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

The dataset is only available on request and after accepting a disclaimer, so it will be another semi-manual dataset: users obtain the files themselves, and ir_datasets provides instructions for access.
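As a rough illustration of what "semi-manual" could mean in practice, the sketch below checks for a user-supplied file and fails with access instructions if it is missing. This is not the ir_datasets implementation. `ManualDownloadRequired`, `require_manual_file`, and the instruction text are all hypothetical, and the real file names and target paths are not known from this issue.

```python
from pathlib import Path


class ManualDownloadRequired(FileNotFoundError):
    """Raised when a file that must be obtained by hand is missing."""


# Hypothetical instruction text; the real wording would come from the
# dataset's access procedure.
DARECZECH_INSTRUCTIONS = (
    "DaReCzech is only available on request and after accepting a disclaimer. "
    "See https://github.com/Seznam/DaReCzech for how to request access, "
    "then place the obtained file at the path shown above."
)


def require_manual_file(path, instructions):
    """Return the path if the file exists, otherwise raise with instructions."""
    path = Path(path)
    if not path.is_file():
        raise ManualDownloadRequired(f"Missing file: {path}\n{instructions}")
    return path
```

A caller would invoke `require_manual_file(some_path, DARECZECH_INSTRUCTIONS)` before reading any data, so the user sees how to request access instead of a bare file-not-found error.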

allenai / ir_datasets

DaReCzech #144