allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

(Bio)medical TREC datasets #28

Closed andrewyates closed 3 years ago

andrewyates commented 3 years ago

Both TREC Genomics (Highwire collection) and TREC CDS (PubMed Central) are freely available: https://dmice.ohsu.edu/trec-gen/data.html and http://www.trec-cds.org/

seanmacavaney commented 3 years ago

It looks like there's a bit of data wrangling to do here, which means these will be especially helpful additions.

TREC Genomics

Proposed hierarchy

Glancing through the data, some of these formats are non-standard. For instance, the highwire/trec-genomics/* qrels include position information and use symbols like NOT, POSSIBLY, and DEFINITELY rather than numeric relevance scores. I think I'll make the call to map these to integer scores for compatibility (still documenting the definitions, as for all other qrels).
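To make the symbol-to-integer mapping concrete, here is a rough sketch of what it could look like. The integer values (0/1/2), the `PassageQrel` type, the `parse_qrel_line` helper, and the whitespace-separated field order are all assumptions for illustration; the actual Genomics qrels layout and the final ir_datasets implementation may differ.

```python
from typing import NamedTuple

# Assumed mapping: the thread only says the symbols get mapped to integers,
# so these specific values are illustrative.
RELEVANCE_LEVELS = {"NOT": 0, "POSSIBLY": 1, "DEFINITELY": 2}


class PassageQrel(NamedTuple):
    """Hypothetical qrel record that keeps the position information."""
    query_id: str
    doc_id: str
    start: int
    length: int
    relevance: int


def parse_qrel_line(line: str) -> PassageQrel:
    """Parse one line in an assumed 'query doc start length LABEL' layout."""
    query_id, doc_id, start, length, label = line.split()
    # An unknown label raises KeyError rather than being silently dropped.
    return PassageQrel(query_id, doc_id, int(start), int(length),
                       RELEVANCE_LEVELS[label])


if __name__ == "__main__":
    print(parse_qrel_line("160 12345 100 50 DEFINITELY"))
```

Keeping the symbolic definitions documented next to the integer mapping means tools that expect numeric relevance still work, while the original meaning of each level stays recoverable.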

TREC CDS / PM

Proposed hierarchy

There are two versions of the PMC dataset (v1 is used by CDS 14-15 and v2 by CDS 16). I made up the names v1 and v2 because these particular subsets don't appear to have names.

TREC Precision Medicine (PM) uses a corpus from several sources, so I've moved it up to the top level (like the proposal for trec-cast). Here again, I made up the names v1 and v2, since they otherwise do not appear to be named.

seanmacavaney commented 3 years ago

Remaining datasets: