andrewyates closed 3 years ago
It looks like there's a bit of data wrangling to do here, which means these will be especially helpful additions.
TREC Genomics
Proposed hierarchy
highwire (empty placeholder)
  highwire/trec-genomics (docs)
    highwire/trec-genomics/2006 (docs, queries, qrels)
    highwire/trec-genomics/2007 (docs, queries, qrels)
medline (empty placeholder)
  medline/trec-genomics (docs)
    medline/trec-genomics/2004 (docs, queries, qrels)
    medline/trec-genomics/2005 (docs, queries, qrels)
Glancing through the data, some of these formats are non-standard. For instance, the highwire/trec-genomics/* qrels include position information and use labels such as NOT, POSSIBLY, and DEFINITELY rather than numeric relevance scores. I think I'll make the call to map these to integer scores (still documenting the definitions, as with all other qrels) for compatibility.
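As a rough illustration of that mapping, here is a minimal sketch. The integer values (0, 1, 2), the whitespace-separated field order, and the names RELEVANCE_MAP, PassageQrel, and parse_qrel_line are all assumptions made for the example; none of them is taken from the actual qrels files or from the ir_datasets code.

```python
from typing import NamedTuple

# Assumed mapping: the issue only says the labels become integers,
# not which integers. These values are illustrative.
RELEVANCE_MAP = {"NOT": 0, "POSSIBLY": 1, "DEFINITELY": 2}


class PassageQrel(NamedTuple):
    """Hypothetical record type: a qrel that also carries passage position."""
    query_id: str
    doc_id: str
    start: int
    length: int
    relevance: int


def parse_qrel_line(line: str) -> PassageQrel:
    """Parse one line, assuming the (hypothetical) field order:
    query_id doc_id start length label
    """
    query_id, doc_id, start, length, label = line.split()
    # An unknown label raises KeyError rather than being silently scored.
    return PassageQrel(query_id, doc_id, int(start), int(length), RELEVANCE_MAP[label])
```

Keeping the label definitions in one dictionary means the documented meaning of each score and the parsing logic cannot drift apart.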
TREC CDS / PM
Proposed hierarchy
pmc (empty placeholder)
  pmc/trec-cds-v1 (docs)
    pmc/trec-cds-v1/2014 (docs, queries, qrels)
    pmc/trec-cds-v1/2015 (docs, queries, qrels)
  pmc/trec-cds-v2 (docs)
    pmc/trec-cds-v2/2016 (docs, queries, qrels)
trec-pm (empty placeholder)
  trec-pm/v1 (docs)
    trec-pm/v1/2017 (docs, queries, qrels)
    trec-pm/v1/2018 (docs, queries, qrels)
  trec-pm/v2 (docs)
    trec-pm/v2/2019 (docs, queries, qrels)
    trec-pm/v2/2020 (docs, queries, qrels)
There are two versions of the PMC dataset (v1 is used by CDS 14-15 and v2 is used by CDS 16). I made up the names v1 and v2 because there doesn't appear to be a name for these particular subsets.
TREC Precision Medicine (PM) uses a corpus from several sources, so I've moved it up to the top level (like the proposal for trec-cast). Here again, I made up the names v1 and v2, since they do not otherwise appear to be named.
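To make the relationship between year-level subsets and their corpus explicit, the proposed CDS / PM hierarchy can be modelled as a small lookup table. This is only a sketch of the proposal as written in this issue: the PROPOSED dictionary and the corpus_for helper are hypothetical names and are not part of the ir_datasets API.

```python
# Proposed dataset IDs mapped to the components each one provides.
# An empty tuple marks an "empty placeholder" entry.
PROPOSED = {
    "pmc": (),
    "pmc/trec-cds-v1": ("docs",),
    "pmc/trec-cds-v1/2014": ("docs", "queries", "qrels"),
    "pmc/trec-cds-v1/2015": ("docs", "queries", "qrels"),
    "pmc/trec-cds-v2": ("docs",),
    "pmc/trec-cds-v2/2016": ("docs", "queries", "qrels"),
    "trec-pm": (),
    "trec-pm/v1": ("docs",),
    "trec-pm/v1/2017": ("docs", "queries", "qrels"),
    "trec-pm/v1/2018": ("docs", "queries", "qrels"),
    "trec-pm/v2": ("docs",),
    "trec-pm/v2/2019": ("docs", "queries", "qrels"),
    "trec-pm/v2/2020": ("docs", "queries", "qrels"),
}


def corpus_for(dataset_id: str) -> str:
    """Return the docs-only (corpus-level) ID that a dataset ID belongs to."""
    if dataset_id not in PROPOSED:
        raise KeyError(dataset_id)
    current = dataset_id
    while True:
        # A corpus-level entry provides docs and nothing else.
        if PROPOSED[current] == ("docs",):
            return current
        if "/" not in current:
            raise ValueError(f"{dataset_id} has no corpus-level entry")
        # Step up one level in the path.
        current = current.rsplit("/", 1)[0]
```

For example, corpus_for("pmc/trec-cds-v2/2016") resolves to "pmc/trec-cds-v2", reflecting that CDS 16 uses the second version of the PMC corpus.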
Remaining datasets:
Both TREC Genomics (Highwire collection) and CDS (PubMed Central) are freely available: https://dmice.ohsu.edu/trec-gen/data.html http://www.trec-cds.org/