Add links to datasets for benchmarking?

jzook commented 8 years ago

David Haussler suggested including links to datasets that can be used for benchmarking in this repo, which I think is a good idea. I suggest two categories of data for each genome: high-confidence VCF/BED files and raw data files. Does that make sense?

Here are the genomes I'll propose as a start:

- NA12878
  - high-confidence VCF/BED from GIAB, Platinum Genomes, and RTG; maybe SVs from NIST, MetaSV, Bashir
  - FASTQ from Illumina WGS, WES, Proton WES, others?
- NA12877 from Platinum Genomes?
- Venter genome from Bina?
- Baylor genome?
- GIAB PGP trios when calls are available next year

jzook commented 8 years ago

Just to follow up on this, @pkrusche had suggested creating a simple human- and machine-readable file that gives information about truth sets and their locations. I'm thinking the following columns might be useful, and I'm interested in others' suggestions:

Coriell_DNA_ID (e.g., NA12878)
NCBI_Biosample (e.g., SAMN03492678)
NIST_ID (e.g., HG001)
NIST_RM (e.g., 8398)
VCF_link (e.g., ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv2.19/NISTIntegratedCalls_14datasets_131103_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.19_2mindatasets_5minYesNoRatio_all_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf.gz)
BED_link (e.g., ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv2.19/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.19_2mindatasets_5minYesNoRatio.bed.gz)
README_link (e.g., ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv2.19/README.NIST.v2.19.txt)
Call_source (e.g., NIST/GIAB)

Does a tab-delimited file seem best to everyone for this?
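To make the proposal concrete, here is a minimal sketch of how such a tab-delimited file could be written and read back in Python. The column names and the example row come from the list above; the function names (`write_truth_sets`, `read_truth_sets`) and the choice to reject files with missing columns are only assumptions for illustration, not an agreed format.

```python
import csv
import io

# Columns proposed in this thread.
COLUMNS = [
    "Coriell_DNA_ID", "NCBI_Biosample", "NIST_ID", "NIST_RM",
    "VCF_link", "BED_link", "README_link", "Call_source",
]

# Example values taken from the column list above.
BASE = "ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv2.19/"
EXAMPLE_ROW = {
    "Coriell_DNA_ID": "NA12878",
    "NCBI_Biosample": "SAMN03492678",
    "NIST_ID": "HG001",
    "NIST_RM": "8398",
    "VCF_link": BASE + "NISTIntegratedCalls_14datasets_131103_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.19_2mindatasets_5minYesNoRatio_all_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf.gz",
    "BED_link": BASE + "union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.19_2mindatasets_5minYesNoRatio.bed.gz",
    "README_link": BASE + "README.NIST.v2.19.txt",
    "Call_source": "NIST/GIAB",
}


def write_truth_sets(rows, handle):
    """Write truth-set records as a tab-delimited table with a header row."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def read_truth_sets(handle):
    """Read the table back as a list of dicts, checking the expected columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_truth_sets([EXAMPLE_ROW], buffer)
buffer.seek(0)
records = read_truth_sets(buffer)
print(records[0]["NIST_ID"], records[0]["Call_source"])
```

A plain header row keeps the file readable in a spreadsheet while still being trivial to parse, and further columns could be appended without breaking readers that look fields up by name.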

RebeccaTruty commented 8 years ago

Maybe also a version number? Otherwise looks good!

pkrusche commented 8 years ago

PR #21 should address this.

jzook commented 8 years ago

@marghoob - Would you be interested in adding a description of the HuRef callset you made to our new list of benchmarking calls at https://github.com/ga4gh/benchmarking-tools/tree/master/resources/high-confidence-sets?

jzook commented 7 years ago

High-confidence sets added to https://github.com/ga4gh/benchmarking-tools/tree/master/resources/high-confidence-sets

ga4gh/benchmarking-tools, issue #10