ga4gh / benchmarking-tools

Repository for the GA4GH Benchmarking Team work developing standardized benchmarking methods for germline small variant calls
Apache License 2.0

Add links to datasets for benchmarking? #10

Closed jzook closed 7 years ago

jzook commented 8 years ago

David Haussler suggested including links to datasets that can be used for benchmarking in this repo, which I think is a good idea. I suggest two categories of data for each genome: high-confidence VCF/BED files and raw data files. Does that make sense?

Here are the genomes I'll propose as a start:

jzook commented 8 years ago

Just to follow up on this, @pkrusche had suggested creating a simple human- and machine-readable file that gives information about truth sets and their locations. I'm thinking the following columns might be useful, and I'm interested in others' suggestions:

  1. Coriell_DNA_ID (e.g., NA12878)
  2. NCBI_Biosample (e.g., SAMN03492678)
  3. NIST_ID (e.g., HG001)
  4. NIST_RM (e.g., 8398)
  5. VCF_link (e.g., ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv2.19/NISTIntegratedCalls_14datasets_131103_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.19_2mindatasets_5minYesNoRatio_all_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf.gz)
  6. BED_link (e.g., ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv2.19/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.19_2mindatasets_5minYesNoRatio.bed.gz)
  7. README_link (e.g., ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv2.19/README.NIST.v2.19.txt)
  8. Call_source (e.g., NIST/GIAB)

Does a tab-delimited file seem best to everyone for this?
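
To make the proposal concrete, here is a minimal sketch of what such a tab-delimited file could look like when written and read back with Python's standard `csv` module. The column names and the example row are taken from the list above; the helper names `write_truth_sets` and `read_truth_sets` are hypothetical and only for illustration, and nothing here is meant as the final format of the file.

```python
import csv
import io

# Column names as proposed in the list above.
COLUMNS = [
    "Coriell_DNA_ID",
    "NCBI_Biosample",
    "NIST_ID",
    "NIST_RM",
    "VCF_link",
    "BED_link",
    "README_link",
    "Call_source",
]

# Common prefix of the three example links above.
BASE = "ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv2.19/"

# One row built from the example values given for each column.
ROW = {
    "Coriell_DNA_ID": "NA12878",
    "NCBI_Biosample": "SAMN03492678",
    "NIST_ID": "HG001",
    "NIST_RM": "8398",
    "VCF_link": BASE + (
        "NISTIntegratedCalls_14datasets_131103_allcall_UGHapMerge_HetHomVarPASS_"
        "VQSRv2.19_2mindatasets_5minYesNoRatio_all_nouncert_excludesimplerep_"
        "excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf.gz"
    ),
    "BED_link": BASE + (
        "union13callableMQonlymerged_addcert_nouncert_excludesimplerep_"
        "excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.19_"
        "2mindatasets_5minYesNoRatio.bed.gz"
    ),
    "README_link": BASE + "README.NIST.v2.19.txt",
    "Call_source": "NIST/GIAB",
}


def write_truth_sets(rows):
    """Serialize rows (dicts keyed by COLUMNS) to tab-delimited text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_truth_sets(text):
    """Parse tab-delimited text with a header line into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


if __name__ == "__main__":
    tsv_text = write_truth_sets([ROW])
    for record in read_truth_sets(tsv_text):
        print(record["NIST_ID"], record["Call_source"], record["VCF_link"])
```

A header line plus one row per truth set keeps the file readable in a text editor or spreadsheet while staying trivial to parse, which is the main argument for tab-delimited over a nested format here.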

RebeccaTruty commented 8 years ago

Maybe also a version number? Otherwise looks good!

pkrusche commented 8 years ago

PR #21 should address this.

jzook commented 8 years ago

@marghoob - Would you be interested in adding a description of the HuRef callset you made to our new list of benchmarking calls at https://github.com/ga4gh/benchmarking-tools/tree/master/resources/high-confidence-sets?

jzook commented 7 years ago

High-confidence sets added to https://github.com/ga4gh/benchmarking-tools/tree/master/resources/high-confidence-sets