Duke-GCB / bespin

Reproducible genomic workflows in the cloud

Improve reference data practices #7

Open dleehr opened 5 years ago

dleehr commented 5 years ago

Our workflows often rely on reference datasets that we mount into the VM from a private NFS server: essentially, every entry in this file with a /data/ prefix. Examples below:

    "path": "/data/exome-seq/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar"
    "path": "/data/exome-seq/b37/Mills_and_1000G_gold_standard.indels.b37.vcf"
    "path": "/data/exome-seq/capture/xgen-exome-research-panel-targetsae255a1532796e2eaa53ff00001c1b3c-trimmed-chr.bed"
    "path": "/data/exome-seq/b37/dbsnp_138.b37.vcf"
    "path": "/data/exome-seq/b37/Mills_and_1000G_gold_standard.indels.b37.vcf"
    "path": "/data/exome-seq/b37/1000G_phase1.indels.b37.vcf"
    "path": "/data/exome-seq/capture/xgen-exome-research-panel-probesbe255a1532796e2eaa53ff00001c1b3c-trimmed-chr.bed"
    "path": "/data/exome-seq/b37/decoy/human_g1k_v37_decoy.fasta"
    "path": "/data/exome-seq/b37/dbsnp_138.b37.vcf"
    "path": "/data/exome-seq/b37/1000G_phase1.snps.high_confidence.b37.vcf"
    "path": "/data/exome-seq/b37/hapmap/hapmap_3.3.b37.vcf"
    "path": "/data/exome-seq/b37/omni/1000G_omni2.5.b37.vcf"

While the origin of some of these datasets may seem obvious to those with domain expertise, their provenance is not made explicit anywhere. We also do not publish checksums or file sizes for these files, and we do not provide access to the files themselves.

Let's come up with a strategy to address these shortcomings.
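
As one possible starting point for the checksum and file-size gap, here is a minimal Python sketch that records the size and a SHA-256 checksum for each referenced path. The function names (`describe_file`, `build_manifest`, `manifest_json`) and the manifest field names are hypothetical, chosen only for illustration; nothing here is an existing bespin convention, and it does not address provenance or access.

```python
import hashlib
import json
import os


def describe_file(path, chunk_size=1024 * 1024):
    """Return the size and SHA-256 checksum of one file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte reference files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "path": path,
        "size": os.path.getsize(path),  # size in bytes
        "checksum": "sha256:" + digest.hexdigest(),  # hypothetical field format
    }


def build_manifest(paths):
    """Describe each distinct path once, preserving first-seen order."""
    unique_paths = list(dict.fromkeys(paths))
    return [describe_file(path) for path in unique_paths]


def manifest_json(paths):
    """Serialize the manifest as indented JSON, suitable for committing."""
    return json.dumps(build_manifest(paths), indent=2)
```

Running `manifest_json` over the `/data/` paths listed above, on a machine that has the NFS share mounted, would give a file we could commit next to the workflow inputs. Paths that appear more than once in the job file (for example the dbsnp VCF) are described only once.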