Improve reference data practices

Our workflows often rely on reference datasets that we mount into the VM from a private NFS server: essentially, every path in this file with a /data/ prefix. For example:

      "path": "/data/exome-seq/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar"
      "path": "/data/exome-seq/b37/Mills_and_1000G_gold_standard.indels.b37.vcf"
        "path": "/data/exome-seq/capture/xgen-exome-research-panel-targetsae255a1532796e2eaa53ff00001c1b3c-trimmed-chr.bed"
        "path": "/data/exome-seq/b37/dbsnp_138.b37.vcf"
        "path": "/data/exome-seq/b37/Mills_and_1000G_gold_standard.indels.b37.vcf"
        "path": "/data/exome-seq/b37/1000G_phase1.indels.b37.vcf"
        "path": "/data/exome-seq/capture/xgen-exome-research-panel-probesbe255a1532796e2eaa53ff00001c1b3c-trimmed-chr.bed"
      "path": "/data/exome-seq/b37/decoy/human_g1k_v37_decoy.fasta"
      "path": "/data/exome-seq/b37/dbsnp_138.b37.vcf"
      "path": "/data/exome-seq/b37/1000G_phase1.snps.high_confidence.b37.vcf"
      "path": "/data/exome-seq/b37/hapmap/hapmap_3.3.b37.vcf"
      "path": "/data/exome-seq/b37/omni/1000G_omni2.5.b37.vcf"

While some of the referenced datasets may seem obvious to those with domain expertise, their provenance is not made explicit. We also do not provide checksums, file sizes, or access to these files.

Let's come up with a strategy to address these shortcomings.
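As one possible starting point, here is a minimal sketch of a manifest generator that records the size and SHA-256 checksum of each reference file, with an optional slot for provenance. The function names (`describe_file`, `build_manifest`, `verify`) and the `source_url` field are hypothetical and not part of any existing tooling in this repo; the two paths in the example are taken from the list above.

```python
import hashlib
import json
import os


def describe_file(path, source_url=None, chunk_size=1024 * 1024):
    """Return a manifest entry (path, size, SHA-256, optional provenance) for one file."""
    digest = hashlib.sha256()
    # Read in chunks so multi-gigabyte reference files do not need to fit in memory.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "path": path,
        "size": os.path.getsize(path),
        "sha256": digest.hexdigest(),
        # Hypothetical field: where the file was originally downloaded from.
        "source_url": source_url,
    }


def build_manifest(paths):
    """Describe every file in `paths`."""
    return [describe_file(p) for p in paths]


def verify(entry):
    """Check that the file on disk still matches a previously recorded entry."""
    current = describe_file(entry["path"])
    return current["size"] == entry["size"] and current["sha256"] == entry["sha256"]


if __name__ == "__main__":
    reference_paths = [
        "/data/exome-seq/b37/dbsnp_138.b37.vcf",
        "/data/exome-seq/b37/hapmap/hapmap_3.3.b37.vcf",
    ]
    # Skip files that are not mounted on this machine.
    existing = [p for p in reference_paths if os.path.exists(p)]
    print(json.dumps(build_manifest(existing), indent=2))
```

A manifest like this could be committed next to the job files, so anyone who obtains the data elsewhere can confirm they have the same bytes. It does not solve the access problem on its own.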
