GATK Pipeline Validation Testing

alex-hancock commented 7 years ago

@fnothaft has concerns (upon which he will elaborate) regarding the testing process.

jpfeil commented 7 years ago

The current plan is to run the Ashkenazim trio (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/) on an AWS cluster. The trio consists of three Illumina samples sequenced with 2x250 paired-end reads. GIAB also provides high-confidence variant calls for each sample, which we can use for benchmarking. We plan to genotype each sample individually and filter variants using hard filters. The goal is to test a configuration similar to the one the ADAM/GATK comparison will use. It was brought up that testing three samples may not be sufficient. We can find more samples, but each additional sample adds to the cost of the run. How many samples are you planning to run for the ADAM/GATK comparison?
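
To make the benchmarking step concrete, here is a rough sketch of how pipeline calls could be scored against the GIAB high-confidence calls. It is only an illustration: the names `Variant`, `Concordance`, and `compare_calls` are hypothetical, variants are assumed to be already parsed into (chrom, pos, ref, alt) tuples, and the comparison is a naive exact match. It does not normalize variant representation or restrict to high-confidence regions, which a real benchmarking tool would need to do.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical representation: one variant as (chrom, pos, ref, alt).
Variant = Tuple[str, int, str, str]


class Concordance(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float


def compare_calls(called: Set[Variant], truth: Set[Variant]) -> Concordance:
    """Score a call set against a truth set by exact tuple match."""
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return Concordance(tp, fp, fn, precision, recall)


# Toy data, not real GIAB variants.
truth = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 300, "G", "A")}
called = {("1", 100, "A", "G"), ("2", 300, "G", "A"), ("2", 400, "T", "C")}
result = compare_calls(called, truth)
print(result.true_positives, result.false_positives, result.false_negatives)
```

Running the same comparison per sample would give one precision/recall pair for each member of the trio.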

fnothaft commented 7 years ago

10 for the head-to-head, 260 for ADAM only. I would probably go with a larger dataset than a trio; the Illumina Platinum Pedigree (http://www.illumina.com/platinumgenomes/) is the one I would run. Essentially, I'd like to push at least 1 TB of data through the pipeline.

jpfeil commented 7 years ago

Okay! That sounds good. Alex and I will kick off a run with the Platinum Pedigree samples ASAP.

BD2KGenomics / toil-scripts

GATK Pipeline Validation Testing #470