Open alex-hancock opened 7 years ago
The current plan is to run the Ashkenazim trio (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/) on an AWS cluster. The trio consists of three 2x250 Illumina samples, and GIAB provides high-confidence variant calls for each sample for benchmarking. We plan to genotype each sample individually and filter variants using hard filters. The goal is to test a configuration similar to the one the ADAM/GATK comparison will use. It was brought up that testing three samples may not be sufficient. We can find more samples to run, but each additional sample adds to the cost. How many samples are you planning to run for the ADAM/GATK comparison?
10 for the head-to-head, 260 for ADAM only. I would probably go with a larger dataset than a trio; the Illumina Platinum Pedigree (http://www.illumina.com/platinumgenomes/) is something I would run through. Essentially, I'd like to push at least 1 TB of data through the pipeline.
Okay! That sounds good. Alex and I will kick off a run with the Platinum Pedigree samples ASAP.
@fnothaft has concerns (upon which he will elaborate) regarding the testing process.