gatk-workflows / five-dollar-genome-analysis-pipeline

Workflows used for WGS data processing -- replaced by https://github.com/gatk-workflows/gatk4-genome-processing-pipeline
https://gatk.broadinstitute.org/hc/en-us
BSD 3-Clause "New" or "Revised" License

Add genotyping and filtering option to single-sample workflow #2

Closed ldgauthier closed 6 years ago

ldgauthier commented 6 years ago

I have a version without imports that works. This version won't validate with womtool, probably because of the imports:

ERROR: Cannot find reference to 'CheckContamination' for member access 'CheckContamination.contamination' (line 190, col 47):

The accuracy for the synthetic diploid sample (https://doi.org/10.1101/223297) was compared with a ~800-sample callset from production: sensitivity is very close for SNPs (slightly lower) and better for indels, with more FPs for both SNPs and indels.

97.3% sensitivity for SNPs
99.4% precision for SNPs
65.9% sensitivity for indels*
98.7% precision for indels -- 3.8 FP/Mb

*This dataset excludes 1bp indels, which are the most common and also the easiest. It also includes some very large events from PacBio not possible to call with Illumina.

SNP sensitivity is on par with that reported in the SynDip paper. Indel sensitivity is lower.
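For anyone reproducing numbers like the ones above, here is a minimal sketch of how sensitivity, precision, and FP/Mb are usually derived from true-positive, false-negative, and false-positive counts. It assumes the standard definitions; the thread does not say which evaluation tool or confident region was used, and the counts and region size in the comments are hypothetical placeholders, not values from this evaluation.

```python
def sensitivity(tp, fn):
    """Fraction of truth variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of called variants that are in the truth set: TP / (TP + FP)."""
    return tp / (tp + fp)


def fp_per_mb(fp, region_bp):
    """False positives per megabase of evaluated region."""
    return fp / (region_bp / 1e6)


# Hypothetical counts for illustration only (not from the SynDip comparison):
# 900 true positives, 100 false negatives, 50 false positives over a 10 Mb region.
example_tp, example_fn, example_fp = 900, 100, 50
example_region_bp = 10_000_000

print(f"sensitivity: {sensitivity(example_tp, example_fn):.3f}")
print(f"precision:   {precision(example_tp, example_fp):.3f}")
print(f"FP/Mb:       {fp_per_mb(example_fp, example_region_bp):.1f}")
```

Note that the indel figures depend heavily on which events are in the truth set, as the footnote about 1bp indels and large PacBio events points out.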

ldgauthier commented 6 years ago

Latest update is working with and without genotyping and filtering. I effectively overrode some of the existing tasks, which is maybe not good WDL style, but I needed new arguments.

For the record, the directory structure of the imports is a big pain for users (like me) running Cromwell in server mode (see https://gatkforums.broadinstitute.org/gatk/discussion/comment/46211). It would be great if all the imports were inside some parent folder that could be easily zipped to submit the subworkflows.

bshifaw commented 6 years ago

The imports structure shouldn't be a problem to change for this repo since there's only one main workflow calling the tasks. But this will probably continue in dsde pipelines, where there may be different workflows calling the same tasks. So I'll put all imports under a directory called subworkflows and leave the main workflow and json as is.

|_README.md
|_germline_single_sample_workflow.hg38.inputs.json
|_germline_single_sample_workflow.wdl
|_subworkflows
|.....|_split_large_readgroup.wdl
|.....|_unmapped_bam_to_aligned_bam.wdl
|.....|_alignment.wdl
|.....|_bam_processing.wdl
|.....|_germline_variant_discovery.wdl
|.....|_qc.wdl
|.....|_utilities.wdl
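With the imports collected under one directory, bundling them for a Cromwell server submission becomes a one-step job. Below is a rough sketch using only the Python standard library. The function name `zip_imports` and the `keep_parent` switch are hypothetical, made up for this illustration. Whether the archive should contain `subworkflows/alignment.wdl` or just `alignment.wdl` depends on how the main WDL writes its import paths, which this thread does not show, so the sketch supports both.

```python
import zipfile
from pathlib import Path


def zip_imports(import_dir, zip_path, keep_parent=True):
    """Bundle every .wdl file under import_dir into a zip archive.

    keep_parent=True stores entries as '<import_dir name>/<file>.wdl';
    keep_parent=False stores them at the archive root. Pick whichever
    matches the import paths written in the main workflow (an assumption
    the caller has to check).
    Returns the list of archive entry names that were written.
    """
    import_dir = Path(import_dir)
    root = import_dir.parent if keep_parent else import_dir
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for wdl in sorted(import_dir.rglob("*.wdl")):
            arcname = wdl.relative_to(root).as_posix()
            zf.write(wdl, arcname)
            names.append(arcname)
    return names
```

Calling `zip_imports("subworkflows", "imports.zip")` from the repo root would produce an archive that can be supplied as the workflow dependencies when submitting to the server, alongside the main WDL and the inputs json.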