common-workflow-library / legacy

Deprecated
https://github.com/common-workflow-library/bio-cwl-tools
Apache License 2.0

Message from GATK team #129

Status: Closed (opened by knoblett, closed 7 years ago)

knoblett commented 7 years ago

Hello, we saw that you have created some wrappers and a pipeline script for CWL users to run GATK. That is really great, and we appreciate it! That being said, we have a few suggestions to improve them.

Your pipeline unfortunately implements an outdated version of our Best Practices for germline short variant discovery, as described in the 2013 publication by Van der Auwera et al. Your links and diagrams, however, reference our current Best Practices implementation. Our pipeline has undergone several iterative updates since the paper was published, the most notable of which is the joint genotyping aspect of the workflow. This was a big breakthrough in the GATK 3.x versions: we split the variant calling functionality of HaplotypeCaller into two parts, allowing us to generate the intermediate files, called GVCFs, for each sample individually and then jointly call them to produce variant calls. We would encourage you to update your implementation to reflect our most recent Best Practices. However, if you’re not able to tackle that right now, we would like to ask you to at least rename it to avoid confusion with the current Best Practices, as the two are very different. If you do choose to update your implementation, labeling it “Best Practices as of Month Year” would be the most accurate in the long term.
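As a rough illustration of the two-part split described above, the sketch below builds (but does not run) the command lines for per-sample GVCF generation followed by joint genotyping. It assumes GATK 3.x command-line conventions (`-T`, `--emitRefConfidence GVCF`, `--variant`); the jar path, the file names, and the helper function names are hypothetical and not taken from either pipeline discussed in this thread.

```python
from typing import List, Tuple

# Assumption: path to a locally installed GATK 3.x jar.
GATK_JAR = "GenomeAnalysisTK.jar"


def haplotypecaller_gvcf_cmd(reference: str, bam: str, gvcf_out: str) -> List[str]:
    """Step 1: call one sample on its own and emit an intermediate GVCF."""
    return [
        "java", "-jar", GATK_JAR,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "--emitRefConfidence", "GVCF",
        "-o", gvcf_out,
    ]


def genotype_gvcfs_cmd(reference: str, gvcfs: List[str], vcf_out: str) -> List[str]:
    """Step 2: jointly genotype all per-sample GVCFs into one VCF."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    cmd += ["-o", vcf_out]
    return cmd


def joint_genotyping_plan(
    reference: str, bams: List[str], vcf_out: str
) -> Tuple[List[List[str]], List[str]]:
    """Return (per-sample commands, joint-calling command) for a cohort."""
    per_sample = []
    gvcfs = []
    for bam in bams:
        # Hypothetical naming scheme: sample.bam -> sample.g.vcf
        gvcf = bam.rsplit(".", 1)[0] + ".g.vcf"
        gvcfs.append(gvcf)
        per_sample.append(haplotypecaller_gvcf_cmd(reference, bam, gvcf))
    return per_sample, genotype_gvcfs_cmd(reference, gvcfs, vcf_out)


if __name__ == "__main__":
    steps, joint = joint_genotyping_plan("ref.fasta", ["s1.bam", "s2.bam"], "cohort.vcf")
    for cmd in steps + [joint]:
        print(" ".join(cmd))
```

The point of the split is that step 1 can be run independently for each sample, while step 2 is the only stage that needs every sample's GVCF at once.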

We have shared some of our pipeline scripts (written in WDL) to serve as a reference implementation of our current Best Practices. You might find these scripts useful for updating the CWL implementation. See here for an implementation of the first part of the pipeline, which covers data pre-processing through generating the intermediate GVCF. Here you can find the second part of the pipeline, which jointly calls those GVCFs, then performs variant filtering and genotype refinement. We also have some accompanying documentation which can be found here. This documentation describes the main implementation decisions for the pipeline script of the same name.

Additionally, did you know that you can use GATK's tool-specific JSON files to generate your scripts? Using that, I was able to auto-generate these WDL tasks, which are easy to keep up-to-date with the latest GATK version. Let me know if you have any questions regarding the pipeline implementations I have referenced. We are very keen to help make GATK pipelines available to all users, regardless of the pipelining language they choose to work with.

mr-c commented 7 years ago

Thank you @knoblett for the update. All contributions to this repository are done on a voluntary basis.

We will gladly review pull requests to implement your ideas.

tfmorris commented 7 years ago

Best guess as to who authored the pseudonymous issue - Katharine Noblett: https://www.linkedin.com/in/katharine-noblett-30765a97/

Best guess as to the tool which could be tweaked to autogenerate industry-standard CWL from GATK JSON instead of proprietary WDL: https://github.com/broadinstitute/wdl/blob/develop/scripts/wrappers/gatk/gatkToWdlWrapper.py

@knoblett clearly has the knowledge and skill to improve the GATK example pipeline. Looking forward to the PR.

mr-c commented 7 years ago

Hello @knoblett, thank you for your enthusiasm, though it seems a bit aggressive.

Reviewing the situation, I am reminded that GATK (the software) isn't freely redistributable and is therefore untestable using public infrastructure. As every other piece of software described in this repo is freely available, I suggest that we move the GATK/CWL descriptions to their own repo, where those who are interested in maintaining them can do so.

After the reorganization, the new README will be more prominent, which should provide clarity about the status of the workflow description.

@briandoconnor @yassineS @hocinebendou What do you think?

tetron commented 7 years ago

Also, there's https://github.com/common-workflow-language/wdl2cwl (ping @anton-khodak), which may be able to autoconvert the GATK best practices WDL.

mr-c commented 7 years ago

@knoblett the GATK workflows and tools are being moved to https://github.com/h3abionet/h3agatk/tree/master/workflows/GATK

jrandall commented 6 years ago

Posting here because this may be of interest to people who find this issue while searching for GATK CWL CommandLineTool definitions. We have written a conversion tool that generates CWL directly from the GATK (json) documentation: https://github.com/wtsi-hgi/gatk-cwl-generator

The releases from the same repo (https://github.com/wtsi-hgi/gatk-cwl-generator/releases) include CWL CommandLineTool definitions for all documented GATK commands for GATK versions 3.5, 3.6, 3.7, and 3.8 (as well as preview CWL for GATK 4 beta).
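For readers wondering what generating CWL from JSON documentation can look like in principle, here is a minimal sketch. It is not the implementation used by gatk-cwl-generator: the JSON layout (`name`, `summary`, and an `arguments` list with `name`, `type`, `required`, `summary`), the type mapping, and the `baseCommand` are all assumptions made for illustration.

```python
import json
from typing import Any, Dict

# Assumed mapping from type names in the JSON docs to CWL types;
# anything unrecognized falls back to "string".
TYPE_MAP = {
    "String": "string",
    "int": "int",
    "Integer": "int",
    "double": "double",
    "boolean": "boolean",
    "File": "File",
}


def json_to_cwl_tool(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one (hypothetical) tool description into a CWL CommandLineTool dict."""
    inputs = {}
    for arg in doc.get("arguments", []):
        flag = arg["name"]              # e.g. "--emitRefConfidence"
        input_id = flag.lstrip("-")     # CWL input identifier without dashes
        cwl_type = TYPE_MAP.get(arg.get("type", "String"), "string")
        if arg.get("required") != "yes":
            cwl_type += "?"             # CWL shorthand for an optional input
        inputs[input_id] = {
            "type": cwl_type,
            "doc": arg.get("summary", ""),
            "inputBinding": {"prefix": flag},
        }
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        # Assumption: GATK 3.x style invocation with a local jar.
        "baseCommand": ["java", "-jar", "GenomeAnalysisTK.jar", "-T", doc["name"]],
        "doc": doc.get("summary", ""),
        "inputs": inputs,
        "outputs": {},  # output detection is out of scope for this sketch
    }


if __name__ == "__main__":
    example = {
        "name": "HaplotypeCaller",
        "summary": "Call germline SNPs and indels",
        "arguments": [
            {"name": "--input_file", "type": "File", "required": "yes",
             "summary": "Input BAM"},
            {"name": "--emitRefConfidence", "type": "String", "required": "no",
             "summary": "Reference confidence mode"},
        ],
    }
    # CWL documents may be written as JSON, since JSON is valid YAML.
    print(json.dumps(json_to_cwl_tool(example), indent=2))
```

A real generator also has to handle outputs, list-valued arguments, and enum types, which this sketch deliberately leaves out.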

mr-c commented 6 years ago

@jrandall This is really cool, thank you for sharing! You are very welcome to announce this on the CWL mailing list, or I can do so with your permission.