broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Port IndelRealignment pipeline #3104

Open magicDGS opened 7 years ago

magicDGS commented 7 years ago

After the discussion in #3084, I volunteer to port the indel realignment pipeline. After exploring the GATK3 implementation, I will split the port into the following independent tasks:

The previous port will be integrated in the IndelRealigner tool implementation.

magicDGS commented 7 years ago

Can you provide me with some test data to include in the tools' integration tests, @vdauwera and/or @sooheelee? If not, I will try to use some BAM files already in the repository...

lbergelson commented 7 years ago

It's not clear to me that we want these tools in GATK4. We deliberately didn't port them because we felt they were unnecessary going forward.

I understand that there are some legitimate use cases that require them, e.g. low-coverage naive variant calling from high-ploidy pools, which HaplotypeCaller would do poorly on. (Also, do we know that HaplotypeCaller doesn't do well on those sorts of things? Maybe we should consider modifications there if it doesn't.) I'm not sure that supporting that use case is worth the added complexity of maintaining and supporting these tools, especially since we don't provide a pileup-based variant caller as part of GATK4...

@vdauwera Can you comment?

@sooheelee I'm not sure I agree with you that supporting this for Mutect 1 is useful. A) We don't want to support the use of Mutect 1 anymore and would like to encourage people to switch to Mutect 2, which I think we now believe is a better variant caller for both SNPs and indels.
B) Mutect 1 users are already using GATK3, so they have access to these tools already. Mutect 1 also requires co-cleaning, which I believe is a different but related tool to indel realignment.

For the variant review issue, we have thoughts on implementing a much better solution for variant review by creating an assembly plugin for IGV.

magicDGS commented 7 years ago

After @vdauwera's last comment on the blog post about the removal, I thought it would be possible to port it here as a contribution (without too much effort from your dev team). If the final decision is that this is not going to be maintained in GATK4, I would port this code to my own software if you give me permission; but it is definitely something that the community is interested in.

For example, I'm working with Pool-Seq data with hundreds of individuals pooled together, so HaplotypeCaller is not an option in our case. I'm actually evaluating other approaches for realignment, such as ABRA or SRMA. I'm even thinking of implementing a new realigner based on GATK's assembler engine and its PairHMM; but this requires more time for evaluation, and it would be nice to be able to compare with the current indel realignment pipeline. Anyway, I can close the issues and PRs in the gatk repo and port them to my toolkit (ReadTools), to maintain the code for the community.

lbergelson commented 7 years ago

I asked around here and it seems like people think that it would be useful to have. If you're willing to do the work of porting we'll incorporate it. 👍

sooheelee commented 7 years ago

@magicDGS I provide some example data for a tutorial at https://gatkforums.broadinstitute.org/gatk/discussion/7156/howto-perform-local-realignment-around-indels. Search the page for tutorial_7156.tar.gz. I showcase illustrative sites within the tutorial and also in https://software.broadinstitute.org/gatk/blog?id=7847.

I'm actually new to test data, so what cases are you hoping to test with the data? The snippet in the tutorial data is much larger than you need, so it would be good to narrow down the test case.

serge2016 commented 7 years ago

As I understand it, there are two ways: 1) Update all guides that include tools not ported to GATK4 so users can use GATK4 to get the same results as before. 2) Add all tools from GATK3.6 to GATK4.

Otherwise unofficial forks will appear.

For now, could you please add all tools to GATK4?

magicDGS commented 7 years ago

That example data from the tutorial is good, @sooheelee, but maybe it could be reduced in size to avoid adding it to the large file directory? It would be nice to include that example in the RealignerTargetCreator PR (#3112)...

magicDGS commented 7 years ago

@sooheelee, I came back to the port this week and ran the tutorial that you provided, and the port of RealignerTargetCreator (#3112) gives the same result. Nevertheless, I cannot add the test data to the resources because the reference used is huge (3GB). The 7156_snippet.bam is a good size to include, but it requires the whole reference because some pairs are mapped to other chromosomes. Would it be possible to get another example that is limited to a couple of chromosomes, preferably 20 and 21, because a reference is already provided for those chromosomes? Thanks in advance!

In addition, I realized that the links to the data are broken in the tutorial; fortunately I downloaded it some time ago, but it would be nice if the links could be restored in case I lose my copy.
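
To make the request above concrete, here is a minimal sketch of the kind of subsetting that would let the test data work with a reference containing only chromosomes 20 and 21. It operates on plain-text SAM lines using only the Python standard library; the function names and the contig names `20` and `21` are assumptions for illustration, and a real subset would be produced from the BAM with a dedicated tool rather than this script. Keeping reads whose mate contig is `*` (no mate information) is also an assumption.

```python
from typing import Iterable, Iterator, List, Set


def mate_contig(fields: List[str]) -> str:
    """Return the mate's contig, resolving the SAM '=' shorthand (same as RNAME)."""
    rname, rnext = fields[2], fields[6]
    return rname if rnext == "=" else rnext


def restrict_to_contigs(sam_lines: Iterable[str], contigs: Set[str]) -> Iterator[str]:
    """Yield header and alignment lines that only refer to the given contigs.

    Hypothetical helper: a read is kept only if it maps to an allowed contig
    and its mate either maps to an allowed contig or has no contig ('*').
    """
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Keep header lines, but drop @SQ entries for contigs we exclude.
            if line.startswith("@SQ"):
                tags = dict(tag.split(":", 1) for tag in line.split("\t")[1:])
                if tags.get("SN") not in contigs:
                    continue
            yield line
            continue
        fields = line.split("\t")
        if fields[2] not in contigs:
            continue  # the read itself maps elsewhere (or is unmapped)
        mate = mate_contig(fields)
        if mate != "*" and mate not in contigs:
            continue  # the mate maps to a contig missing from the small reference
        yield line
```

With a filter like this, a pair with one end on chromosome 20 and the other on chromosome 1 is dropped entirely from the chromosome 20 side, so the resulting file no longer needs the full 3GB reference.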

stevendavis commented 6 years ago

Any update on this issue? What is the recommended way to use the RealignerTargetCreator and IndelRealigner in other non-GATK pipelines?

magicDGS commented 6 years ago

No update, sorry. The PRs are pending review, and the data is still not available for proper testing...

stevekm commented 6 years ago

@lbergelson I haven't seen it mentioned, but the biggest issue (for us at least) is licensing. GATK 4 is free for commercial use, while GATK 3 is not. Some of our non-commercial pipelines rely on these GATK 3 tools for processing data for use cases beyond GATK variant callers. Not having them available in GATK 4 means that these pipelines are difficult to move to a commercial setting. If the goal is to move everyone to GATK 4, then dropping support for these tools is counterproductive. I am eagerly awaiting updates on their availability in GATK 4.

stevendavis commented 6 years ago

@stevekm I agree it would be beneficial to have the indel realignment tools in GATK 4. It helps with reproducing results from existing pipelines and resolves any licensing issues.

Having said that, you may want to have a look at the GATK 3 source code. RealignerTargetCreator and IndelRealigner are both in the public subfolder of the gatk-protected repo.

https://github.com/broadgsa/gatk-protected/tree/master/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/indels/RealignerTargetCreator.java

https://github.com/broadgsa/gatk-protected/tree/master/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/indels/IndelRealigner.java

I'm not a legal expert, but the source code for RealignerTargetCreator and IndelRealigner both contain this comment, which looks to me like permission to use them in a commercial setting:

* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
* 
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.

LukeGoodsell commented 6 years ago

Hi @magicDGS, thanks for tackling this. I would also like to be able to use IndelRealigner with GATK4. Where are you at with the porting so far? The last update to the PR is sooheelee providing you with some test data in March.

magicDGS commented 6 years ago

Sorry to everyone interested, but I recently had some deadlines unrelated to software development that took most of my time. Now I will have time to come back to other projects, and I will implement the port and tests with @sooheelee's data this week or next. I hope that works for you.

serge2016 commented 6 years ago

@magicDGS, thanks! We'll be waiting!

magicDGS commented 6 years ago

Sorry, I have had several personal appointments and other things to do over the last few weeks. I will let you know as soon as I can get back to work on IndelRealignment.

bartgrantham commented 5 years ago

Is there any update on this? From what I understand, most non-GATK variant callers (such as bcftools or platypus) could still benefit from this.

Additionally, the documentation for htslib still references GATK's IndelRealigner. If there's no replacement forthcoming, I will open an issue on htslib to have this updated.

Stikus commented 5 years ago

Any updates?

cmnbroad commented 5 years ago

I'm not aware of any activity on this, unless @magicDGS is still pursuing it.

igordot commented 2 years ago

There haven't been any comments here for about 3 years. Have there been any updates in a separate thread or offline? Is there any hope there may be any eventually?

serge2016 commented 2 years ago

+1

socameron commented 12 months ago

Has there been any development on porting RealignerTargetCreator and IndelRealigner? These tools may be helpful for low coverage variant calls.