TrinityCTAT / ctat-mutations

Mutation detection using GATK4 best practices and latest RNA editing filters resources. Works with both Hg38 and Hg19
https://github.com/TrinityCTAT/ctat-mutations

HaplotypeCaller - Intervals option #118

Open gremame opened 1 year ago

gremame commented 1 year ago

Hello Brian, I tried passing GATK HaplotypeCaller's --intervals parameter through the --HC_xtra_args parameter that ctat-mutations provides. The logs show that this parameter is passed correctly to HaplotypeCaller when the program is called:

 gatk --java-options "-Xmx6000m" \
HaplotypeCaller \
-R output/sample/cromwell-executions/ctat_mutations/55901d74-65b6-426f-9636-69930c907f08/call-HaplotypeCallerInterval/shard-1/inputs/-1676360762/ref_genome.fa \
-I /output/sample/cromwell-executions/ctat_mutations/55901d74-65b6-426f-9636-69930c907f08/call-HaplotypeCallerInterval/shard-1/inputs/-1234849600/sample.bqsr.bam \
-O sample.vcf.gz \
-dont-use-soft-clipped-bases --stand-call-conf 20 --recover-dangling-heads true --intervals /input/file3/chr2-208247000-208249000.bed --max-mnp-distance 0 \
-L /output/sample/cromwell-executions/ctat_mutations/55901d74-65b6-426f-9636-69930c907f08/call-HaplotypeCallerInterval/shard-1/inputs/826508261/0001-scattered.interval_list

There it is:

--intervals /input/file3/chr2-208247000-208249000.bed

However, it seems that my call to the interval parameter is being overridden by an additional (I guess internal) use of it:

-L /output/sample/cromwell-executions/ctat_mutations/55901d74-65b6-426f-9636-69930c907f08/call-HaplotypeCallerInterval/shard-1/inputs/826508261/0001-scattered.interval_list

I was wondering whether this interval parameter could become a parameter of the ctat-mutations pipeline, handled in a way that limits the scope of the analysis and reduces runtime. Best regards! David
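Note on the behaviour described above: as far as I know, GATK combines multiple -L/--intervals arguments by union unless --interval-set-rule INTERSECTION is given, which would explain why adding a small region to a shard's interval list does not narrow the shard. The toy sketch below only illustrates the difference between the two rules. The shard coordinates are made up, and the region is inferred from the BED file name (converted to 1-based), so treat both as assumptions.

```shell
# Toy illustration with hypothetical coordinates: one scattered shard and one
# user region on the same contig, each given as "start end" (closed intervals).

# Smallest single span covering both intervals. This equals the union only
# when the two intervals overlap, which is the case illustrated here.
union_span() {
  s=$(( $1 < $3 ? $1 : $3 ))
  e=$(( $2 > $4 ? $2 : $4 ))
  echo "$s $e"
}

# Overlap of the two intervals; prints nothing when they are disjoint.
intersection() {
  s=$(( $1 > $3 ? $1 : $3 ))
  e=$(( $2 < $4 ? $2 : $4 ))
  if [ "$s" -le "$e" ]; then echo "$s $e"; fi
}

# Hypothetical shard from a scattered interval list, then the assumed region.
echo "union:        $(union_span 200000001 242000000 208247001 208249000)"
echo "intersection: $(intersection 200000001 242000000 208247001 208249000)"
```

With the union rule the shard is unchanged, so HaplotypeCaller still processes all of it; only the intersection would restrict calling to the region.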

brianjohnhaas commented 1 year ago

Hi David,

I think this is just HaplotypeCaller internally splitting up the interval list so each thread can run on separate targets, and then it combines the results afterwards. Maybe the issue is that it's not running as multithreaded here... I can look further into that if that's the case.

best,

~brian

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

gremame commented 1 year ago

Hello Brian, I went through the logs again and found that, as a step prior to HaplotypeCaller, the workflow calls another GATK tool, SplitIntervals:

gatk --java-options "-Xmx1500m" \
SplitIntervals \
-R /output/L19-5858/cromwell-executions/ctat_mutations/a29788ae-c656-45d2-859a-6c13c9b65ae1/call-SplitIntervals/inputs/-1676360762/ref_genome.fa \
-scatter 10 \
-O interval-files \

Fortunately, according to its documentation, this tool also accepts the --intervals parameter, so the problem could (hopefully) be resolved easily. If intervals becomes an input to the ctat-mutations pipeline, it can be passed to SplitIntervals. No further modifications should be needed, because the output of SplitIntervals is already passed to HaplotypeCaller. I believe that's all we need to limit the analysis to the regions defined in the intervals file. What do you think? Best regards, David
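The proposal above could be sketched roughly as follows. This is a dry run that only prints the SplitIntervals command, adding -L when an intervals file is supplied. The function name and file names are hypothetical, and this is not the pipeline's actual implementation.

```shell
# Dry-run sketch: print the SplitIntervals command, with -L only when the
# caller supplies an intervals file. Names here are hypothetical.
split_intervals_cmd() {
  ref="$1"
  scatter="$2"
  intervals="${3:-}"
  cmd="gatk SplitIntervals -R $ref"
  # Restrict the scatter to the user's regions when they are provided.
  if [ -n "$intervals" ]; then
    cmd="$cmd -L $intervals"
  fi
  echo "$cmd -scatter $scatter -O interval-files"
}

split_intervals_cmd ref_genome.fa 10 chr2-208247000-208249000.bed
```

Because the scattered interval lists are then derived from the restricted regions, the downstream HaplotypeCaller shards would inherit the restriction without any change to their own -L argument.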

brianjohnhaas commented 1 year ago

Hi David,

I'll look into this shortly and get back to you.

many thanks,

~b

brianjohnhaas commented 1 year ago

Hi David,

I should have a version with the intervals option supported later today. Are you using ctat-mutations via docker or singularity, or through a native installation?

best,

~b

gremame commented 1 year ago

Hello Brian, sounds great, many thanks. I'm using ctat-mutations via docker. Best regards, David

brianjohnhaas commented 1 year ago

sounds good. I'll have an updated docker for you to try shortly.

best,

~b

brianjohnhaas commented 1 year ago

Hi David - can you give this Docker a try?

 trinityctat/ctat_mutations:3.3.0-predev

There's now an --intervals parameter that you can use to supply your interval list, which is passed on to GATK.

best,

~brian

gremame commented 1 year ago

It worked! 😄 I ran the new version using an interval file I created and the FASTQ files provided as an example in the documentation, and I was happy to see that the only variants in the results were those that overlapped the BED file. I then tried the same with one of the samples I'm analyzing: in my first test I used FASTQ files, and in the second I provided a BAM file. In both cases I obtained only variants overlapping the BED file. Thanks a lot for adding this feature to the workflow!
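A check like the one described above can be automated. The sketch below counts VCF records that fall outside every BED interval. The function name, file names and variant positions are made up for illustration, and it assumes an uncompressed, tab-separated VCF and a BED file without track or browser header lines.

```shell
# Count VCF records whose POS falls outside every BED interval.
# BED is 0-based and half-open, VCF POS is 1-based, so position p lies inside
# a BED line (start, end) when start < p <= end.
count_outside() {
  awk -F'\t' '
    NR == FNR { n++; chrom[n] = $1; bstart[n] = $2 + 0; bend[n] = $3 + 0; next }
    /^#/ { next }                      # skip VCF header lines
    {
      inside = 0
      pos = $2 + 0
      for (i = 1; i <= n; i++)
        if ($1 == chrom[i] && pos > bstart[i] && pos <= bend[i]) { inside = 1; break }
      if (!inside) outside++
    }
    END { print outside + 0 }
  ' "$1" "$2"
}

# Small demo with made-up records: one inside the region, one outside it.
tmp="$(mktemp -d)"
printf 'chr2\t208247000\t208249000\n' > "$tmp/regions.bed"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr2\t208248389\t.\tG\tA\nchr2\t208250000\t.\tC\tT\n' > "$tmp/calls.vcf"
count_outside "$tmp/regions.bed" "$tmp/calls.vcf"
rm -rf "$tmp"
```

A result of 0 on real output would mean every called variant overlaps the BED file. A gzipped VCF would need to be decompressed first, for example with gzip -dc.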

brianjohnhaas commented 1 year ago

Great! Thanks for the update. This will go into the next release.
