broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License

Be more discriminating about soft clips in active region determination #1094

Closed davidbenjamin closed 7 years ago

davidbenjamin commented 7 years ago

In active region determination for Mutect, and I believe also HaplotypeCaller, we count soft clips as a potential sign of a variant. This is because the aligner might soft clip the last few bases of a read that follow a deletion rather than call the deletion. For example, if the reference and read are:

TTCCAGAGTGTGTCAC (reference)
TTC________GTCAC (read)

the aligner might choose to soft clip the GTCAC rather than call a deletion of the CAGAGTGT.

In somatic calling it is expensive to call too many active regions, so perhaps we should only count soft-clipped bases (e.g., GTCAC) as evidence of variation if that k-mer appears downstream in the reference.
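
A minimal sketch of that check, not taken from the GATK code base. The function name `implied_deletion_length` and the `max_distance` search window are hypothetical, and exact string matching is a simplifying assumption (real soft-clipped bases may carry sequencing errors):

```python
from typing import Optional


def implied_deletion_length(reference: str,
                            clip_start: int,
                            clipped_bases: str,
                            max_distance: int = 100) -> Optional[int]:
    """Return the deletion length implied by a soft clip, or None.

    reference     -- reference bases for the region (0-based offsets)
    clip_start    -- reference offset where the soft clip begins
    clipped_bases -- the soft-clipped read bases
    max_distance  -- hypothetical cap on how far downstream to search
    """
    if not clipped_bases:
        return None
    # Search strictly downstream of the clip position, within the window.
    search_from = clip_start + 1
    search_to = min(len(reference),
                    clip_start + max_distance + len(clipped_bases))
    hit = reference.find(clipped_bases, search_from, search_to)
    if hit == -1:
        return None  # k-mer not found downstream: do not count the clip
    return hit - clip_start
```

For the example above, the read aligns through TTC, so the clip starts at reference offset 3; `implied_deletion_length("TTCCAGAGTGTGTCAC", 3, "GTCAC")` returns 8, the length of CAGAGTGT.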

@fleharty is this understanding of soft-clips being possible deletions (but not insertions or SNVs) correct?

fleharty commented 7 years ago

@davidbenjamin

I certainly agree with you that soft-clips can be due to deletions. It's not at all clear to me that they wouldn't happen with an insertion. Consider:

---ATGAACAGATATAACAGAT (reference)
---ATGAA(AGGTAA)CAGATATAACAGAT (read)

I don't see why a soft clip might not show up on this read after ATGAA. I'm not really sure I understand why some things are soft-clipped to be honest. I've seen plenty of things that were soft-clipped, but appear to match the reference perfectly (maybe I'm remembering this incorrectly).

I suspect that soft-clips are hardly ever correctly associated with SNVs though.
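
To make the insertion case concrete, here is a rough sketch under the assumption that the aligner soft clips everything after ATGAA. The clipped bases would then be the inserted AGGTAA followed by sequence that matches the reference at the clip position itself. The function name `implied_insertion_length` and the `min_match` threshold are hypothetical, and exact matching is a simplification:

```python
from typing import Optional


def implied_insertion_length(reference: str,
                             clip_start: int,
                             clipped_bases: str,
                             min_match: int = 8) -> Optional[int]:
    """Return the insertion length implied by a soft clip, or None.

    Looks for the shortest prefix of the clipped bases that, once skipped,
    leaves a suffix of at least min_match bases matching the reference
    exactly at clip_start.
    """
    for inserted in range(1, len(clipped_bases) - min_match + 1):
        suffix = clipped_bases[inserted:]
        # str.startswith with an offset compares against reference[clip_start:]
        if reference.startswith(suffix, clip_start):
            return inserted
    return None
```

With the reference ATGAACAGATATAACAGAT, a clip starting at offset 5, and clipped bases AGGTAACAGATATAACAGAT, this returns 6, the length of AGGTAA. A clip like this would not be found by searching for the whole clipped k-mer downstream in the reference.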

davidbenjamin commented 7 years ago

@fleharty Thanks for the input!

ldgauthier commented 7 years ago

You'll also likely see a difference in behavior for exomes vs genomes because exomes (still) use bwa-aln and genomes use bwa-mem. I don't remember the details off the top of my head, but I looked into it for Monkol at one point when he saw what he thought was a suspicious proportion of soft clips in his data.


Laura Doyle Gauthier, PhD Computational Biologist Data Sciences and Data Engineering Broad Institute of MIT and Harvard gauthier@broadinstitute.org


droazen commented 7 years ago

Issue moved to broadinstitute/gatk #3014 via ZenHub