broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 588 forks source link

SplitNCigarReads.java - retain original read name somewhere for split parts of reads so we know where it derived from. #8703

Closed brianjohnhaas closed 7 months ago

brianjohnhaas commented 7 months ago

Feature request

Tool(s) or class(es) involved

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/SplitNCigarReads.java

Description

Modification requested.

When splitting a read based on the spliced RNA cigar (N cigar symbol), the new split read fragment that gets incorporated into the bam file has a new 'HC' identifier for the read name, and the original read name is lost. There are cases where we really need to access the original read name. Would it be possible to incorporate the original read name somewhere into the read alignment record, perhaps as a custom SAM tag or attribute, or as a prefix or suffix to the HC-identifier in the read name?

gokalpcelik commented 7 months ago

Hi @brianjohnhaas Are you sure about the HC identifier as it is actually placed into the bamout of HaplotypeCaller or Mutect2. SplitNCigarReads only maps those parts of the split read as secondary with the same name to be in check with SAM spec. Can you send us an example from IGV view?

brianjohnhaas commented 7 months ago

Thanks for the quick response! Here's an example:

image

It's from the bam out option of HapotypeCaller. I guess I was pointing to the wrong code after all. Maybe it's HC that would need to retain the original read name somehow?

gokalpcelik commented 7 months ago

Those HC tagged reads are artificially created by HaplotypeCaller Local Reassembly engine therefore there is really no connection to a direct read from the original bam but probably a connection to a whole bunch of supporting kmers and a debruijin graph that it belongs to. If you wish to keep variant supporting reads in a haplotype you may need to tag your reads using tools such as whatshap.

https://whatshap.readthedocs.io/en/latest/

brianjohnhaas commented 7 months ago

Oh! thanks for the info. What I'm trying to do is to figure out which reads are providing evidence for variants that are reported by HC. There are some cases where I can't find the exact reads that support the variant of interest, and I thought I might be able to get them from the HC bamout file. This is relevant for single cell data where the individual reads have info about cell barcodes so we can figure out which cells harbor the variant. I'm not sure there's a good way to do this when the HC reassembly defines the variant.

On Tue, Feb 27, 2024 at 12:16 PM Gökalp Çelik @.***> wrote:

Those HC tagged reads are artificially created by HaplotypeCaller Local Reassembly engine therefore there is really no connection to a direct read from the original bam but probably a connection to a whole bunch of supporting kmers and a debruijin graph that it belongs to.

— Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/8703#issuecomment-1967219401, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX652DFCCCXO6KGTO4TYVYIH7AVCNFSM6AAAAABD4OZKJ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRXGIYTSNBQGE . You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

gokalpcelik commented 7 months ago

Since each cell has a barcode wouldn't it be nice to use them as their Read Group ID and Sample Name within the BAM so that variant callers will distinguish each cell from their Sample Name and produce a multisample VCF for that variant site. Once IDs and Sample Names are split per cell you may be able to color them differently in IGV to even visually observe those events.

brianjohnhaas commented 7 months ago

In this context, for a given mutation, there might be a hundred or so reads and each cell is only contributing one to three reads. For other mutations, maybe there's less than 10 reads corresponding to less than 10 cells, and it can vary pretty dramatically. The total number of cells represented in a single sample can be thousands to tens of thousands, usually - but could be many more as the tech advances.

My hack for it at the moment is to encode both the cell barcode and the UMI information into the read name. Then, for each variant, I query the reads that overlap that variant in the bam file and analyze each read for supporting the variant or the REF allele - then I can count the reads according to the specific cells and also deal with any UMI redundancy per cell. This works pretty well except for the cases where the HC reassembly provides evidence for the variant and I can't track it to the originally aligned reads. Also, mostly I think the difficulty here relates to indels around homopolymers with our pacbio long isoform reads in our rna-seq variant pipeline that leverages the gatk rna-seq protocol with HC.

On Thu, Feb 29, 2024 at 8:58 AM Gökalp Çelik @.***> wrote:

Since each cell has a barcode wouldn't it be nice to use them as their Read Group ID and Sample Name within the BAM so that variant callers will distinguish each cell from their Sample Name and produce a multisample VCF for that variant site. Once IDs and Sample Names are split per cell you may be able to color them differently in IGV to even visually observe those events.

— Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/8703#issuecomment-1971203108, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX6LYHUXDUMGDU3AIFLYV4ZZLAVCNFSM6AAAAABD4OZKJ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNZRGIYDGMJQHA . You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

gokalpcelik commented 7 months ago

For that purpose I would still suggest using Sample Name IDs and Read Group IDs along with Mutect2 to call variants therefore contribution of each cell to a mutation will be quantified in terms of Allele Fractions. Especially if you also disable downsampling it will be quite the data to analyze.

brianjohnhaas commented 7 months ago

Sounds good! After I get this current version of the pipeline released, I'll explore this further. Maybe we could follow-up on slack, or meet up at Broad at some point if you're on-site.

many thanks!

Brian

On Thu, Feb 29, 2024 at 9:32 AM Gökalp Çelik @.***> wrote:

For that purpose I would still suggest using Sample Name IDs and Read Group IDs along with Mutect2 to call variants therefore contribution of each cell to a mutation will be quantified in terms of Allele Fractions. Especially if you also disable downsampling it will be quite the data to analyze.

— Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/8703#issuecomment-1971271149, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX37ASEZBUTV2HCSR4TYV45W5AVCNFSM6AAAAABD4OZKJ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNZRGI3TCMJUHE . You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

jamesemery commented 7 months ago

Hey @brianjohnhaas. I'm not quite sure i fully understand what we can change to help you here. However there is a feature you might not be aware of in the bamout that can help you figure out which reads go with what haplotype. In our bamout we assign a tag (that for whatever reason it looks like IGV hides by default) called the XA tag. If you look at an IGV bamout and color by that tag you can see what reads were grouped by what haplotypes. If a read has no XA tag that means it was non-informative about any one haplotype over a second possible contender and thus it was not strong evidence one way or another. Given that in reality when genotyping we are taking the net likelihoods of all reads vs all haplotypes its somewhat of a simplification to say "this read comes from that haplotype" as often reads can be evidence for a number of haplotypes that might get called in aggregate but this simplification works in simple cases.

Below is a screenshot of what it looks like to do this in a very simple case. Hopefully this answers your question? I would be happy to go deeper into this if you would like.

Screenshot 2024-03-08 at 2 35 42 PM

brianjohnhaas commented 7 months ago

Thanks, James! I'll check that out - I think we might be able to leverage that.

best,

Brian

On Fri, Mar 8, 2024 at 2:38 PM jamesemery @.***> wrote:

Hey @brianjohnhaas https://github.com/brianjohnhaas. I'm not quite sure i fully understand what we can change to help you here. However there is a feature you might not be aware of in the bamout that can help you figure out which reads go with what haplotype. In our bamout we assign a tag (that for whatever reason it looks like IGV hides by default) called the XA tag. If you look at an IGV bamout and color by that tag you can see what reads were grouped by what haplotypes. If a read has no XA tag that means it was non-informative about any one haplotype over a second possible contender and thus it was not strong evidence one way or another.

Below is a screenshot of what it looks like to do this in a very simple case. Hopefully this answers your question? I would be happy to go deeper into this if you would like.

Screenshot.2024-03-08.at.2.35.42.PM.png (view on web) https://github.com/broadinstitute/gatk/assets/16102845/7d11ff5f-418e-4826-89fc-07535648a71f

— Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/8703#issuecomment-1986302994, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX773HENZD2UVB356GDYXIHTTAVCNFSM6AAAAABD4OZKJ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBWGMYDEOJZGQ . You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas