PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
258 stars 71 forks source link

Retrieving supporting reads from original BAM #630

Closed blex-max closed 1 year ago

blex-max commented 1 year ago

Hello,

We are integrating GRIDSS into an analysis pipeline and we would like to retrieve the genomic positions of the supporting reads of specific variants. We currently query the assemblies and then the reads by QNAME, but that is very inefficient, as it does not take advantage of the positional index of BAM's, and indexing all reads by QNAME if very time-consuming. Do you have a recommendation on how to best retrieve such reads?

Thanks for your work!

d-cameron commented 1 year ago

I use either the input.bam.gridss.working/input.bam.sv.bam (this contains the subset of the bam that gridss considers as potentially providing sv support), or use samtools view -N to subset by read name (NB: -N can be combined with positional filtering).

The assembly.bam.gridss.working/assembly.bam.sv.bam file is also useful to see the gridss assembly support (the assembly.bam file contains the contigs aligned to one side but the .sv.bam contains the full assembly split-read alignments.

d-cameron commented 1 year ago

Note that all support read names are not necessarily in the assembly bam - only those that provide assembly support. These don't usually differ by that much (RP vs ASRP & SR+BSC vs ASSR) but can differ quite substantially if assembly is unreliable (usually due to low coverage).

d-cameron commented 1 year ago

If you want all supporting reads then run gridss with the following configuration line and it'll add read name support to the vcf:

variantcalling.includeSupportingReadNames=true

d-cameron commented 1 year ago

The fast way of using samtools view -N and positional filtering to either side of the breakpoint (e.g. chr1:10000-chr2:20000 would use samtools view -N readnames.txt chr1:9000-11000 chr2:19000-21000) will get you most of the reads but won't get you some of the mates that were recruited via assembly.

For example if R1 has a soft clip alignment to chr1:9995 then gridss will try to assembly both R1 and R2 of that fragment regardless of where R2 aligns. If R2 successfully assembles then gridss will consider that read as ASRP support (but not direct RP support since R2 doesnt align to the chr2 side of the breakpoint so doesn't provide direct read pair support). If you want to get every read gridss co isder supporting the you need to read the whole bam file (or all of input.bsm.sv.bam). Gridss avoids this traversal by name-sorting the input.bam.sv.bam file, populating the R2 SAM tag, then position sorting it. This allows fast assembly of mate reads regardless of their alignment location.

The above scenario occurs around MEs. E.g. reads may align to multiple sine/line elements but when assembled together it becomes unambiguous as to which sine the breakpoint is on thus is callable by gridss with strong assembly(/assembly only) support.

The final difference is that the grids preprocessing step, as well as populating the R2 tag, realigns (using bwa mem) the unmapped bases of soft clipped reads and tries to turn them into split read alignments. This causes the gridss SR to be higher than one would this possible based on just the input.bam. This feature was added to support bowtie2 aligned bams (bowtie2 doesnt report aplit read alignments). Interestingly, even though bwa mem reports split reads, feeding the soft clipped portion of non-split read alignment back to bwa mem will identify split reads that bwa didn't initially report. These split read alignments are I. The input.bam.sv.bam so that's the best file to look at if you want to know what gridss is 'seeing' when calling variants.