GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
225 stars 30 forks source link

spitters : I don't get understand the code :-) #33

Closed lindenb closed 6 years ago

lindenb commented 6 years ago

cross posted on biostars : https://www.biostars.org/p/292925/

Hi, Could you please explain your code here : https://github.com/GregoryFaust/samblaster/blob/master/samblaster.cpp#L1125 (with a figure ?)

i don't really understand what is the logic when handling the clipped reads and how you're handling the INS or the DELS. As far as I understand the code, the start/end of a read can be inverted if the read is on '+' or '-' strand (?) and it confuses me.

Thank you ! :-)

GregoryFaust commented 6 years ago

I am writing this as a more general response about how splitters are chosen according to the input parameters to samblaster so that I can eventually include it in the documentation for the tool.

We wish to identify Structural Variant (SV) breakpoints by finding adjacent alignments for single reads with certain characteristics. These "split-read" alignments can identify potential SV breakpoints with more accuracy than using discordant read pairs, but we need to use some caution in choosing which adjacent alignments to consider in order to reduce the number of false positive calls.

If the adjacent alignments are on separate chromosomes, or the reads are aligned to different strands, they can be part of many types of SV events, and they need only meet the --minNonOverlap requirements to be included in the --splitterFile output.

However, if they are on the same chrom and strand, as checked in line 1119, then they represent an insertion, or more likely a deletion. For clarity of explanation, I have included a diagram below of a +/+ (positive strand) deletion scenario. In the diagram, we show the plot of aligned base positions in the query and in the reference. When a string of base pairs match (or occasionally mismatch so long as they are an M CIGAR operation), the query and reference offset will go up by one for each, and the plot will form a line at a 45-degree angle. This explains the use of the term diagonal in describing alignments.

For insertions/deletions, we wish to make sure that the SV will be of at least a certain size as determined by the --minIndelSize parameter to samblaster which defaults to 50. One might think the easiest way to measure this would be to just calculate the gap in the reference offsets between the two alignments. But this gives inaccurate results because one or both alignments will not always start/end exactly at the hypothetical breakpoint as indicated by the gray portions of the alignments in the diagram. This is caused by two factors.

  1. First, the alignments can be truncated/clipped due to small sequencing errors near the end of the alignment. Using common default mismatch penalties of 3 to 4, the last 4 to 5 base pairs can be clipped by a single mismatched base. Default penalties for single base pair indels are typically around 7, meaning one of these near the end of an alignment can clip as many as 8 bases.
  2. Second, the alignments can overrun the breakpoint if there is homology of the deleted region and the region just before and/or after the deletion.

Note that neither clipping or overrun near the breakpoint affect the diagonal of an alignment. That is why we measure the size of the potential SV by using the distance between the two diagonals of the adjacent alignments. But where on the alignments should we measure the diagonal? As you can see in the diagram, the indel in the middle of the left alignment changes the diagonal at the start of the alignment relative to the diagonal of the end of the alignment. That is why we measure the diagonals at the breakpoint. Therefore, in this +/+ case, we measure the size of the SV by comparing the diagonal of the end of the left alignment with the diagonal at the start of the right alignment (as in lines 1133-1135).

To understand -/- case, it is helpful to read the relevant sentence from the SAM format specification:

Bit 0x10 indicates whether SEQ has been reverse complemented and QUAL reversed.

In terms of the diagram below, this means that as we increase the index into the reverse strand alignment, we are actually moving backward in offset on both the original FASTQ query and the reference sequence. So, for a -/- scenario, the diagram below looks identical except that each alignment starts in the upper right of its current location and continues towards the lower left in the diagram where the arrowheads now belong. Therefore, to measure the diagonals at the breakpoint, we need to take the diagonal at the start of the left alignment (farthest from the arrowhead), and the end of the right alignment (as in lines 1127-1129).

Finally, even though we have allowed alignments with small clipped ends while measuring the size of the SV using diagonals, we have found in practice that alignments with large clipped regions near the breakpoint also lead to false positives. So, we added another parameter, --maxUnmappedBases, to exclude such pairs if desired.

splitters