GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
224 stars 30 forks source link

split reads detection parameters #34

Closed alantsangmb closed 6 years ago

alantsangmb commented 6 years ago

Hi, I found samblaster very useful to detect the discordant and split reads. However, I am not quite understand the meaning of parameters.

I would like to know which parameters I should modify if I want to detect overlapping reads with short split like the following representation:

The read length are 300 bp and they are paired end. Assuming 290bp is mapped on chr1 and the remaining 10bp is mapped on chr2. I found they cannot be detected using default parameteres. How can I detect them? Thanks.

------------chr1------|-----chr2-------------

             [====:==>
             <====:===]                     both reads split  

       [=======>
             <====:===]                     read 2 split but overlapping

             [====:==>
                    <========]             read 1 split but overlapping
GregoryFaust commented 6 years ago

When detecting splitters, each read of a paired-end pair is considered separately. The response to issue https://github.com/GregoryFaust/samblaster/issues/33 includes a diagram that explains most parameters that samblaster uses to detect splitters. However, as in your case, when the two adjacent fragments from a read map to different chromosomes, the X axis (reference coordinates) portion of that diagram will not be an issue, and the two fragments will only need to meet the --minNonOverlap criteria.

This criteria requires that each of the two adjacent read fragments have at least --minNonOverlap bases that do NOT overlap the other adjacent read fragment in query offsets. In the picture below, we have laid the Y-axis on its side to examine how the fragments relate to each other along the length of the query. Note that in this analysis, we are only concerned with how the fragments cover the query, and not where they align to the reference or strand. Fragment1 and fragment2 overlap each in the region labelled B. To be considered a split-read, both region A and region C must both be at least --minNonOverlap bases long. This is required even if the fragments do not overlap at all (region B does not exist). This implies that each of the fragments must be at least --minNonOverlap base pairs long.

                             A                          B                     C
fragment1:  |---------------------------------|----------------------|
fragment2:                                    |----------------------|------------------|
_________________________________________________________________________________________
            0                         Query Offset                                  100 -->

So, in your case, if your read is split in such a way that one of the fragments is only 10 base pairs long, then you must set --minNonOverlap to at most 10 to have that read considered as a valid splitter.