GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License

How does samblaster define a split read? #56

Closed WZo0o closed 1 year ago

WZo0o commented 1 year ago

Dear @GregoryFaust

Samblaster is a great tool for identifying split reads and discordant read pairs. How does samblaster define a split read? In addition, must a split read be flagged with 2048? Could you help me?

Kind regards, Zheng Wang

GregoryFaust commented 1 year ago

There have been several questions about split-read alignments recently, so I have copied the relevant portions of the README.md file below. In short: a primary alignment (one with neither FLAG 0x100 nor 0x800 set) is always considered a potential member of a split-read pair, as is any supplementary alignment marked with FLAG 0x800 (2048 in base 10), so not every alignment in a split-read pair carries flag 2048. Alignments marked with FLAG 0x100 (256 in base 10) are also considered if the -M option is used. To be output as a split read, the pair must additionally satisfy all the criteria listed under SPLIT READ IDENTIFICATION below.

I hope this answers your question. If not, please ask a more specific question or include more information.

ALIGNMENT TYPE DEFINITIONS: Below, we will use the following definitions for alignment types. Starting with samblaster release 0.1.22, these definitions are affected by the use of the -M option. By default, samblaster will use the current definitions of alignment types as specified in the SAM Specification. Namely, alignments marked with FLAG 0x100 are considered secondary, while those marked with FLAG 0x800 are considered supplementary. If the -M option is specified, alignments marked with either FLAG 0x100 or 0x800 are considered supplementary, and no alignments are considered secondary. A primary alignment is always one that is neither secondary nor supplementary. Only primary and supplementary alignments are used to find chimeric (split-read) mappings. The -M flag is used for backward compatibility with older SAM/BAM files in which "chimeric" alignments were marked with FLAG 0x100, and should also be used with output from more recent runs of bwa mem using its -M option.
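To make the definitions above concrete, here is a minimal Python sketch of how a FLAG value maps to an alignment type with and without -M. It is only an illustration of the rules as described, not samblaster's actual code; `classify_alignment` and `is_split_candidate` are hypothetical helper names.

```python
# Illustrative only: these helpers are hypothetical, not part of samblaster.
SECONDARY = 0x100      # SAM FLAG bit 256
SUPPLEMENTARY = 0x800  # SAM FLAG bit 2048


def classify_alignment(flag: int, m_option: bool = False) -> str:
    """Return 'primary', 'secondary', or 'supplementary' for a SAM FLAG value."""
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        # With -M, alignments marked 0x100 are treated as supplementary,
        # so no alignment is ever classified as secondary.
        return "supplementary" if m_option else "secondary"
    return "primary"


def is_split_candidate(flag: int, m_option: bool = False) -> bool:
    """Only primary and supplementary alignments are used to find split reads."""
    return classify_alignment(flag, m_option) != "secondary"
```

For example, a FLAG of 2048 is a split-read candidate either way, a FLAG of 0 (primary) is also a candidate, and a FLAG of 256 is a candidate only when `m_option` is true.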

SPLIT READ IDENTIFICATION: Split-read alignments are derived from a single read when one portion of the read aligns to a different region of the reference genome than another portion of the read. Such pairs of alignments often define a structural variant (SV) breakpoint, and are therefore useful input to SV detection algorithms such as LUMPY. samblaster uses the following strategy to identify split-read alignments.

  1. Identify reads that have between two and --maxSplitCount primary and supplementary alignments.
  2. Sort these alignments by their strand-normalized position along the read.
  3. Two alignments are output as splitters if they are adjacent on the read, each covers at least --minNonOverlap base pairs of the read that the other does not, and one of the following holds:
    • the two alignments map to different reference sequences and/or strands.
    • the two alignments map to the same sequence and strand, represent an SV that is at least --minIndelSize in length, and have at most --maxUnmappedBases unaligned base pairs between them.
  4. Split read alignments that are part of a duplicate read will be output unless the -e option is used.
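
Steps 1 to 3 above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not samblaster's implementation: the `Aln` record and `find_splitters` function are hypothetical, read coordinates are assumed to be already strand-normalized, the SV size is approximated as the difference between the gap on the reference and the gap on the read, and the default parameter values are placeholders (check `samblaster -h` for the real defaults).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Aln:
    """Hypothetical, simplified alignment record (not samblaster's own type)."""
    ref: str         # reference sequence name
    strand: str      # '+' or '-'
    ref_start: int   # 0-based start on the reference
    ref_end: int     # exclusive end on the reference
    read_start: int  # strand-normalized start on the read
    read_end: int    # strand-normalized exclusive end on the read


def find_splitters(alns: List[Aln],
                   max_split_count: int = 2,      # placeholder value
                   min_non_overlap: int = 20,     # placeholder value
                   min_indel_size: int = 50,      # placeholder value
                   max_unmapped_bases: int = 50,  # placeholder value
                   ) -> List[Tuple[Aln, Aln]]:
    # Step 1: need between two and max_split_count primary/supplementary alignments.
    if not 2 <= len(alns) <= max_split_count:
        return []
    # Step 2: order the alignments by their position along the read.
    ordered = sorted(alns, key=lambda a: a.read_start)
    pairs = []
    # Step 3: test each pair of alignments that are adjacent on the read.
    for a, b in zip(ordered, ordered[1:]):
        overlap = max(0, min(a.read_end, b.read_end) - b.read_start)
        a_only = (a.read_end - a.read_start) - overlap
        b_only = (b.read_end - b.read_start) - overlap
        if min(a_only, b_only) < min_non_overlap:
            continue
        if a.ref != b.ref or a.strand != b.strand:
            pairs.append((a, b))
            continue
        # Same sequence and strand: apply the size and gap criteria.
        # Assumption: SV size ~ |gap on reference - gap on read|.
        read_gap = b.read_start - a.read_end
        ref_gap = max(a.ref_start, b.ref_start) - min(a.ref_end, b.ref_end)
        sv_size = abs(ref_gap - read_gap)
        if sv_size >= min_indel_size and max(0, read_gap) <= max_unmapped_bases:
            pairs.append((a, b))
    return pairs
```

With this sketch, a 100 bp read whose two halves align 500 bp apart on the same chromosome and strand yields one splitter pair, while the same read with only a 10 bp reference gap yields none.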
WZo0o commented 1 year ago

Thanks for your reply