BenLangmead / bowtie2

A fast and sensitive gapped read aligner
GNU General Public License v3.0
678 stars 159 forks source link

Paired local alignments have insert size 0 and are within --maxins #346

Open ijhoskins opened 3 years ago

ijhoskins commented 3 years ago

I locally aligned 2x150 bp reads and found paired reads that have insert size 0; however, both reads are mapped and the insert sizes are under the --maxins value.

Here is test data with reference: insert_size0.tar.gz

(base) Ians-MacBook-Pro:samtools_stats ianhoskins$ bowtie2 --version /usr/local/bin/bowtie2-align-s version 2.3.5.1 64-bit Built on Wed Apr 17 02:38:57 UTC 2019 Compiler: InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin Options: -O3 -m64 -msse2 -funroll-loops -g3 -std=c++98 -mmacosx-version-min=10.9 -DPOPCNT_CAPABILITY -DWITH_TBB -DNO_SPINLOCK -DWITH_QUEUELOCK=1 Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

ijhoskins commented 3 years ago

Upon further investigation, insert size=0 is output when including the --no-discordant flag. Is this set when the mate information is not available? With the exception of the few reads with unmapped mates, it is not clear why the other reads- which align as expected and within maxins- would be considered discordant.

ijhoskins commented 3 years ago

Given https://github.com/BenLangmead/bowtie2/issues/344 I suspect there is some interaction between the --no-discordant option and insert size calculation.

BenLangmead commented 3 years ago

I'm unsure where --no-discordant enters into this particular issue. When I run those reads through using:

bowtie2 -x CBS_pEZY3.fa -1 test_insert0.R1.fq -2 test_insert0.R2.fq

I get mostly concordant alignments, all with normal-looking TLEN field, but I also see a couple cases where one end failed to align. For those, TLEN is 0. You can tell which these are because they have the YT:Z:UP extra flag. If I additionally use the --no-dovetail option I use the same thing. If I use the --no-contain option, I see several more reads that fail to align concordantly because this dataset happens to contain many instances of one end "containing" the other (with respect to their alignment intervals).

ijhoskins commented 3 years ago

@BenLangmead thanks for taking a look. If I run without the --no-discordant option I get normal TLEN fields. It is only with this option when I get pairs that have size 0.

The above is true when I provide the following other flags: --maxins 1000 --no-discordant --fr --rf --mp 4 --rdg 6,4 --rfg 6,4

However, the --no-discordant option behaves as expected when running under default conditions (no other options passed). Furthermore, when running under default many of the pairs that were discordant (YT:Z:DP) become concordant (YT:Z:CP).

Is it valid that I pass both --fr and --rf? For my dataset I supply these flags as R1 can align to either the forward or reverse strand (with its mate on the opposite strand).

ijhoskins commented 3 years ago

I think this is related to my understanding of --rf. Does this flag mean: 1) first segment maps to reverse strand and second segment maps to forward strand, or 2) leftmost segment maps to reverse strand and rightmost segment maps to forward strand

I assumed 1. If it is 2, I think the manual/help might better describe these flags as --fr (3' ends pointed towards one another) --rf (3' ends pointed away from one another)

ch4rr0 commented 3 years ago

--rf, --fr, --ff will override each other. Maybe we should treat them as mutually exclusive and throw an error if they're used in combination with each other.

Here's a detailed explanation from the manual.

    --fr/--rf/--ff

The upstream/downstream mate orientations for a valid paired-end
alignment against the forward reference strand. E.g., if --fr is
specified and there is a candidate paired-end alignment where mate 1
appears upstream of the reverse complement of mate 2 and the fragment
length constraints (-I and -X) are met, that alignment is valid. Also,
if mate 2 appears upstream of the reverse complement of mate 1 and all
other constraints are met, that too is valid. --rf likewise requires
that an upstream mate1 be reverse-complemented and a downstream mate2 be
forward-oriented. --ff requires both an upstream mate 1 and a downstream
mate 2 to be forward-oriented. Default: --fr (appropriate for Illumina's
Paired-end Sequencing Assay).