BenLangmead / bowtie2

A fast and sensitive gapped read aligner
GNU General Public License v3.0
639 stars 159 forks source link

How To Extract Reads-Pairs Aligned Concordantly Exactly 1 Time from Bowtie2 (version 2.3.4.1 or high)? #406

Closed D1Pan closed 1 year ago

D1Pan commented 1 year ago

I tried to extract reads-pairs aligned concordantly exactly 1 time from Bowtie2 (version 2.3.4.1 or high). This was fixed in Bowtie2's version a few years ago (the previous solution and another solution), but now the problem has resurfaced.

When I run:

bowtie2 -x sample.index -p 50 -t -1 sample_R1.trimmed.fastq -2 sample_R2.trimmed.fastq -S example.sam -k2 --no-discordant --no-mixed

The results would look like this:

Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Multiseed full-index search: 00:00:35
20826672 reads; of these:
  20826672 (100.00%) were paired; of these:
    15275620 (73.35%) aligned concordantly 0 times
    5048169 (24.24%) aligned concordantly exactly 1 time
    502883 (2.41%) aligned concordantly >1 times
26.65% overall alignment rate
Time searching: 00:00:35
Overall time: 00:00:35

What I want to separate and extract is the '5048169 (24.24%) aligned concordantly exactly 1 time' reads-pairs.

But when I followed the previous workarounds, I found:

1)

samtools view -hf 0x2 example.sam | grep -v "XS:i:" | wc -l

or

samtools view -bS example.sam > example.bam; samtools view -hf 0x2 example.bam | grep -v "XS:i:" | wc -l

and result(s): 10196512, which is not equal to 5,048,169*2 = 10,096,338 reads that I want.

2)

samtools view -hf 0x2 example.sam | grep -v "XS:i:" | foo.py | wc -l

or

samtools view -bS example.sam > example.bam; samtools view -hf 0x2 example.bam | grep -v "XS:i:" | foo.py | wc -l

and result(s): 10193259, which is not equal to 5,048,169*2 = 10,096,338 reads that I want.

foo.py is (from Mr. Devon Ryan's solution):

#!/usr/bin/env python
import csv
import sys

f = csv.reader(sys.stdin, dialect="excel-tab")
of = csv.writer(sys.stdout, dialect="excel-tab")
last_read = None
for line in f :
    #take care of the header
    if(line[0][0] == "@") :
        of.writerow(line)
        continue

    if(last_read == None) : 
        last_read = line
    else :
        if(last_read[0] == line[0]) :
            of.writerow(last_read)
            of.writerow(line)
            last_read = None
        else :
            last_read = line

3)

samtools view example.sam | grep YT:Z:CP | uniq -u | wc -l

or

samtools view -bS example.sam | grep YT:Z:CP | uniq -u | wc -l

and result(s): 12107870, which is not equal to 5,048,169*2 = 10,096,338 reads that I want.

4)

samtools view -hf 0x2 example.sam | grep -v "XS:i:" | grep YT:Z:CP | uniq -u | wc -l

or samtools view -hf 0x2 example.sam | grep -v "XS:i:" | grep YT:Z:CP | wc -l or samtools view -hf 0x2 example.bam | grep -v "XS:i:" | grep YT:Z:CP | wc -l

and result(s): 10159819, which is not equal to 5,048,169*2 = 10,096,338 reads that I want.

I've tried almost every existing solution but to no avail. Urgently ask for your help.

ch4rr0 commented 1 year ago

Hello,

Here's my way of extracting the unique concordant reads. My command line is similar to yours but with the addition of the --no-unal flag to filter out unaligned reads and --sam-nohead to get rid of the SAM header.

$ ./bowtie2 -x zebra_fish -1 reads_1.fq -2 reads_2.fq --threads 4 --upto 100 --no-discordant --no-mixed --sam-nohead --no-unal -S out.sam
100 reads; of these:
  100 (100.00%) were paired; of these:
      64 (64.00%) aligned concordantly 0 times
      27 (27.00%) aligned concordantly exactly 1 time
      9 (9.00%) aligned concordantly >1 times
36.00% overall alignment rate

$ cat out.sam | grep -v XS:i | wc -l
    54
D1Pan commented 1 year ago

Hello,

Here's my way of extracting the unique concordant reads. My command line is similar to yours but with the addition of the --no-unal flag to filter out unaligned reads and --sam-nohead to get rid of the SAM header.

$ ./bowtie2 -x zebra_fish -1 reads_1.fq -2 reads_2.fq --threads 4 --upto 100 --no-discordant --no-mixed --sam-nohead --no-unal -S out.sam
100 reads; of these:
  100 (100.00%) were paired; of these:
      64 (64.00%) aligned concordantly 0 times
      27 (27.00%) aligned concordantly exactly 1 time
      9 (9.00%) aligned concordantly >1 times
36.00% overall alignment rate

$ cat out.sam | grep -v XS:i | wc -l
    54

Hi, Thanks for your reply. I followed your suggestion and did this:

bowtie2 -x ./index -p 50 -t -1 ./R1.trimmed.fastq -2 ./R2.trimmed.fastq -S example.k1.2.sam --no-discordant --no-mixed --sam-nohead --no-unal

Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Multiseed full-index search: 00:06:15
20826672 reads; of these:
  20826672 (100.00%) were paired; of these:
    15275909 (73.35%) aligned concordantly 0 times
    5048396 (24.24%) aligned concordantly exactly 1 time
    502367 (2.41%) aligned concordantly >1 times
26.65% overall alignment rate
Time searching: 00:06:15
Overall time: 00:06:15

And:

cat example.k1.2.sam | grep -v XS:i | wc -l

10119065

The end result is not the expected number, even odd. Any further suggestions?

Thanks again.

Regards, Sugar

ch4rr0 commented 1 year ago

We pushed a change to the bug_fixes branch that would make the "unique concordant pair" count compatible with filters on the SAM output. Example:

{
    m1=$0
    getline
    m2=$0

    if (m1 ~ /YT:Z:CP/ && m2 ~ /YT:Z:CP/ && m1 !~ /XS:i/ && m2 !~ /XS:i/)
        print m1"\n"m2
}

If I were to save the above AWK script to a file unique.awk and run the command-line below the output should now be twice "the unique concordant pairs" in the output summary, as expected.

./bowtie2-align-s -x index -1 reads_1.fq -2 reads_2.fq | awk -f unique.awk | wc -l

Please let us know your thoughts on these changes.

D1Pan commented 1 year ago

I appreciate your help. I understand the principle you mentioned, but it does not solve the problem perfectly. Perhaps the reason for the problem is in the printed output of the Bowtie2 visualization pipeline, rather than the reads filtering self. I've written an alternative solution, but it doesn't perfectly replicate the 'Aligned Concordantly Exactly 1 Time' value shown on Bowtie2's output display, like your suggestion; but the almost same principles/parameters.