BenLangmead / bowtie2

A fast and sensitive gapped read aligner
GNU General Public License v3.0
638 stars 160 forks source link

Feature request: another mode of "--un-mates" #442

Closed ZhangMH2000 closed 11 months ago

ZhangMH2000 commented 11 months ago

Dear developer, Hello. I appreciate your efforts in developing this tool and adding the “--un-mates” option. I am currently using SAHMI to annotate microbial information from single-cell sequencing data (scRNA). SAHMI annotates both human and microbial information when using the kraken2 database. However, I would like to apply a more stringent method for annotation, which is to first filter out the host information using bowtie2, and then annotate the filtered data using the kraken2 database. I tried using the “--un-conc” option of bowtie2, which outputs all paired-reads that are unaligned with the reference genome. However, I don’t know whether it is because of the characteristics of scRNA, even though 100.00% reads were paired and 50.41% overall alignment rate, only 1.59% reads were aligned concordantly >=1 times. Therefore, the filtered reads still contain plenty of human reads. The output was:

7261415 reads; of these:
  7261415 (100.00%) were paired; of these:
    7145843 (98.41%) aligned concordantly 0 times
    34064 (0.47%) aligned concordantly exactly 1 time
    81508 (1.12%) aligned concordantly >1 times
    ----
    7145843 pairs aligned concordantly 0 times; of these:
      6661 (0.09%) aligned discordantly 1 time
    ----
    7139182 pairs aligned 0 times concordantly or discordantly; of these:
      14278364 mates make up the pairs; of these:
        7201723 (50.44%) aligned 0 times
        2756103 (19.30%) aligned exactly 1 time
        4320538 (30.26%) aligned >1 times
50.41% overall alignment rate

So I further tried the “--un-mates” option, and although it outputs all unaligned reads, the two files obtained are not paired, making them incompatible with the subsequent paired kraken2 process. Therefore, I would like to consult with your professional opinion. Is bowtie2 suitable for scRNA sequencing data? If the remaining reads are mostly human, do you think it is possible to develop a new option “--un-align” to output paired reads that are neither concordantly nor discordantly aligned with the reference genome? So that more host-depleted reads can be filtered. I would be very grateful if you could provide assistance for this request. Thank you!

ch4rr0 commented 11 months ago

Hello @ZhangMH2000,

So I further tried the “--un-mates” option, and although it outputs all unaligned reads, the two files obtained are not paired...

Can you please explain why do you think the reads are unpaired? Here is an example of a sample run making use of --un-mates flag:

$ ./bowtie2 -x example/index/lambda_virus -1 example/reads/reads_1.fq -2 example/reads/reads_2.fq --un-mates foo -S /dev/null

$ head foo.1
@r146
TTCGGNNCGCNNNNAAANAGGCTGAGCACGGTGTACGTCAGNCCGGAAAAGTGC
+
)!&))%(%'($"!!&$)!++#$*'!!*(&!&+&&%"''##$$"$#'*%!%)#"+
@r152
CTCCCCNCCGNNCTNNCGNCTTCGTAATACTCACGCTGCT
+
"(*(*#%)**%($(!"''!&$+)&'(%'*"$"&!#$!!&!
@r198
CTTTTTGACGCCGTNTTCCACGTACTGTCCGGAATATACGAC

$ head foo.2
@r146
NNNGNTGTACCGGCTGTCTGGTATGTATGAGTTTGTGGTGAATAATGCCCCTGAACAGACAGAGGACGCCGGGCCCGCAGAGCCTGTTTCTGCGGGAAAGTGTTCGACGGTGAGCTGAGNTTTGCCCTGAAACTGG
+
%!!'"(+&""+$)")#*%+%&+$&$"'++!'(%)&#!*%'(&*$)$$'#(!)$'#!%&*%(#))##(*%#)+)&+!(!'*#)*"((!&$!+'&$(++!#"#+%+!(!$+(!&"!&((++&(&(!(++(%)($)!%"
@r152
NATTTGNGNCAGACCTTTTCCATGAATTGGTAACACCATCGATTGTGCTGGAACTGCTGGATGAACGGGAAAGAAACCAGCAATACATCAAACGCCGCGACC
+
!%$'''#)#'$(&'+!%$#)('%!"++!"("#'(&""'(&"$('")+#*#)(**')%&'&!($%%%%(+#')$%&&#()!)''"$$&%$%!($$)#(#%%%*
@r198
GCGCTGTTCCGTNCTNCAAAGCCGTCAAGGAGAAGCTGGATACCCGTCGTGGNT

...do you think it is possible to develop a new option “--un-align” to output paired reads that are neither concordantly nor discordantly aligned with the reference genome?

--un-mates <filename> is supposed to do exactly that. It creates two files, .1 and .2, that contain the unaligned mate1 and mate2 respectively. If does not work as described, can you post an example highlighting the issue?

Is bowtie2 suitable for scRNA sequencing data? Yes it should be.

Thank you

ZhangMH2000 commented 10 months ago

Hi @ch4rr0 From #417 , you showed an example that only one read of two paired-reads was aligned to the reference. However, the output was

$ cat out.1

$ cat out.2
@r1
GAATACTGGCGGATTACCGGGGAAGCTGGAGC
+
EDCCCBAAAA@@@@?>===<;;9:99987776

which means only read 1 was filtered and read 2 was kept. Actually, what I want to achieve is to eliminate both reads if one of the reads or both reads can be aligned to the reference.

However, I closed this issue after some learning. Although the 10X genomics single-cell RNA sequencing used a paired sequencing, the output --R1.fastq and R2.fastq, only R2.fastq contains the true sequence reads, and R1.fastq only contains bases information for barcodes and UMI. Unlike traditional paired-sequencing that both reads mean the true sequencing reads, I think I can only input the R2.fastq to align to the reference. Indeed, when I input only the R2.fastq as “unpaired reads”, the bowtie2 annotated 99% human reads. I then used the filtered reads to generate the matching R1.fastq. So I don’t need this “–un-align” function. Anyway, thank you so much for your reply. If you have any views on my process, please let me know. Thanks again!