BenLangmead / bowtie2

A fast and sensitive gapped read aligner
GNU General Public License v3.0
639 stars 159 forks source link

-a option returns one valid alignment, but -k 1 returns none. #386

Open bmorledge-hampton19 opened 2 years ago

bmorledge-hampton19 commented 2 years ago

I am confused by the behavior of the -k and -a options. I am attempting to align a set of short (23-31 bp) reads which may have up to two adjacent mismatches in them. Because of this, I am decreasing -L significantly (as low as 10) in order to still align those reads with 2 mismatches. I am aware that in many cases, this will cause the read to fail to be aligned because of the -D option. To circumvent this, I've tried to use the -k option to simply accept the first alignment that meets the score threshold (which should be tolerant to two mismatches). However, this is not working as expected.

Using the following read (which is expected to have a CC>TT mismatch at the 7th and 8th positions from the right-hand side):

@SRR3062593.2063075.1 HISEQ:224:C5LRDANXX:5:1311:8220:30771 length=50
GGTATTGCTGGAATTTCCAGCATTTGCCAG
+SRR3062593.2063075.1 HISEQ:224:C5LRDANXX:5:1311:8220:30771 length=50
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFF

And the following alignment parameters (and hg19 index):

-L 10 -k 1 --mp 6,6 --score-min L,-12,-0.1

I get the following result:

1 reads; of these:
  1 (100.00%) were unpaired; of these:
    1 (100.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
0.00% overall alignment rate

However, when I run the same read with the -a flag instead of the -k 1 parameter, it aligns the read to exactly one site. Clearly, there is one (and only one) valid alignment for the read, so why isn't this found using -k 1?

Alternatively, I could circumvent this problem entirely if I had a way to allow 2 adjacent mismatches during seeding. If anyone is aware of a tool/method to achieve this, I'm all ears!

I am using the most recent version of bowtie2, v2.4.5

ch4rr0 commented 2 years ago

Hello,

At first glance, this seems like a bug to me. I will try to recreate the issue and get back to you. Thank you in advance for your patience.

ch4rr0 commented 1 year ago

Hello,

I apologize for the delayed response. Bowtie, https://github.com/BenLangmead/bowtie, may be a better option for you. It's a predecessor to bowtie2 that works better with short reads. The following options from bowtie may be what you are looking for:

Alignment:
  -v <int>           report end-to-end hits w/ <=v mismatches; ignore qualities
    or
  -n/--seedmms <int> max mismatches in seed (can be 0-3, default: -n 2)
ch4rr0 commented 1 year ago

If you are able to provide the index used that will help expedite the recreation of the issue.

seahurt commented 1 year ago

It happends in my work often. When I using a big index, -k 100 might fail to get any aliment, while -a won't.

I think it might be caused by the search effort options or the randomization mechanism.

bmorledge-hampton19 commented 1 year ago

I was using an index of the hg19 genome. Apologies for not providing this information previously. I've edited my initial post to make this clear. I can also provide the files for the index if you'd prefer.

schorlton commented 1 year ago

@ch4rr0 - any ideas here? I'm interested in using bowtie2 to filter reads which map anywhere against a reference, therefore -k 1 seems like it would be the appropriate approach to reduce alignment time, however changing behaviour so the alignment is not found as in this issue makes me a little nervous. Thanks for your consideration!