amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

The db is bigger and bigger but the alignment time is shorter and shorter, why? #75

Closed wangzhenkeep closed 7 years ago

wangzhenkeep commented 8 years ago

I have a db (all.fasta) containing 169471 sequences, produced from the RDP database. The smaller dbs were produced with this command: head -num all.fasta > new.fasta. The SNAP command is: snap paired db all.fastq.1.fq all.fastq.2.fq -I -o test.sam -t 4 -x -f -h 250 -d 12 -n 25 -pre -map -om 25 -omax 25 -D 25. The machine and the input files are the same for every run. The relationship between db size and running time is shown in the tables:

sequence_num  Size (MB)  Time
4000          5.8        8m4s
8000          11.5       12m32s
16500         23.8       12m44s
33000         47.7       6m38s
49500         71.5       6m29s
66000         95.3       5m4s
82500         119.1      5m45s
169471        244.7      3m6s

sequence_num  MapQ>=10  MapQ<10  Unaligned  Pairs
4000          11.61%    86.38%   2.01%      86.02%
8000          16.00%    82.11%   1.88%      86.82%
16500         6.18%     92.11%   1.71%      87.30%
33000         33.35%    64.18%   2.47%      89.80%
49500         39.31%    54.65%   6.05%      90.19%
66000         37.98%    55.69%   6.33%      90.15%
82500         38.21%    52.99%   8.80%      87.05%
169471        33.20%    58.34%   8.46%      86.58%
xhongyi commented 8 years ago

Can you post the percentage of reads that are aligned in pairs? SNAP will try to align reads in pairs and when it fails, it will align each read individually.

Note that SNAP's paired alignment is faster than single-end alignment.

One possibility is that with a larger DB, more reads are aligned in pairs, so the single-end alignment step is skipped for them. Maybe that's why you see a greater speed.

Hongyi


Hongyi Xin

Computer Science Department Carnegie Mellon University 5000 Forbes ave. Pittsburgh PA 15213 http://www.cs.cmu.edu/~hxin/

bolosky commented 8 years ago

Probably because it’s finding better matches with the larger fasta file.

SNAP’s optimizations make it run faster the better the match it finds, in several ways, so even though there’s more reference to search through, it will go faster.

For instance, it spends a lot of its time computing distances between the read and the reference using an algorithm that takes time proportional to the length of the read times the final edit distance, but that is able to give up when it knows the answer will be too large to affect the result. So, if it can find a good match then it will compute the edit distance quickly (because it is small) and then will be able to give up quickly on not-so-good matches that otherwise would have taken a lot of time. Other optimizations allow it to avoid trying to compute the edit distance at all in some cases once it’s found a good match.
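The early give-up behaviour described above can be sketched roughly as follows. This is not SNAP's code: `bounded_edit_distance` and `best_match` are hypothetical names, and the banded dynamic programming used here is just one generic way to compute an edit distance with a cutoff. The point it illustrates is that once a good match lowers the limit, every later candidate is abandoned after only a few rows of work.

```python
def bounded_edit_distance(read, ref, limit):
    """Return the edit distance between read and ref if it is <= limit, else None.

    Only cells within `limit` of the main diagonal are computed, and the
    function gives up as soon as a whole row exceeds the limit.
    """
    n, m = len(read), len(ref)
    if abs(n - m) > limit:
        return None  # the length difference alone already exceeds the limit
    big = limit + 1  # stands in for "too large to affect the result"
    # prev[j] holds the distance between read[:i-1] and ref[:j], capped at big.
    prev = [j if j <= limit else big for j in range(m + 1)]
    for i in range(1, n + 1):
        lo = max(1, i - limit)
        hi = min(m, i + limit)
        cur = [big] * (m + 1)
        if i <= limit:
            cur[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1, big)
        if min(cur[lo - 1:hi + 1]) > limit:
            return None  # every cell in the band is over the limit: give up
        prev = cur
    return prev[m] if prev[m] <= limit else None


def best_match(read, candidates, max_distance):
    """Return (index, distance) of the best candidate, tightening the limit as we go."""
    best_index, best_distance = None, None
    limit = max_distance
    for index, ref in enumerate(candidates):
        d = bounded_edit_distance(read, ref, limit)
        if d is not None and (best_distance is None or d < best_distance):
            best_index, best_distance = index, d
            limit = d  # later candidates only matter if they are at least this good
    return best_index, best_distance
```

With a small limit the band is narrow, so the work per candidate is roughly the read length times the limit, which matches the intuition that a good first match makes everything after it cheaper.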

So, this isn’t evidence of a bug, it’s exactly what should happen.

--B


wangzhenkeep commented 8 years ago

As the db contains more and more sequences, the number of unaligned reads also grows. How does this happen?

wangzhenkeep commented 8 years ago

What is the rule for counting the reads with MapQ>=10? When I count only the reads with MapQ>=10 myself, the number differs from SNAP's report.

xhongyi commented 8 years ago

MapQ is calculated as described here: http://genome.sph.umich.edu/wiki/Mapping_Quality_Scores

If the larger db is more repetitive, then you will have more reads mapped to multiple places. If that is the case, then these now multi-mapped reads will have lower MapQ.
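As a rough illustration of why extra candidate locations pull MapQ down, here is a simplified Phred-style model. It is not SNAP's actual formula: `mapq_from_likelihoods` is a hypothetical helper, the likelihood values are made up, and the cap of 70 is an arbitrary assumption. It only shows that the score is -10 * log10 of the probability that the chosen location is wrong, and that this probability rises when other locations match nearly as well.

```python
import math


def mapq_from_likelihoods(likelihoods, cap=70):
    """Phred-scaled confidence that the best-scoring location is the true one.

    `likelihoods` holds one relative likelihood per candidate location.
    The cap is an assumed upper bound, used when no alternative exists.
    """
    total = sum(likelihoods)
    best = max(likelihoods)
    p_wrong = 1.0 - best / total  # share of the likelihood held by other locations
    if p_wrong <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(p_wrong))))


print(mapq_from_likelihoods([1.0]))         # unique hit
print(mapq_from_likelihoods([1.0, 0.001]))  # one much weaker alternative
print(mapq_from_likelihoods([1.0, 1.0]))    # two equally good locations
```

In this model a read with two equally good locations gets a score of about 3, well under a cutoff of 10, even though it is aligned.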

MapQ>0 is a simple default cutoff for SNAP. It is how SNAP decides whether a mapping is of low confidence. However, it is a rather rough measure.

So in short, I think your larger db may be more repetitive: more reads are mapped, but a larger portion of them are now ambiguously mapped and therefore have lower MapQ scores.

Hongyi


xhongyi commented 8 years ago

Sorry. I meant mapQ>10.
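For checking the counts by hand, a minimal sketch of tallying a SAM file against a MapQ cutoff is shown below. The column positions (FLAG in column 2, MAPQ in column 5) and the flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary) come from the SAM format; `tally_mapq` itself is a hypothetical helper, and how SNAP builds its own report is not stated in this thread. One possible source of a mismatch, offered only as an assumption, is that a run with -om/-omax can write several alignments per read, so counting SAM lines is not the same as counting reads. The cutoff is a parameter because the thread mentions both >=10 and >10.

```python
def tally_mapq(sam_lines, threshold=10):
    """Count primary SAM records as high MapQ, low MapQ, or unaligned."""
    counts = {"high": 0, "low": 0, "unaligned": 0}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x900:
            continue  # secondary or supplementary record: not a new read
        if flag & 0x4:
            counts["unaligned"] += 1
        elif mapq >= threshold:
            counts["high"] += 1
        else:
            counts["low"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed file name): python tally_mapq.py test.sam
    with open(sys.argv[1]) as handle:
        print(tally_mapq(handle))
```

Passing threshold=11 gives the strict >10 reading of the cutoff.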
