Closed wangzhenkeep closed 7 years ago
Can you post the percentage of reads that are aligned in pairs? SNAP will try to align reads in pairs and when it fails, it will align each read individually.
Note that SNAP's paired alignment is faster than single-end alignment.
One possibility is that with a larger DB, more of your reads are aligned in pairs, which skips the single-end alignment step. That may be why you see greater speed.
Hongyi
On Wed, Aug 3, 2016 at 12:50 AM, CelLoud王震 notifications@github.com wrote:
I have a db (all.fasta) with 169471 sequences in it, produced from the RDP database. The other dbs are produced with this command: head -num all.fasta >new.fasta. The SNAP command is: snap paired db all.fastq.1.fq all.fastq.2.fq -I -o test.sam -t 4 -x -f -h 250 -d 12 -n 25 -pre -map -om 25 -omax 25 -D 25. The machine is the same and the input files are the same. The relationship between db size and running time is shown in these tables:

| sequence_num | Size(M) | time |
|---|---|---|
| 4000 | 5.8 | 8m4s |
| 8000 | 11.5 | 12m32s |
| 16500 | 23.8 | 12m44s |
| 33000 | 47.7 | 6m38s |
| 49500 | 71.5 | 6m29s |
| 66000 | 95.3 | 5m4s |
| 82500 | 119.1 | 5m45s |
| 169471 | 244.7 | 3m6s |

| sequence_num | MapQ>=10 | MapQ<10 | Unaligned | Pairs |
|---|---|---|---|---|
| 4000 | 11.61% | 86.38% | 2.01% | 86.02% |
| 8000 | 16.00% | 82.11% | 1.88% | 86.82% |
| 16500 | 6.18% | 92.11% | 1.71% | 87.30% |
| 33000 | 33.35% | 64.18% | 2.47% | 89.80% |
| 49500 | 39.31% | 54.65% | 6.05% | 90.19% |
| 66000 | 37.98% | 55.69% | 6.33% | 90.15% |
| 82500 | 38.21% | 52.99% | 8.80% | 87.05% |
| 169471 | 33.20% | 58.34% | 8.46% | 86.58% |
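A note on the `head -num` step in the post above: `head` counts lines, not sequences, so it yields the intended number of records only when every record occupies the same number of lines. The following is a minimal sketch of cutting by record instead. The function name `first_n_fasta_records` is illustrative and not part of SNAP or any library.

```python
from typing import Iterable, Iterator


def first_n_fasta_records(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield the lines of the first n FASTA records (header plus sequence lines)."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            # A '>' header starts a new record.
            seen += 1
            if seen > n:
                break
        if seen > 0:
            # Skip anything that appears before the first header.
            yield line


# Small in-memory example; a real run would iterate over an open file instead.
demo = [">a\n", "AC\n", "GT\n", ">b\n", "TT\n", ">c\n", "GG\n"]
subset = list(first_n_fasta_records(demo, 2))
```

With a real file, the same generator can be fed an open file handle and its output written with `writelines`.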
Computer Science Department Carnegie Mellon University 5000 Forbes ave. Pittsburgh PA 15213 http://www.cs.cmu.edu/~hxin/
Probably because it’s finding better matches with the larger fasta file.
Several of SNAP’s optimizations make it run faster when it finds a better match, so even though there’s more reference to search through, it will go faster.
For instance, it spends a lot of its time computing distances between the read and the reference using an algorithm that takes time proportional to the length of the read times the final edit distance, but that is able to give up when it knows the answer will be too large to affect the result. So, if it can find a good match then it will compute the edit distance quickly (because it is small) and then will be able to give up quickly on not-so-good matches that otherwise would have taken a lot of time. Other optimizations allow it to avoid trying to compute the edit distance at all in some cases once it’s found a good match.
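The give-up-early behavior described above can be sketched roughly as follows. This is not SNAP's implementation: it is a plain banded dynamic-programming edit distance whose cost scales with the distance limit rather than the final distance, and the names `bounded_edit_distance` and `best_match` are hypothetical. It only illustrates how a good match found early lets later, worse candidates be abandoned quickly.

```python
from typing import List, Optional, Tuple


def bounded_edit_distance(read: str, ref: str, limit: int) -> Optional[int]:
    """Return the edit distance if it is <= limit, otherwise None.

    Only cells within `limit` of the diagonal are filled, and the scan stops
    as soon as every cell in the current row exceeds the limit.
    """
    n, m = len(read), len(ref)
    if abs(n - m) > limit:
        return None
    inf = limit + 1  # any value above the limit is treated as "too far"
    prev = [j if j <= limit else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo = max(0, i - limit)
        hi = min(m, i + limit)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            best = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
            cur[j] = min(best, inf)
        if min(cur[lo:hi + 1]) > limit:
            return None  # give up: no alignment within the limit is possible
        prev = cur
    return prev[m] if prev[m] <= limit else None


def best_match(read: str, candidates: List[str], max_distance: int) -> Optional[Tuple[int, int]]:
    """Return (candidate index, distance) of the best candidate, tightening the limit as we go."""
    best = None
    limit = max_distance
    for idx, ref in enumerate(candidates):
        d = bounded_edit_distance(read, ref, limit)
        if d is not None and (best is None or d < best[1]):
            best = (idx, d)
            limit = d  # later candidates must do at least this well
    return best
```

Once a candidate with a small distance has been found, `limit` shrinks, the band narrows, and poor candidates are rejected after only a few rows.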
So this isn’t evidence of a bug; it’s exactly what should happen.
--B
From: CelLoud王震 [mailto:notifications@github.com] Sent: Wednesday, August 3, 2016 12:50 AM To: amplab/snap snap@noreply.github.com Subject: [amplab/snap] The db is bigger and bigger but the aligenment time is shorter and shorter, why? (#75)
As the number of sequences in the db grows, the number of unaligned reads grows too. How does this happen?
What are the rules for counting the reads with MapQ>=10? When I count only the reads with MapQ>=10 myself, the number is different from SNAP's report.
MapQ is calculated as described here: http://genome.sph.umich.edu/wiki/Mapping_Quality_Scores
If the larger db is more repetitive, then you will have more reads mapped to multiple places. If that is the case, then these now multi-mapped reads will have lower MapQ.
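As a reminder of the general definition (per the SAM specification; the exact way SNAP estimates the probability is heuristic and not shown here), MAPQ is a Phred-scaled probability that the reported position is wrong:

```latex
\mathrm{MAPQ} = -10 \log_{10} p_{\text{wrong}},
\qquad
p_{\text{wrong}} = 10^{-\mathrm{MAPQ}/10}
```

So MAPQ 10 corresponds to $p_{\text{wrong}} = 0.1$. As an illustration, a read that matches two locations equally well has $p_{\text{wrong}} \approx 0.5$, which gives $\mathrm{MAPQ} \approx 3$, well below a cutoff of 10.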
MapQ>0 is a simple default cutoff for SNAP. It is how SNAP determines whether a mapping is of low confidence. However, it is a rather rough measure.
So in short, I think your larger db may be more repetitive: while more reads are mapped, a larger portion of them are now ambiguously mapped and therefore have lower MapQ scores.
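For counting reads by MapQ directly from the SAM output, a minimal sketch follows. It relies only on the standard SAM columns (FLAG in column 2, MAPQ in column 5) and standard FLAG bits. One possible cause of a mismatch with SNAP's report, which is an assumption and not confirmed in this thread, is that options such as -om/-omax may add extra alignment records per read, so counting SAM lines is not the same as counting reads. The function name `mapq_summary` is illustrative.

```python
from collections import Counter
from typing import Iterable

UNMAPPED = 0x4          # SAM FLAG bit: segment unmapped
SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment


def mapq_summary(sam_lines: Iterable[str], threshold: int = 10) -> Counter:
    """Count primary records as unaligned, MAPQ >= threshold, or MAPQ < threshold."""
    counts: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & (SECONDARY | SUPPLEMENTARY):
            # Extra records for a read that was already counted once.
            counts["extra_alignments"] += 1
            continue
        if flag & UNMAPPED:
            counts["unaligned"] += 1
        elif mapq >= threshold:
            counts["mapq_ge"] += 1
        else:
            counts["mapq_lt"] += 1
    return counts


# Tiny hand-written example records, not real SNAP output.
sam = [
    "@HD\tVN:1.6\n",
    "r1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t0\tref\t1\t3\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t256\tref\t9\t3\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
summary = mapq_summary(sam)
```

Dividing each count by the number of primary records gives percentages comparable in form to the table in the original post.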
Hongyi
Sorry. I meant mapQ>10.