Closed slsevilla closed 3 years ago
I ran the alignment test we talked about and three samples have finished. Review this data has left me with some questions, hopefully the group can help clarify. For reference, this sample had 12 splits.
I took 6 files, 3 split files that did finish alignment (N=11,12,13) and 3 split files that did not finish alignment (N=14,15,16). each file had 2,888,431 reads.
I used novoalign with the flag -m ALL 2 as we discussed. Sample 11 finished, so i'll use that as an example.
This was the time stats:
I looked at the alignment information - 50% of the reads were tossed out due to read length, 17% of the reads were tossed because no mapping was found; 36% mapped - of which 24% was unique and 9% was multimapped. Read Sequences: 2,888,431
I then went and grabbed the NH flags to make sure that novalign performed as we expected, and wrote an r script that plotted all of the reads. As you can see from the graph below, the -r ALL 2 flag is obviously not working as we expect it to (http://www.novocraft.com/documentation/novoalign-2/novoalign-user-guide/novoalign-command-options/). While the majority of our data is mm < 10, we have some reads that are mapping over a 1000 times.
To compare these are samples 12,13 which also finished:
Finally, I pulled the -r ALL 999 file which shows that the number of multimapping in this sample (sample 11) is insanely high
I pulled the reads with the NH:i:1000 flag and ran the following 5 tests, with the same params in all tests (System 72 CPU Threads, 377G Bytes RAM):
Test 1 - EXHAUSTIVE, 2 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 2 - EXHAUSTIVE, 999 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 3 - ALL, 2 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 4 - ALL, 999 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 5 - NONE, 2 -Failed: integer not accepted with "NONE" flag
Test 1 - EXHAUSTIVE, 2 -Timers --Elapsed Time: 1.257 (secs.) --CPU Time: 2.31 (secs.) Test 2 - EXHAUSTIVE, 999 -Timers --Elapsed Time: 1.416 (secs.) --CPU Time: 2.70 (secs.) Test 3 - ALL, 2 -Timers --Elapsed Time: 3.999 (secs.) --CPU Time: 5.53 (secs.) Test 4 - ALL, 999 -Timers --Elapsed Time: 4.100 (secs.) --CPU Time: 6.08 (secs.) Test 5 - NONE, 2 --Failed: integer not accepted with "NONE" flag
When I pulled the multimap numbers for these tests, it was quite surprising:
The results can be summarized as follows:
Update:
Method | Limit | -t threshold | Description |
---|---|---|---|
None | NA | Optional | No alignments will be reported. The read will be reported as a type R with no alignment locations. A reporting “limit” should not be set. |
Random | NA | Optional | A single alignment location is randomly chosen from amongst all the alignment results. A reporting “limit” should not be set. |
All | Optional | Optional | All alignment locations are reported. The ‘All’ method can optionally specify a limit for the number of lines reported. e.g. ‘-r All 10’ will report at most 10 randomly selected alignments. |
Exhaustive | Required | Required | Reports all alignments with a alignment score, P(R|Ai), less than or equal to the threshold plus the -R setting. The ‘Exhaustive’ method requires that a limit for the number of lines reported. e.g. ‘-r E 10’ will report at most 10 randomly selected alignments per read. This is to avoid situations where high copy number repeats result in reporting millions of alignments for a read. The alignment threshold (-t option) must be set when using the -r Exhaustive option. |
Summary of the 4 completed flag testing parameters are as follows: Graph Subtitle | Method | Limit | Elapsed Time | CPU Time |
---|---|---|---|---|
a2 | ALL | 2 | 3.999 seconds | 5.53 seconds |
a999 | ALL | 999 | 4.100 seconds | 6.08 seconds |
e2 | EXHAUSTIVE | 2 | 1.257 seconds | 2.31 seconds |
e999 | EXHAUSTIVE | 999 | 1.416 seconds | 2.70 seconds |
It was determined that the NH flag does NOT correlate to the total number of multi-maps in a file; instead this is the IH flag. Documentation is listed here(pg45), and highlighted below:
This is confirmed in the graphs of MM below:
option -r ALL 999 changed to -r EXHAUSTIVE 999
downstream needed to update NH/IH flag reading. completed in commit e234fef02beeeec58a3a822ebd576fe8ccca67f2 and acf7d707595e40d06bdabd6679024e8337823753
alignment of select samples is taking a significant amount of times (+6 days). attempting to determine:
is alignment time due to multimapping or due to NOVAALIGN spending a significant time on unmapped reads?
if it is due to multimapping then reducing max of 999 to max of 2 should decrease time (from 6+ days). if it's due to other issues with NOVAALIGN it will still take 6+ days to align.