alignment optimization - Githubissues

alignment of select samples is taking a significant amount of times (+6 days). attempting to determine:

is alignment time due to multimapping or due to NOVAALIGN spending a significant time on unmapped reads?

will run alignment with a mm of 2. determine time. will subset reads with a mm of 2 and split files into smaller files. remap the reads with a mm of 2 using a max of 999 (current setting) merge unique reads and mm of 999
if it is due to multimapping then reducing max of 999 to max of 2 should decrease time (from 6+ days). if it's due to other issues with NOVAALIGN it will still take 6+ days to align.

I ran the alignment test we talked about and three samples have finished. Review this data has left me with some questions, hopefully the group can help clarify. For reference, this sample had 12 splits.

selected files

I took 6 files, 3 split files that did finish alignment (N=11,12,13) and 3 split files that did not finish alignment (N=14,15,16). each file had 2,888,431 reads.

alignment

I used novoalign with the flag -m ALL 2 as we discussed. Sample 11 finished, so i'll use that as an example.

This was the time stats:

Elapsed Time: 62525.062 (secs.) ~17 hours
CPU Time: 31,933.93 (min.)

I looked at the alignment information - 50% of the reads were tossed out due to read length, 17% of the reads were tossed because no mapping was found; 36% mapped - of which 24% was unique and 9% was multimapped. Read Sequences: 2,888,431

Unique Alignment: 690,061 (23.9%)
Multi Mapped: 258,254 ( 8.9%)
No Mapping Found: 488,195 (16.9%)
QC Failures -- Homopolymer Filter: 1,557 ( 0.1%) -- Read Length: 1,450,364 (50.2%)

NH Flag summary

I then went and grabbed the NH flags to make sure that novalign performed as we expected, and wrote an r script that plotted all of the reads. As you can see from the graph below, the -r ALL 2 flag is obviously not working as we expect it to (http://www.novocraft.com/documentation/novoalign-2/novoalign-user-guide/novoalign-command-options/). While the majority of our data is mm < 10, we have some reads that are mapping over a 1000 times. Ro_Clip split 11 nh

To compare these are samples 12,13 which also finished: Ro_Clip split 12 nh Ro_Clip split 13 nh

Finally, I pulled the -r ALL 999 file which shows that the number of multimapping in this sample (sample 11) is insanely high m_999_nh

follow-up

I pulled the reads with the NH:i:1000 flag and ran the following 5 tests, with the same params in all tests (System 72 CPU Threads, 377G Bytes RAM):

Read information

Test 1 - EXHAUSTIVE, 2 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 2 - EXHAUSTIVE, 999 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 3 - ALL, 2 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 4 - ALL, 999 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 5 - NONE, 2 -Failed: integer not accepted with "NONE" flag

Time Information

Test 1 - EXHAUSTIVE, 2 -Timers --Elapsed Time: 1.257 (secs.) --CPU Time: 2.31 (secs.) Test 2 - EXHAUSTIVE, 999 -Timers --Elapsed Time: 1.416 (secs.) --CPU Time: 2.70 (secs.) Test 3 - ALL, 2 -Timers --Elapsed Time: 3.999 (secs.) --CPU Time: 5.53 (secs.) Test 4 - ALL, 999 -Timers --Elapsed Time: 4.100 (secs.) --CPU Time: 6.08 (secs.) Test 5 - NONE, 2 --Failed: integer not accepted with "NONE" flag

When I pulled the multimap numbers for these tests, it was quite surprising: nh_1000_e2 nh_1000_e999 nh_1000_m2 nh_1000_m999

The results can be summarized as follows:

the integer that follows the EXHAUSTIVE flag does not limit the multimapping to 2 (5% >2000), however as compared to the 999 flag (5.3% >2000) it does reduce the number of multimaps
the ALL flag in comparison, did not have as many multimaps, as the 2 flag limited the total number to 100% <1010 with the 999 flag and 5.3% <2000 with the 999 flag
what is most confusing is the time that it took for this to run. Even though the EXHAUSTIVE flags had more multimaps performed, they both performed more quickly than the ALL flag

Update:

Information on the method can be found below:

Method	Limit	-t threshold	Description
None	NA	Optional	No alignments will be reported. The read will be reported as a type R with no alignment locations. A reporting “limit” should not be set.
Random	NA	Optional	A single alignment location is randomly chosen from amongst all the alignment results. A reporting “limit” should not be set.
All	Optional	Optional	All alignment locations are reported. The ‘All’ method can optionally specify a limit for the number of lines reported. e.g. ‘-r All 10’ will report at most 10 randomly selected alignments.
Exhaustive	Required	Required	Reports all alignments with a alignment score, P(R\|Ai), less than or equal to the threshold plus the -R setting. The ‘Exhaustive’ method requires that a limit for the number of lines reported. e.g. ‘-r E 10’ will report at most 10 randomly selected alignments per read. This is to avoid situations where high copy number repeats result in reporting millions of alignments for a read. The alignment threshold (-t option) must be set when using the -r Exhaustive option.

Summary of the 4 completed flag testing parameters are as follows: Graph Subtitle	Method	Limit	Elapsed Time	CPU Time
a2	ALL	2	3.999 seconds	5.53 seconds
a999	ALL	999	4.100 seconds	6.08 seconds
e2	EXHAUSTIVE	2	1.257 seconds	2.31 seconds
e999	EXHAUSTIVE	999	1.416 seconds	2.70 seconds

It was determined that the NH flag does NOT correlate to the total number of multi-maps in a file; instead this is the IH flag. Documentation is listed here(pg45), and highlighted below:

NH. On Number of reported alignments that contains the query in the current record. Only present if there is more than one alignment reported for the read. Note. In our interpretation of the SAM specification we have taken this as the count of alignments that would be reported if there wasn't a limit imposed by the r option, so this is the number of alignments found with the quality range (R) or threshold limits for r Exhaustive. The IH tag is the number of alignments that were stored in the SAM file.
IH. On Number of stored alignments in SAM that contains the query in the current record. Only present if there is more than one alignment reported for the read.

This is confirmed in the graphs of MM below: Ro_Clip split a2 ih Ro_Clip split a999 ih Ro_Clip split e2 ih Ro_Clip split e999 ih

option -r ALL 999 changed to -r EXHAUSTIVE 999

downstream needed to update NH/IH flag reading. completed in commit e234fef02beeeec58a3a822ebd576fe8ccca67f2 and acf7d707595e40d06bdabd6679024e8337823753

NCI-RBL / iCLIP

alignment optimization #85