NCI-RBL / iCLIP

RNA Biology Pipeline to Characterize protein-RNA Interactions
https://rbl-nci.github.io/iCLIP/
MIT License
4 stars 2 forks source link

alignment optimization #85

Closed slsevilla closed 3 years ago

slsevilla commented 3 years ago

alignment of select samples is taking a significant amount of times (+6 days). attempting to determine:

slsevilla commented 3 years ago

I ran the alignment test we talked about and three samples have finished. Review this data has left me with some questions, hopefully the group can help clarify. For reference, this sample had 12 splits.

selected files

I took 6 files, 3 split files that did finish alignment (N=11,12,13) and 3 split files that did not finish alignment (N=14,15,16). each file had 2,888,431 reads.

alignment

I used novoalign with the flag -m ALL 2 as we discussed. Sample 11 finished, so i'll use that as an example.

This was the time stats:

I looked at the alignment information - 50% of the reads were tossed out due to read length, 17% of the reads were tossed because no mapping was found; 36% mapped - of which 24% was unique and 9% was multimapped. Read Sequences: 2,888,431

NH Flag summary

I then went and grabbed the NH flags to make sure that novalign performed as we expected, and wrote an r script that plotted all of the reads. As you can see from the graph below, the -r ALL 2 flag is obviously not working as we expect it to (http://www.novocraft.com/documentation/novoalign-2/novoalign-user-guide/novoalign-command-options/). While the majority of our data is mm < 10, we have some reads that are mapping over a 1000 times. Ro_Clip split 11 nh

To compare these are samples 12,13 which also finished: Ro_Clip split 12 nh Ro_Clip split 13 nh

Finally, I pulled the -r ALL 999 file which shows that the number of multimapping in this sample (sample 11) is insanely high m_999_nh

follow-up

I pulled the reads with the NH:i:1000 flag and ran the following 5 tests, with the same params in all tests (System 72 CPU Threads, 377G Bytes RAM):

Read information

Test 1 - EXHAUSTIVE, 2 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 2 - EXHAUSTIVE, 999 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 3 - ALL, 2 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 4 - ALL, 999 -Read Sequences: 19 --Unique Alignment: 0 ( 0.0%) --Multi Mapped: 19 (100.0%) --No Mapping Found: 0 ( 0.0%) Test 5 - NONE, 2 -Failed: integer not accepted with "NONE" flag

Time Information

Test 1 - EXHAUSTIVE, 2 -Timers --Elapsed Time: 1.257 (secs.) --CPU Time: 2.31 (secs.) Test 2 - EXHAUSTIVE, 999 -Timers --Elapsed Time: 1.416 (secs.) --CPU Time: 2.70 (secs.) Test 3 - ALL, 2 -Timers --Elapsed Time: 3.999 (secs.) --CPU Time: 5.53 (secs.) Test 4 - ALL, 999 -Timers --Elapsed Time: 4.100 (secs.) --CPU Time: 6.08 (secs.) Test 5 - NONE, 2 --Failed: integer not accepted with "NONE" flag

When I pulled the multimap numbers for these tests, it was quite surprising: nh_1000_e2 nh_1000_e999 nh_1000_m2 nh_1000_m999

The results can be summarized as follows:

slsevilla commented 3 years ago

Update:

Method Limit -t threshold Description
None NA Optional No alignments will be reported. The read will be reported as a type R with no alignment locations. A reporting “limit” should not be set.
Random NA Optional A single alignment location is randomly chosen from amongst all the alignment results. A reporting “limit” should not be set.
All Optional Optional All alignment locations are reported. The ‘All’ method can optionally specify a limit for the number of lines reported. e.g. ‘-r All 10’ will report at most 10 randomly selected alignments.
Exhaustive Required Required Reports all alignments with a alignment score, P(R|Ai), less than or equal to the threshold plus the -R setting. The ‘Exhaustive’ method requires that a limit for the number of lines reported. e.g. ‘-r E 10’ will report at most 10 randomly selected alignments per read. This is to avoid situations where high copy number repeats result in reporting millions of alignments for a read. The alignment threshold (-t option) must be set when using the -r Exhaustive option.
Summary of the 4 completed flag testing parameters are as follows: Graph Subtitle Method Limit Elapsed Time CPU Time
a2 ALL 2 3.999 seconds 5.53 seconds
a999 ALL 999 4.100 seconds 6.08 seconds
e2 EXHAUSTIVE 2 1.257 seconds 2.31 seconds
e999 EXHAUSTIVE 999 1.416 seconds 2.70 seconds

It was determined that the NH flag does NOT correlate to the total number of multi-maps in a file; instead this is the IH flag. Documentation is listed here(pg45), and highlighted below:

This is confirmed in the graphs of MM below: Ro_Clip split a2 ih Ro_Clip split a999 ih Ro_Clip split e2 ih Ro_Clip split e999 ih

slsevilla commented 3 years ago

option -r ALL 999 changed to -r EXHAUSTIVE 999

downstream needed to update NH/IH flag reading. completed in commit e234fef02beeeec58a3a822ebd576fe8ccca67f2 and acf7d707595e40d06bdabd6679024e8337823753