cancerit / BRASS

Breakpoints via assembly - Identifies breaks and attempts to assemble rearrangements in whole genome sequencing data.
GNU Affero General Public License v3.0

how to optimize performance #103

Closed anoronh4 closed 3 years ago

anoronh4 commented 3 years ago

We are testing BRASS and don't see any memory options; meanwhile, all but one of our completed runs have an average memory usage of <1 GB. Just wondering if there is anything further we can do to use more memory and improve overall performance. We are using 2 CPUs. We plan to parallelize the first two steps, input and cover, but this only helps so much.

keiranmraine commented 3 years ago

The input and cover steps work in a single pass method with very little memory requirement.

Input can be run with 2 threads (or 2 parallel processes, index 1..2), while cover can be run "per-contig" (index 1..N), although we recommend the -limit option for both cover and assembly steps (this allows you to specify a maximum index and spread the work across multiple jobs rather than specifying multiple CPUs).

anoronh4 commented 3 years ago

sorry, i've been trying to make sense of this and i'm having a really hard time understanding the doc here: https://github.com/cancerit/BRASS/blob/dev/perl/bin/brass.pl#L460-L468

how does -index 1 map to 1,3,5 and -index 2 map to 2,4? what is being limited if it runs three different steps? i'm not as versed in reading perl, so guidance would be much appreciated.

keiranmraine commented 3 years ago

When you specify -limit N you are telling the code that the jobs are to be shared between 'N' processes. When each process starts it takes the total number of jobs that need to be completed and builds lists of work by round-robin allocation of "index" to lists 1-N. Index=1 takes list 1, index=2 list 2 up to index=N.

When -limit is in effect, the maximum value for -index is the limit itself, so with -limit N, index can be 1..N (each run as a separate command).

So if 5 jobs need to be completed and you specify -limit 2, the lists for each index are:

  1. [1,3,5]
  2. [2,4]

So in the example of the cover step with -limit 2 you would submit 2 independent jobs, one with -index 1 and one with -index 2.

If you specify -limit 3, the lists become:

  1. [1,4]
  2. [2,5]
  3. [3]
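
To make the round-robin allocation concrete, here is a minimal Python sketch of the behaviour described above. It is not the BRASS Perl code; the function name `jobs_for_index` and its parameters are hypothetical, chosen only to mirror the -limit and -index options.

```python
def jobs_for_index(total_jobs, limit, index):
    """Return the 1-based job numbers a process started with this index would take.

    Illustrative only: mimics round-robin allocation of jobs to lists 1..limit.
    """
    if not 1 <= index <= limit:
        raise ValueError("index must be in 1..limit")
    # Job `index` is the first one on this list; every `limit`-th job after it follows.
    return list(range(index, total_jobs + 1, limit))


if __name__ == "__main__":
    # Reproduce the two examples above for 5 jobs.
    for limit in (2, 3):
        for index in range(1, limit + 1):
            print(f"limit={limit} index={index}: {jobs_for_index(5, limit, index)}")
```

Each index only ever sees its own list, which is why the separate commands can run as independent jobs without coordinating with each other.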

For human I'd recommend a limit of 4 (therefore 4 jobs with incrementing index).

One of the main reasons for this is so that you don't have to deal with different numbers of jobs that need doing; you make one general decision up front.

Please note that the jobs will complete successfully if you specify a limit greater than the number of jobs, e.g. 2 jobs required but -limit 3 (index 1-3). All that would happen is that the list for index 3 would be empty, and it would just treat this as though it had completed all of its work:

  1. [1]
  2. [2]
  3. []
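
The empty-list case can be checked with a small Python sketch. Again, this is an illustration under the round-robin assumption described above, not the actual implementation; `allocate` is a hypothetical helper name.

```python
def allocate(total_jobs, limit):
    """Build all `limit` work lists at once by dealing jobs out round-robin."""
    lists = [[] for _ in range(limit)]
    for job in range(1, total_jobs + 1):
        # Job 1 goes to list 1, job 2 to list 2, ..., then wrap around.
        lists[(job - 1) % limit].append(job)
    return lists


if __name__ == "__main__":
    lists = allocate(total_jobs=2, limit=3)
    for index, work in enumerate(lists, start=1):
        status = "nothing to do" if not work else f"runs jobs {work}"
        print(f"index {index}: {status}")
    # Every job is still covered exactly once, whatever the limit.
    assert sorted(job for work in lists for job in work) == [1, 2]
```

Because every job lands on exactly one list, picking a limit that is too high only costs an idle job submission, not missed work.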

anoronh4 commented 3 years ago

thanks that's pretty clear now! i've already tested it out and it works pretty well, REALLY cut down on my processing time and allows me to lower cpu per job.