PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Getting fastest speed with job_type local (what options affect speed, concurrent_jobs, --n_core, -M, -T, length_cutoff, -s) #478

Open chelkyl opened 7 years ago

chelkyl commented 7 years ago

Dear devs, I don't believe this is a duplicate question; perhaps it could go in the FAQ. I am using FALCON-integrate 1.7.5. As the title states, I would like to know how to run FALCON quickly on a large genome (575 Mbp, diploid). I am using job_type = local because SGE does not work on our system. The machines available have either 24 cores and 504 GB of memory or 64 cores and 2 TB of memory. I have a vague idea that length_cutoff, the concurrent_jobs settings, the daligner options -s, -M and -T, and falcon_sense_option --n_core affect CPU and memory usage and therefore speed, but I would like a better understanding of what they actually do. I have read the wiki, the manual, and Gene Myers's blog but was unable to work them out.

I attempted a run with a config file of mostly defaults, but it hit the 124-hour walltime and was terminated; I believe it did not make it out of the initial raw-read stage. I submitted it with a PBS script requesting 24 cpus and 500 GB of memory, roughly like the sketch below. After the sketch is the config file I used for the 24-core machine.
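
The PBS request looked roughly like this (a sketch only; the job name and the FALCON command, assumed here to be fc_run.py fc_run.cfg, may differ from what was actually run):

#!/bin/bash
#PBS -N falcon_run          # job name (assumed)
#PBS -l nodes=1:ppn=24      # 24 cpus on one node
#PBS -l mem=500gb           # 500 GB of memory
#PBS -l walltime=124:00:00  # the 124-hour walltime that was hit

cd "$PBS_O_WORKDIR"
fc_run.py fc_run.cfg        # assumed FALCON-integrate entry point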

[General]
job_type=local
input_fofn = input.fofn
input_type = raw

# The length cutoff used during the error correction process for seed reads used for initial mapping
length_cutoff = 1000
# The length cutoff used for the later assembly overlapping step for seed reads used for pre-assembly
length_cutoff_pr = 1000

sge_option_da  =
sge_option_la  =
sge_option_pda =
sge_option_pla =
sge_option_fc  =
sge_option_cns =

# DALIGNER uses 4 threads per job by default so use cpus / 4
pa_concurrent_jobs   = 6
ovlp_concurrent_jobs = 6

# -dal is deprecated so use -B
# -M should be total mem / # concurrent_jobs
pa_HPCdaligner_option   = -v -B4 -t16      -e0.70 -l500 -s500 -M15
ovlp_HPCdaligner_option = -v -B4 -t32 -h60 -e0.96 -l500 -s500 -M15

pa_DBsplit_option   = -x500 -s50
ovlp_DBsplit_option = -x500 -s50

# guessing with --n_core
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 4
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2 --n_core 4
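
For reference, the back-of-the-envelope budget behind those numbers on the 24-core / 504 GB machine (my own rough assumptions, not documented defaults):

# 24 cores / ~4 daligner threads per job (default -T4)  ->  pa_concurrent_jobs = ovlp_concurrent_jobs = 6
# 6 concurrent daligner jobs x -M15 GB each             ->  up to ~90 GB for the daligner jobs, well under 504 GB
# consensus: cns_concurrent_jobs x --n_core             ->  should also stay <= 24 cores (cns_concurrent_jobs not set above)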

Thank you for your assistance.

rhallPB commented 7 years ago

The parameters look reasonable. For a specific assembly, the only way to really optimize for speed is to monitor the load on the computer while the assembly is running. By far the most intensive stage is the initial overlap, so you should probably concentrate on that first. Given the compute resources you have and the size of the genome, it will likely take a long time to assemble on your system. For the 124-hour run, what was the load like on the machine? Were you using close to the maximum CPU / memory? In the 0-rawreads directory, how many of the job???? tasks completed (`find -name "job????_done" | wc -l`)? How many total tasks are there to run (`cat ./0-rawreads/run_jobs.sh | grep daligner | wc -l`)?
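
Spelled out as commands, run from the top of the run directory (assuming the standard 0-rawreads layout described above):

cd 0-rawreads
# daligner tasks finished so far
find . -name "job????_done" | wc -l
# total daligner tasks scheduled for this stage
cat run_jobs.sh | grep daligner | wc -l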

chelkyl commented 7 years ago

Unfortunately the run directory was cleared for other test runs, but here is what I remember: the load seemed okay; the PBS log for the job said it used all 24 cores and close to the maximum memory. Since the directory was cleared, I cannot say how many job*done files there were. One of the older runs (killed early) used this config file:

[General]
job_type=local
input_fofn = input.fofn
input_type = raw

# The length cutoff used during the error correction process for seed reads used for initial mapping
length_cutoff = 1000
# The length cutoff used for the later assembly overlapping step for seed reads used for pre-assembly
length_cutoff_pr = 1000

sge_option_da =
sge_option_la =
sge_option_pda =
sge_option_pla =
sge_option_fc =
sge_option_cns =

pa_concurrent_jobs = 23
ovlp_concurrent_jobs = 23

pa_HPCdaligner_option   = -v -B4 -t16 -e0.70 -l500 -s500
ovlp_HPCdaligner_option = -v -B4 -t32 -h60 -e0.96 -l500 -s500

pa_DBsplit_option   = -x500 -s50
ovlp_DBsplit_option = -x500 -s50

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 2
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2

The total number of daligner tasks with this config file (counted with `cat ./0-rawreads/run_jobs.sh | grep daligner | wc -l`) was 65160.

I suppose I should calculate these values, or at least make educated guesses, but I need more information on how to do that.

  1. Can you tell me what falcon_sense_option --n_core is used for and how I should set it?
  2. Can you also tell me what -s in the *daligner_option settings does and what effect it has? I did read the documentation but could not make sense of it.

Finally,

  1. (Once I can get past the first stage) Could I run just the initial overlapping stage and then, when it is finished, start a new job to continue with the rest of the stages? If so, how would I "resume" FALCON after it has been terminated?

Thanks for the quick response!

rhallPB commented 7 years ago

The number of jobs being created is probably too large. The -s option in the DBsplit_option parameters sets the size of the sequence blocks the database is split into; because the initial alignment compares blocks against each other, the number of daligner jobs grows roughly with the square of the number of blocks, so -s should be increased. Try starting with 200 (megabases).

The falcon_sense --n_core is the number of threads used in the calculation of consensus after the initial alignment. I notice you are missing the cns_concurrent_jobs option; cns_concurrent_jobs x --n_core dictates the total number of cores used during the consensus step.

The daligner -s option is the spacing of the trace points stored in the alignment output; it does not dramatically affect performance, and it is safe to use the default of 100.

FALCON will automatically restart from where it left off: simply delete the mypwatcher directory and run the same command again, provided there are no conflicting parameter settings between the two runs. For example, changing the -s parameter of the DBsplit option would not be compatible with restarting in the middle of the initial alignment.
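
As a concrete sketch of the restart step (assuming the run was originally launched with something like fc_run.py fc_run.cfg; substitute whatever command you actually used):

# from the directory where the original run was started
rm -rf mypwatcher        # remove the stale process-watcher state
fc_run.py fc_run.cfg     # same command, same (compatible) config; completed tasks are skipped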

chelkyl commented 7 years ago

Thank you! Your answer is very helpful in correcting my understanding of the options. Some more questions for clarification:

  1. Should cns_concurrent_jobs be set to the same number of concurrent jobs as pa_concurrent_jobs and ovlp_concurrent_jobs, or can it be a higher number?
  2. Do I set cns_concurrent_jobs based on the --n_core number in falcon_sense_option (so cns_concurrent_jobs = total cpus / falcon_sense_option n_core) or is it unrelated?
  3. If falcon_sense_option --n_core is unrelated, should I set this to the total number of cpus available (24) or should I keep it at 2 or 4?
  4. Does overlap_filtering_setting also have the --n_core setting? If so, does it specify the total number of cores to use in the overlapping step and is it related to ovlp_concurrent_jobs (see question 2)?

Your answer really clarifies why my test runs did not improve their run time despite my changing length_cutoff, -M, concurrent_jobs, and --n_core. I will continue testing with a subset of the data to see how higher -s values in DBsplit affect speed.

rhallPB commented 7 years ago

Questions 1, 2, and 3 are all related; basically yes: cns_concurrent_jobs x --n_core (from falcon_sense_option) = total_cpus.

  4. Overlap filtering is a single job and runs on a single node; there is no concurrent option. As long as the --n_core setting is less than the number of slots on a node, it isn't a problem. It isn't a computationally expensive step; --n_core 12 is a safe number.
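
For example, on the 24-core node mentioned earlier, that rule works out to something like this (illustrative numbers, not a recommendation):

# consensus:         cns_concurrent_jobs = 6  with  falcon_sense_option ... --n_core 4  ->  6 x 4 = 24 cores
# overlap filtering: overlap_filtering_setting ... --n_core 12                          ->  one job using 12 of the 24 slots
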
chelkyl commented 7 years ago

I see. Final questions: Is there a method for choosing DBsplit -s? Or is it guesswork? For example, is there a relation to the size of the genome? I was testing with smaller subsets of data and when I compare -s 100 and -s 500, -s 500 is over 3 times faster. So since our genome is about 575 Mbp, should I use -s 500 or is it unrelated?

chelkyl commented 7 years ago

Along with my previous questions, I have some more to ask. I have been running many tests on small subsets of data and am seeing strange results.

Case 1: Setting -M sometimes does not seem to affect the amount of memory used, and the job is killed because it exceeds the amount of memory allowed. In other tests, however, setting -M to a lower number makes the run twice as fast.

Case 2: Requesting 24 cpus and 64 GB of memory and specifying overlap_filtering_setting --n_core 4 runs successfully in 00:11:09 and uses 58.2 GB of memory, yet setting it to 12 with the same requested resources makes it use 70.9 GB before it is killed. See test 4 and test 8 in the table below.

Case 3: I am running into an unknown error with settings similar to the following config file.

[General]
job_type=local
input_fofn = input.fofn
input_type = raw

length_cutoff = 1000
length_cutoff_pr = 1000

sge_option_da  =
sge_option_la  =
sge_option_pda =
sge_option_pla =
sge_option_fc  =
sge_option_cns =

pa_concurrent_jobs   = 6
cns_concurrent_jobs  = 6
ovlp_concurrent_jobs = 6

pa_HPCdaligner_option   = -v -B4 -t16      -e0.70 -l500 -s100 -M10
ovlp_HPCdaligner_option = -v -B4 -t32 -h60 -e0.96 -l500 -s100 -M10

pa_DBsplit_option   = -x500 -s500
ovlp_DBsplit_option = -x500 -s500

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 4
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2 --n_core 4

I requested 24 cpus and 64 GB. The job exits after less than 2 minutes. Here is a table of some of the variations I tried and the exit status of each.

| Test | length_cutoff | concurrent jobs | daligner -M | DBsplit -s | falcon_sense --n_core | overlap_filt --n_core | Status | Error / run time |
| ---- | ------------- | --------------- | ----------- | ---------- | --------------------- | --------------------- | ------ | ---------------- |
|    1 |          1000 |               6 |          10 |        100 |                     4 |                     4 |   good |         01:38:26 |
|    2 |          1000 |               6 |          10 |        200 |                     4 |                     4 |   good |         00:49:13 |
|    3 |          1000 |               6 |          10 |        300 |                     4 |                     4 |   fail |         mem over |
|    4 |          1000 |               6 |           8 |        300 |                     4 |                     4 |   good |         00:11:09 |
|    5 |          1000 |               5 |          10 |        300 |                     4 |                     4 |   good |         00:25:16 |
|    6 |          1000 |               6 |           8 |        300 |                     4 |                    48 |   fail |         mem over |
|    7 |          1000 |               6 |           8 |        300 |                     4 |                    24 |   fail |         mem over |
|    8 |          1000 |               6 |           8 |        300 |                     4 |                    12 |   fail |         mem over |
|    9 |          1000 |               6 |           8 |        300 |                     4 |                     1 |   good |         00:11:20 |
|   10 |          1000 |               6 |          10 |        500 |                     4 |                     4 |   fail |          unknown |
|   11 |          1000 |               6 |          10 |        500 |                     4 |                    12 |   fail |          unknown |
|   12 |          1000 |               4 |          15 |        500 |                     6 |                     6 |   fail |         mem over |
|   13 |          1000 |               4 |          10 |        500 |                     6 |                     6 |   fail |          unknown |
|   14 |          1000 |               4 |          15 |        500 |                     6 |                    12 |   fail |         mem over |
|   15 |          1000 |               4 |          10 |        500 |                     6 |                    12 |   fail |         mem over |
|   16 |          2000 |               6 |           8 |        300 |                     4 |                     4 |   fail |         mem over |
|   17 |          2000 |               5 |          10 |        300 |                     4 |                     4 |   fail |         mem over |
|   18 |          2000 |               4 |          10 |        300 |                     6 |                     6 |   fail |         mem over |
|   19 |          2000 |               6 |          10 |        500 |                     4 |                     4 |   fail |          unknown |
|   20 |          2000 |               4 |          15 |        500 |                     6 |                     6 |   fail |         mem over |
|   21 |          2000 |               4 |          10 |        500 |                     6 |                     6 |   fail |          unknown |
|  ... |               |                 |             |            |                       |                       |        |                  |

This brings up the following questions:

  1. How should -M in *daligner_option be calculated and how does it affect speed?
  2. Do you have an explanation for Case 2? I have config files and run logs if you would like to see them.
  3. Do you know what is happening in Case 3?
  4. Is there documentation for all the exit codes that FALCON uses? For example, "[ERROR]Task Fred{'URL': 'task://localhost/d_0000_raw_reads'} failed with exit-code=256"

Thank you for your time.