PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Running out of memory? Need help with settings. #344

Closed arthurmelobio closed 8 years ago

arthurmelobio commented 8 years ago

Hi Falcon team,

I’m trying to assemble a diploid plant genome with an estimated size of approximately 800 Mb. We have 116 SMRT cells, 12,190,085 raw reads (from the Secondary Analysis Results), and on average 120X coverage. In addition, the 30X longest reads average 18,600 bp. The parameters for our project were based on pb-jchin's suggestions for assembling a 1 Gb plant genome (FALCON GitHub discussion #308) and on two rounds of discussion with Roberto Lleras. We are working with Falcon 0.4.2 installed on iPlant/CyVerse. The job configuration file was set up as follows:

[General]
# list of files of the initial bas.h5/fasta files
input_fofn = input.fofn

# Input either raw fasta reads or pre-corrected preads
input_type = raw

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 18600

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 13600

# Use /tmp
use_tmpdir = True

# job_type = local
job_type = SLURM
jobqueue = normal
allocation = iPlant-Master
ncores = 16
sge_option_da  = -n %(ncores)s -t 05:00:00 -A %(allocation)s -p %(jobqueue)s
sge_option_la  = -n %(ncores)s -t 05:00:00 -A %(allocation)s -p %(jobqueue)s
sge_option_pda = -n %(ncores)s -t 05:00:00 -A %(allocation)s -p %(jobqueue)s
sge_option_pla = -n %(ncores)s -t 05:00:00 -A %(allocation)s -p %(jobqueue)s
sge_option_fc  = -n %(ncores)s -t 05:00:00 -A %(allocation)s -p %(jobqueue)s
sge_option_cns = -n %(ncores)s -t 05:00:00 -A %(allocation)s -p %(jobqueue)s

maxJobs = 30
pa_concurrent_jobs = %(maxJobs)s
ovlp_concurrent_jobs = %(maxJobs)s

pa_HPCdaligner_option = -M20 -dal128 -t18 -e0.75 -l100 -s500 -h1250
ovlp_HPCdaligner_option = -M20 -dal128 -t24 -e0.96 -l100 -s500 -h1250

pa_DBsplit_option = -x2500 -s200
ovlp_DBsplit_option = -x1500 -s200

falcon_sense_option = --output_multi --min_idt 0.7 --min_cov 4 --max_n_read 100 --n_core 9
cns_concurrent_jobs = %(maxJobs)s

overlap_filtering_setting = --max_diff 120 --max_cov 120 --min_cov 4 --bestn 10 --n_core 16

We have a couple of problems. CyVerse reports that the analysis completed, but the “2-asm-falcon” folder is empty. Unfortunately, the CyVerse analysis does not return the intermediate folders such as “0-rawreads” and “1-preads_ovl”, so I cannot check files like “raw_reads.db” or “prepare_rdb.sh.log”, for example.

However, looking through the error file (*.err), which is a huge file (27.8 MB), I found some warnings and one error near the beginning:

{
  "status" : "success",
  "message" : null,
  "version" : "2.1.6-r34db685",
  "result" : {
    "id" : "3650947532579925530-242ac1112-0001-007",
    "name" : "f7b10a12-d404-4e61-a9d7-fa229636d122_0001",
    "owner" : "arthurmelo",
    "appId" : "FALCON-0.4.2u3",
    "executionSystem" : "workflow.stampdede.tacc.utexas.edu",
    "batchQueue" : "default",
    "nodeCount" : 1,
    "processorsPerNode" : 1,
    "memoryPerNode" : 1.0,
    "maxRunTime" : "72:00:00",
    "archive" : true,
    "retries" : 0,
    "localId" : "6691",
    "created" : "2016-04-25T10:45:57.000-05:00",
    "archivePath" : "/arthurmelo/FalconOUT/FALCON_BerberyGenomeAssembly-2016-04-25-15-45-55.8",
    "archiveSystem" : "data.iplantcollaborative.org",
    "outputPath" : "arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001",
    "status" : "RUNNING",
    "submitTime" : "2016-04-25T12:20:19.000-05:00",
    "startTime" : "2016-04-25T12:20:20.950-05:00",
    "endTime" : null,
    "inputs" : {
      "fastas" : "agave://data.iplantcollaborative.org/arthurmelo/PacbioData"
    }
.
.
.
[ERROR]PypeTaskBase failed:
{'__class__.__name__': 'PypeThreadTaskBase',
 '_status': 'TaskInitialized',
 'inputDataObjs': {'rdb_build_done': PypeLocalFile('file://localhost/scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done', '/scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done')} …

Looking in more detail at the error “PypeTaskBase failed: PypeThreadTaskBase”, I found various discussions on GitHub, such as thread #173, where it seems the -t16 daligner parameter is too small for big genomes. However, in FALCON thread #308, pb-jchin suggests that -t18 is enough to assemble a 1 Gb plant genome. The author of thread #173 also said that when he decreased “length_cutoff” (the default is 12000) the issue was resolved. We are using 18600 because this is our 30X longest read length, and I don’t know whether this is the source of the error. It seems to me that daligner is not producing the .las files needed to create the preads for the overlap assembly.

In addition, at the end of the error file (*.err) there is another error message:

[ERROR]Any exception caught in RefreshTargets() indicates an unrecoverable error. Shutting down...
Traceback (most recent call last):
  File "/work/0004/iplant/public/apps/falcon/0.4.2/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 522, in refreshTargets
    rtn = self._refreshTargets(task2thread, objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
  File "/work/0004/iplant/public/apps/falcon/0.4.2/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 738, in _refreshTargets
    raise TaskFailureError("Counted %d failure(s) with 0 successes so far." %failedJobCount)
TaskFailureError: 'Counted 30 failure(s) with 0 successes so far.' 

Regarding this last error (RefreshTargets() indicates an unrecoverable error. Shutting down...), I found in discussion #300 that the flag “local_match_count_threshold” could be the cause of the problem, but we do not have this flag in our configuration file.

There are some interesting points to me:

  1. When I run the same configuration file using only one SMRT cell, FALCON works normally.
  2. The time between the messages “your job started to run” and “your job was completed” is about 12 hours. It seems to me that daligner tries to run but cannot finish due to insufficient memory.

The *.out file produces the following message:

Welcome to the Stampede Supercomputer              
--> Verifying valid submit host (vlogin01)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/0004/iplant)...OK
--> Verifying availability of your work dir (/work/0004/iplant)...OK
--> Verifying availability of your scratch dir (/scratch/0004/iplant)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (normal)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (iPlant-Master)...OK
Submitted batch job 6942872
['/work/0004/iplant/public/apps/falcon/0.4.2/bin/fc_run.py', 'job.cfg']

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Please wait for all threads / processes to terminate !
! Also, maybe use 'ps' or 'qstat' to check all threads,  !
! processes and/or jobs are terminated cleanly.          !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Again, it seems to me that CyVerse does not assign enough memory to assemble this whole dataset (116 SMRT cells, ~120X). However, I don’t know whether this assumption is correct, and I also don’t know whether the two error messages could both be caused by lack of memory and whether they are related to each other. Any suggestions or thoughts would be appreciated.

Thanks a lot for any help!

pb-jchin commented 8 years ago

For machines with less memory, one can try a smaller "-s" in DBsplit to reduce the memory footprint; however, it will generate more intermediate files. This is part of the daligner design and I have not found a way to work around it. Also, instead of using -t to reduce repetitive k-mers, one can use -M in pa_HPCdaligner_option to cap memory usage. Since you already set that, did you check from the logs in sge_log/ whether daligner uses much more than 20 GB? It seems that you have good read length and high coverage; in that case, you might try making the alignment parameters less sensitive (by changing -h, -k, -l and -w in the daligner options) and so use less computation.
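
As a hedged illustration of how one might check actual daligner memory use (the block names and the availability of GNU time as /usr/bin/time are assumptions for this sketch, not something from this thread), a single block comparison can be re-run under time -v and the peak resident memory read from its report:

# illustrative only: measure peak memory of one daligner block-vs-block comparison
/usr/bin/time -v daligner -M20 -t18 -e0.75 -l100 -s500 -h1250 raw_reads.1 raw_reads.2 2> daligner_mem.log
grep "Maximum resident set size" daligner_mem.log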

pb-cdunn commented 8 years ago

> For machines with less memory, one can try a smaller "-s" in DBsplit to reduce the memory footprint; however, it will generate more intermediate files.

To keep the number of post-daligner files roughly constant, increase -dal (number of daligner block-comparisons per call) twice as much as you reduce -s (number of MB per block). We merge all the first level of .las files immediately after running daligner, so the number of files before the merge step would remain roughly constant. Run with lower concurrency to reduce the number of simultaneous .las files. And if you run only 1 daligner job per machine, then you can drop -t and use -M0 to use all available memory and nothing more.
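
As a concrete illustration of that trade-off (hypothetical numbers derived from the config posted above, not a recommendation):

# settings from the config above
pa_DBsplit_option = -x2500 -s200
pa_HPCdaligner_option = -M20 -dal128 -t18 -e0.75 -l100 -s500 -h1250

# hypothetical lower-memory variant: halve the block size (-s) and double the
# number of block-comparisons per daligner call (-dal), so the number of .las
# files before the merge step stays roughly the same
pa_DBsplit_option = -x2500 -s100
pa_HPCdaligner_option = -M20 -dal256 -t18 -e0.75 -l100 -s500 -h1250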

Later, wait for an update to FALCON, and you'll be able to run the jobs on local disk (not currently implemented, though it was available in an earlier version), so the cost of the temporary .las files will be small.

But it's wise to go into a single job-directory and fiddle with the daligner settings as pb-jchin describes. You can run a single job yourself, "manually", and you can see how different settings affect the numbers of alignments in the resulting .las files. It's good to get some experience with smaller genomes first.
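
A minimal sketch of such a manual experiment, assuming the DAZZ_DB/DALIGNER tools are on your PATH (the job-directory, script, and .las file names below are purely illustrative; the exact names depend on the FALCON and daligner versions):

# illustrative only: re-run one daligner job and count the alignments it produced
cd 0-rawreads/job_0000            # one daligner job directory (hypothetical name)
bash ./job_0000.sh                # or re-run the daligner line from the script with edited -h/-k/-l/-w
# once the job's merged .las exists, compare alignment counts between settings:
LAshow ../raw_reads raw_reads.1.raw_reads.2.las | wc -l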

arthurmelobio commented 8 years ago

Hi Jason Chin and Christopher Dunn, thanks to both of you for the quick feedback. I appreciate it.

In the sge_log directory I have:

cd /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads
cd /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads
trap 'touch /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done.exit' EXIT
trap 'touch /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done.exit' EXIT
ls -il prepare_rdb.sub.sh
ls -il prepare_rdb.sub.sh
231260396382642738 -rwxr-xr-x 1 iplant G-802086 495 Apr 25 21:28 prepare_rdb.sub.sh
hostname
hostname
c426-601.stampede.tacc.utexas.edu
ls -il prepare_rdb.sub.sh
ls -il prepare_rdb.sub.sh
231260396382642738 -rwxr-xr-x 1 iplant G-802086 495 Apr 25 21:28 prepare_rdb.sub.sh
time /bin/bash ./prepare_rdb.sub.sh
/bin/bash ./prepare_rdb.sub.sh
fasta2DB -v raw_reads -f/scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/input.fofn
fasta2DB -v raw_reads -f/scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/input.fofn

DBsplit -x2500 -s200 raw_reads
DBsplit -x2500 -s200 raw_reads
LB=$(cat raw_reads.db | awk '$1 == "blocks" {print $3}')
cat raw_reads.db | awk '$1 == "blocks" {print $3}')
cat raw_reads.db | awk '$1 == "blocks" {print $3}'
cat raw_reads.db
awk '$1 == "blocks" {print $3}'
LB=582
HPCdaligner -M20 -dal128 -t18 -e0.75 -l100 -s500 -h1250 -H18600 raw_reads 1-$LB >| /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/run_jobs.sh
HPCdaligner -M20 -dal128 -t18 -e0.75 -l100 -s500 -h1250 -H18600 raw_reads 1-582
real    48m5.262s
user    7m20.127s
sys 13m22.532s
touch /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done
touch /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done
touch /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done.exit
touch /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done.exit

It seems to me that daligner is capped at 20 GB by the -M20 flag.

Unfortunately, using FALCON on iPlant/CyVerse I can't control the -dal, -s and -w flags. They are automatically set to -dal128 and -s500 in both "pa_HPCdaligner_option" and "ovlp_HPCdaligner_option", and to -s200 in both "pa_DBsplit_option" and "ovlp_DBsplit_option".

Regarding the daligner parameters you suggest changing (-h, -k, -l and -w) in order to decrease memory usage, what do you recommend? For example, we are using -h1250, -k18, -l100. To decrease memory usage, I could probably increase to -l2500 and -k20, keeping -h1250. Do you agree that these modifications could decrease memory consumption?

Thanks a lot. Arthur

pb-cdunn commented 8 years ago

You can't set pa_HPCdaligner_option yourself, or the others? If iPlant/CyVerse is setting these for you, then talk to the owners of that system. We cannot help with that.

I'll let the expert give advice on -h -k -l -w, but it can be difficult remotely. Plus, Jason is very busy.

arthurmelobio commented 8 years ago

Hi Christopher, thanks for your reply.

On "pa_HPCdaligner_option" and "oval_HPCdaligner_option" I can set some flags like -t, -e, -l and -h. Other like -M -dal, -s and -w I can't. On "pa_DBsplit_option" and "oval_DBsplit_option" I can control -x flag, while the -s I can't.

I will certainly contact the CyVerse team and the person who installed and hosts FALCON 0.4.2 on CyVerse. However, my question here is to understand how the daligner flags -t, -e, -l and -h affect the memory consumption of the analysis, so that I can set them appropriately for assembling a huge dataset (116 SMRT cells, ~120X coverage).

Thank you again.

pb-cdunn commented 8 years ago

-e is the error rate, or rather one minus the error rate (the expected identity between overlapping reads). It needs to match your data.

-t does the same job as -M, but in a different way: -M stores as many k-mer hits as fit in the given memory, while -t ignores any k-mer that occurs above a fixed frequency. Both are useful for filtering homopolymer runs, and somewhat useful for ignoring repeats. But mainly, they just reduce memory and runtime by ignoring k-mer hits that are probably spurious anyway.
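
As a rough illustration of the difference (the option values are taken from the config in this thread; only one of the two approaches would normally be used at a time):

# -t: drop any k-mer that occurs more than 18 times, regardless of memory
pa_HPCdaligner_option = -t18 -e0.75 -l100 -s500 -h1250
# -M: keep as many k-mer hits as fit in ~20 GB and let daligner choose the cutoff
pa_HPCdaligner_option = -M20 -e0.75 -l100 -s500 -h1250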

But I wouldn't use -t to control total memory. You need to work with your system owners.