PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

How to restart Falcon after filling the memory? #336

Open · lauraalazar opened 8 years ago

lauraalazar commented 8 years ago

I needed to stop FALCON because it was filling up the machine's memory. Here is my fc_run.cfg (for a genome of size ~150 Mb and a computer with 500 GB of memory):

```ini
[General]
job_type = local

# list of files of the initial subread fasta files
input_fofn = input.fofn

input_type = raw
#input_type = preads

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 12000

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 12000

# Cluster queue settings
sge_option_da =
sge_option_la =
sge_option_pda =
sge_option_pla =
sge_option_fc =
sge_option_cns =

# concurrency settings
pa_concurrent_jobs = 48
cns_concurrent_jobs = 48
ovlp_concurrent_jobs = 48

# overlapping options for daligner
pa_HPCdaligner_option = -v -dal128 -e.70 -l1000 -s1000 -M480
ovlp_HPCdaligner_option = -v -dal128 -h60 -e.96 -l500 -s1000 -M480

pa_DBsplit_option = -x500 -s400
ovlp_DBsplit_option = -x500 -s400

# error correction consensus options
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 6

# overlap filtering options
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20 --bestn 10
```

I would really appreciate comments on: 1) How can I resume FALCON to finish what it was doing (the folder 1-preads_ovl is still empty)? 2) What set of parameters would avoid filling the memory? More generally, do the parameters I'm using make sense?

Thanks!

mseetin commented 8 years ago

Your issues are almost certainly the number of concurrent jobs and -M. The daligner jobs use 4 cores each by default, so unless you have 192 cores in this system, 48 concurrent jobs is probably too high to begin with. Also, your -M480 flag lets daligner use as much as 480 GB per job. If you use -M32, which we use internally most of the time, then you'd need to turn down your concurrent jobs to under 15. That may still be too taxing on your system, depending on its file I/O capabilities, so if you're still struggling to get this to run, try turning down the number of concurrent jobs some more.

Also, don't use --min_cov 20 unless you're using an abnormally large amount of coverage in this assembly. We most often use 2, but if you want to be more cautious about misassembly, use 3-5.
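Concretely, the adjusted lines of fc_run.cfg might look something like the sketch below (the concurrency value of 12 is an assumption for illustration; tune it to your actual core count and I/O capacity):

```ini
# concurrency: daligner uses 4 cores per job by default, and with -M32 each
# job can take up to 32 GB, so ~12 concurrent jobs stays well under 500 GB
pa_concurrent_jobs = 12
cns_concurrent_jobs = 12
ovlp_concurrent_jobs = 12

# cap daligner memory at 32 GB per job instead of 480 GB
pa_HPCdaligner_option = -v -dal128 -e.70 -l1000 -s1000 -M32
ovlp_HPCdaligner_option = -v -dal128 -h60 -e.96 -l500 -s1000 -M32

# --min_cov 3 as a cautious choice (2 is the most common value)
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 3 --bestn 10
```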


pb-cdunn commented 8 years ago

Beyond what @mseetin posted, I'm curious about your daligner run. There should be a log-file corresponding to it. Find which log-file is not empty, something like 0-rawreads/job_0000/rj_0000.sh.log, and please post the contents. I'm looking for something like:

```
Building index for raw_reads.1
...
Comparing raw_reads.1 to raw_reads.1

   Capping mutual k-mer matches over 10000 (effectively -t100)
   Hit count = 233,189
   Highwater of 0.01Gb space
```

That will tell us your effective -t and approximately how much memory you might use. Because HPCdaligner job construction is not quite even, a few jobs (especially job_0000) will need less memory than the rest, but this will give us an idea.
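If it helps, here is a quick shell sketch for locating the non-empty logs and pulling out those lines (assuming the 0-rawreads/job_*/rj_*.sh.log layout shown in this thread):

```sh
# list the daligner logs that actually have content
find 0-rawreads -name 'rj_*.sh.log' -size +0c

# show the effective -t cap, hit count, and peak memory from each log
grep -A2 'Capping mutual' 0-rawreads/job_*/rj_*.sh.log
```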

Also, please use proper GitHub-flavored Markdown (e.g. fenced code blocks) when posting configs and logs.

lauraalazar commented 8 years ago

Thank you very much for the replies! This is the output of job_001d/rj_001d.sh.log:

```
trap 'touch /scratch/lsalazar/pacbio_falcon/0-rawreads/job_001d/job_001d_done.exit' EXIT
+ trap 'touch /scratch/lsalazar/pacbio_falcon/0-rawreads/job_001d/job_001d_done.exit' EXIT
cd /scratch/lsalazar/pacbio_falcon/0-rawreads/job_001d
+ cd /scratch/lsalazar/pacbio_falcon/0-rawreads/job_001d
hostname
+ hostname
bigshot
date
+ date
Wed Apr 13 15:22:09 BST 2016
time daligner -v -H12000 -e0.7 -s1000 -M480 raw_reads.25 raw_reads.1 raw_reads.2 raw_reads.3 raw_reads.4 raw_reads.5 raw_reads.6 raw_reads.7 raw_reads.8 raw_reads.9 raw_reads.10 raw_reads.11 raw_reads.12 raw_reads.13 raw_reads.14 raw_reads.15 raw_reads.16 raw_reads.17 raw_reads.18 raw_reads.19 raw_reads.20 raw_reads.21 raw_reads.22 raw_reads.23 raw_reads.24 raw_reads.25
+ daligner -v -H12000 -e0.7 -s1000 -M480 raw_reads.25 raw_reads.1 raw_reads.2 raw_reads.3 raw_reads.4 raw_reads.5 raw_reads.6 raw_reads.7 raw_reads.8 raw_reads.9 raw_reads.10 raw_reads.11 raw_reads.12 raw_reads.13 raw_reads.14 raw_reads.15 raw_reads.16 raw_reads.17 raw_reads.18 raw_reads.19 raw_reads.20 raw_reads.21 raw_reads.22 raw_reads.23 raw_reads.24 raw_reads.25

Building index for raw_reads.25

 Kshift=28
 BSHIFT=8
 TooFrequent=2147483647
 (Kshift-1)/BSHIFT + (TooFrequent < INT32_MAX)=3
 sizeof(KmerPos)=16
 nreads=50886
 Kmer=14
 block->reads[nreads].boff=419483018
 kmers=418770614
 sizeof(KmerPos)*(kmers+1)=6700329840
 Allocated 418770615 of 16 (6700329840 bytes) at 0x2b7ecc862010
   Kmer count = 418,770,614
   Using 12.48Gb of space
   Index occupies 6.24Gb

Building index for raw_reads.1

 Kshift=28
 BSHIFT=8
 TooFrequent=2147483647
 (Kshift-1)/BSHIFT + (TooFrequent < INT32_MAX)=3
 sizeof(KmerPos)=16
 nreads=51871
 Kmer=14
 block->reads[nreads].boff=419485493
 kmers=418759299
 sizeof(KmerPos)*(kmers+1)=6700148800
 Allocated 418759300 of 16 (6700148800 bytes) at 0x2b7ecc862010
   Kmer count = 418,759,299
   Using 12.48Gb of space
   Index occupies 6.24Gb

Comparing raw_reads.25 to raw_reads.1

   Capping mutual k-mer matches over 10000 (effectively -t100)
   Hit count = 4,256,934,003
   Highwater of 133.11Gb space
```

And would I need to start FALCON again with the corrected -M and --min_cov, or can I resume from what it has run so far?

pb-cdunn commented 8 years ago

```
   Capping mutual k-mer matches over 10000 (effectively -t100)
   Hit count = 4,256,934,003
   Highwater of 133.11Gb space
```

That's what I thought. If you really want such a high -M480, you need to bump MAXGRAM in DALIGNER/filter.c from 10000 to 160000 and rebuild. I don't know whether that will work, and I'm not sure it's what you want anyway.
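For reference, that change plus a rebuild might look something like the sketch below (assumptions: MAXGRAM is a #define in filter.c and the stock daligner Makefile targets apply; check your checkout before running):

```sh
# raise the mutual k-mer match cap from 10000 to 160000, then rebuild daligner
sed -i 's/MAXGRAM *10000/MAXGRAM 160000/' DALIGNER/filter.c
make -C DALIGNER clean all
```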

Your goal is to reduce the total inode count, right? Try running one job at a time (*_concurrent_jobs = 1, or at least something much lower than 48). Each daligner job includes top-level .las merging, so that should keep the count low. But there are other files for each job.

I suspect that you are using too many reads. Try:

```ini
genome_size = 150000000
seed_coverage = 25
length_cutoff = -1
```

You will see a file called 0-rawreads/length_cutoff, which tells you the calculated value. Could you post that?
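Once a run with length_cutoff = -1 gets going, the computed value can be checked with something like:

```sh
cat 0-rawreads/length_cutoff   # the seed-read cutoff FALCON calculated
```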

lauraalazar commented 8 years ago

Thanks again! I will need to digest all this information, but for now I don't see a file called 0-rawreads/length_cutoff. These are all the files and sub-folders in the 0-rawreads/ folder:

```
input.fofn  job_0003  job_0007  job_000b  job_000f  job_0013  job_0017  job_001b        prepare_rdb.sh.log
job_0000    job_0004  job_0008  job_000c  job_0010  job_0014  job_0018  job_001c        raw_reads.db
job_0001    job_0005  job_0009  job_000d  job_0011  job_0015  job_0019  job_001d        rdb_build_done
job_0002    job_0006  job_000a  job_000e  job_0012  job_0016  job_001a  prepare_rdb.sh  run_jobs.sh
```

pb-cdunn commented 8 years ago

You will not see length_cutoff unless you set it to -1 in your config. You might also need to update FALCON for the new auto-cutoff-calculation feature.
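A rough sketch of updating a source checkout (this assumes a git-based install; if you installed via FALCON-integrate or a package, follow that route instead):

```sh
cd FALCON
git pull
python setup.py install
```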