PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

questions about speed and ONT reads #704

Open mycecilia opened 4 years ago

mycecilia commented 4 years ago

Hi, here are my questions about running FALCON. I'm wondering whether anyone has tested these cases.

  1. I’m assembling ONT reads using FALCON and FALCON unzip. Should I first correct the raw ONT reads with Canu or another correction program?

  2. Do you have any suggestions for speeding up FALCON? We have three plant genomes of 350 Mb, 3 Gb, and 17 Gb. So far, for 118x ONT reads of the 350 Mb genome, it took two weeks to finish the 0-rawreads/las-merge-runs stage, which is far too slow.

  3. What is an acceptable minimum coverage for a diploid genome to adequately assemble primary contigs and haplotigs? I wonder whether 50x coverage would break the assembly into small contigs, or lose some haplotigs while maintaining the assembled N50. Does anyone have experience with lower-coverage ONT reads in FALCON?

Here is what I’m planning to do to speed up the assembly:

  a. Increase the DBsplit_option -s value from 100 to 200 to reduce the number of tan-run jobs (currently 5461 jobs with -s 100).

  b. Experiment with the njob and NPROC options, although I’m a little unsure how they interact. My local server has 48 CPUs and 560 GB of memory.
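To make plan (a) concrete, here is a rough sketch of the block-count arithmetic. It assumes that DBsplit -s is the target block size in Mbp and that every base is kept; it ignores the -x500 length filter, and it does not try to reproduce the 5461 job count quoted above, which depends on how FALCON groups blocks into jobs. The function name `estimate_blocks` is only illustrative.

```python
def estimate_blocks(genome_size_bp, coverage, block_size_mbp):
    """Upper-bound estimate of DBsplit blocks, assuming no reads are filtered out."""
    total_bp = genome_size_bp * coverage
    block_bp = block_size_mbp * 1_000_000
    # Integer ceiling division: a partially filled last block still counts.
    return -(-total_bp // block_bp)


GENOME_SIZE_BP = 350_000_000  # genome_size from the config
COVERAGE = 118                # coverage stated in the question

blocks_s100 = estimate_blocks(GENOME_SIZE_BP, COVERAGE, 100)
blocks_s200 = estimate_blocks(GENOME_SIZE_BP, COVERAGE, 200)
print("blocks at -s100:", blocks_s100)
print("blocks at -s200:", blocks_s200)
```

Doubling -s roughly halves the number of blocks. Pairwise block comparisons should drop faster than that, but each job then works on larger blocks and needs more memory and time.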

Thank you in advance for any suggestions.

Here is my current run_falcon.cfg file for the 118x corrected ONT reads of the 350 Mb genome:

[General]
input_fofn = input_run1.fofn
input_type = raw 

pa_DBsplit_option = -a -x500 -s200
ovlp_DBsplit_option = -a -x500 -s200

ovlp_HPCTANmask_option = 
pa_REPmask_code = 0,300;0,300;0,300

genome_size = 350000000
seed_coverage = 80

length_cutoff = -1
length_cutoff_pr = 1500

pa_HPCdaligner_option = -v -B4 -M16
pa_daligner_option =  -e.70 -l1000 -s100 
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 12

ovlp_HPCdaligner_option = -v -B4 -M32 
ovlp_daligner_option = -h60 -e.96 -l500 -s1000

overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20 --bestn 10

[job.defaults]
use_tmpdir = ./tmp
stop_all_jobs_on_failure = true
pwatcher_type = blocking
job_type = local
JOB_QUEUE=default
submit = /bin/bash -c "${JOB_SCRIPT}" > "${JOB_STDOUT}" 2> "${JOB_STDERR}"

[job.step.da]
NPROC=8

[job.step.la]
NPROC=8

[job.step.cns]
NPROC=12

[job.step.pda]
NPROC=8

[job.step.pla]
NPROC=8

[job.step.asm]
NPROC=24
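
As a quick sanity check of the NPROC values above against the 48-CPU, 560 GB server, the sketch below computes how many jobs of each step could run at once without oversubscribing cores. It assumes each job really uses NPROC cores and that memory is shared evenly among concurrent jobs. The helper name `concurrency_by_step` is hypothetical, and only the [job.step.*] sections are copied from the config.

```python
import configparser

# Only the per-step sections of run_falcon.cfg, copied from above.
CFG_TEXT = """
[job.step.da]
NPROC=8
[job.step.la]
NPROC=8
[job.step.cns]
NPROC=12
[job.step.pda]
NPROC=8
[job.step.pla]
NPROC=8
[job.step.asm]
NPROC=24
"""

TOTAL_CPUS = 48     # from the question
TOTAL_MEM_GB = 560  # from the question


def concurrency_by_step(cfg_text, total_cpus):
    """Map each job.step.* section to the number of jobs that fit in total_cpus."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    slots = {}
    for section in parser.sections():
        if section.startswith("job.step."):
            nproc = parser.getint(section, "NPROC")
            slots[section.split(".")[-1]] = total_cpus // nproc
    return slots


slots = concurrency_by_step(CFG_TEXT, TOTAL_CPUS)
for step, n_jobs in slots.items():
    mem_per_job = TOTAL_MEM_GB / n_jobs
    print(f"{step}: up to {n_jobs} concurrent jobs, about {mem_per_job:.0f} GB each")
```

Under these assumptions, a concurrent-job limit higher than the number printed for a step would oversubscribe the CPUs, while a lower limit would leave cores idle during that step.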