PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Falcon assembly #308

Open yilunhuangyue opened 8 years ago

yilunhuangyue commented 8 years ago

Hello, I have installed FALCON and tried the example from https://github.com/PacificBiosciences/FALCON/wiki/Setup%3A-Running

The program ended without an error message, but the log file ends with:

2016-03-16 02:50:02,837 - pypeflow.controller - INFO - _refreshTargets() finished with no thread running and no new job to submit
2016-03-16 02:50:02,837 - pypeflow.controller - INFO - _refreshTargets() finished with no thread running and no new job to submit

The 2-asm-falcon directory contains files such as a_ctg_all.fa and p_ctg.fa, but I am not sure whether these are the assembly result.

Also, I want to assemble a plant genome of about 400 Mb. Which parameters should be changed in the cfg file?

Thanks a lot for any help!

pb-jchin commented 8 years ago

There are no universal parameters for all genomes (yet?). The best choices for an assembly depend on the read length distribution and the complexity of the repeats in the genome. If you have more than 20x coverage in reads longer than 10 kb, I would suggest using a length cutoff of about 10 kb. If not, you will need to include more reads (that is, use a lower cutoff).
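To check the 20x rule of thumb against your own data, something like the following minimal sketch may help. It is not part of FALCON; the function name `coverage_above` and the toy read lengths are made up for illustration, and it assumes you have already extracted read lengths (in bases) by some other means.

```python
def coverage_above(read_lengths, cutoff, genome_size):
    """Fold coverage contributed by reads at least `cutoff` bases long.

    Counting reads with length >= cutoff (rather than > cutoff) is a
    choice made for this sketch.
    """
    total_bases = sum(n for n in read_lengths if n >= cutoff)
    return total_bases / genome_size


# Hypothetical toy data: six read lengths and a 100 kb "genome".
reads = [12000, 8000, 15000, 3000, 11000, 9500]
cov = coverage_above(reads, cutoff=10000, genome_size=100000)
print(f"coverage in reads >= 10 kb: {cov:.2f}x")
```

If the number reported for your real data is well below 20x, that would suggest trying a lower cutoff.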

pb-jchin commented 8 years ago

Here is a parameter set I used to assemble a 1 Gb plant genome, for your reference. It won't work if you simply copy and paste it, but I hope it gets you started.

[General]
# list of files of the initial bas.h5 files
input_fofn = input.fofn
#input_fofn = preads.fofn

input_type = raw
#input_type = preads

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 10000

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 10000

# You need to change these distributed-computation parameters to fit your cluster configuration
sge_option_da = -pe smp 4 -q bigmem
sge_option_la = -pe smp 20 -q bigmem
sge_option_pda = -pe smp 6 -q bigmem
sge_option_pla = -pe smp 16 -q bigmem
sge_option_fc = -pe smp 24 -q bigmem
sge_option_cns = -pe smp 8 -q bigmem

pa_concurrent_jobs = 192
cns_concurrent_jobs = 192
ovlp_concurrent_jobs = 192

# Here is a set of parameters allowing faster computation at the cost of lower sensitivity for read overlaps
pa_HPCdaligner_option =  -v -dal128  -e0.75 -M24 -l2500 -k18 -h1250 -w8 -s100
ovlp_HPCdaligner_option =  -v -dal128  -M24 -k24 -h1250 -e.96 -l1500 -s100

pa_DBsplit_option = -a -x500 -s200
ovlp_DBsplit_option = -s200

falcon_sense_option = --output_multi --output_dformat --min_idt 0.70 --min_cov 4 --max_n_read 400 --n_core 8
falcon_sense_skip_contained = False

overlap_filtering_setting = --max_diff 120 --max_cov 120 --min_cov 4 --n_core 12

yilunhuangyue commented 8 years ago

Thanks a lot for your quick reply! It helps.

pb-cdunn commented 8 years ago

Jason, FALCON now supports auto-calculation of length_cutoff, like this:

length_cutoff = -1
seed_coverage = 20
genome_size = 1000000000
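
For intuition only, here is a rough sketch of how a cutoff could be derived from a target seed coverage and a genome size. This is an assumption about the general idea, not FALCON's actual implementation; the function name `cutoff_for_seed_coverage` and the toy numbers are hypothetical.

```python
def cutoff_for_seed_coverage(read_lengths, seed_coverage, genome_size):
    """Largest length L such that reads of length >= L together contain
    at least seed_coverage * genome_size bases.

    Returns None when all reads combined fall short of that target.
    """
    target_bases = seed_coverage * genome_size
    total = 0
    # Walk from the longest read down, accumulating bases until the
    # target is reached; the length at that point is the cutoff.
    for n in sorted(read_lengths, reverse=True):
        total += n
        if total >= target_bases:
            return n
    return None


# Hypothetical toy data: 30x seed coverage of a 1 kb "genome".
reads = [12000, 8000, 15000, 3000, 11000, 9500]
print(cutoff_for_seed_coverage(reads, seed_coverage=30, genome_size=1000))
```

With real data the read lengths would come from your input reads, and the other two arguments would mirror the seed_coverage and genome_size values in the cfg.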

pb-jchin commented 8 years ago

@pb-cdunn thanks for pointing it out. I think @yilunhuangyue needs to use the master branch for that.