PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release:

too long time running run_filter_stage2 #372

Open tangerzhang opened 8 years ago

tangerzhang commented 8 years ago

Hello, I am working on a PacBio assembly of a plant genome and have 52X corrected reads. When I fed these preads to the FALCON assembly, run_filter_stage2 ran for more than two days and still has not finished. I checked the las.fofn file, which contains 323036 lines. I assume the long running time is caused by the large number of las files. Is that normal? Any suggestions? Thanks a lot!

My config file looks like:
input_fofn = preads.fofn
input_type = preads
length_cutoff = 10000
length_cutoff_pr = 9000 
sge_option_da = -pe orte 8 -q all.q
sge_option_la = -pe orte 8 -q all.q
sge_option_pda = -pe orte 8 -q all.q
sge_option_pla = -pe orte 8 -q all.q
sge_option_fc = -pe orte 8 -q all.q
sge_option_cns = -pe orte 8 -q all.q
pa_concurrent_jobs = 60
cns_concurrent_jobs = 60
ovlp_concurrent_jobs = 60
pa_HPCdaligner_option =  -v -dal4 -t16 -e.70 -l1000 -s1000  
ovlp_HPCdaligner_option = -l4800 -k18 -h480 -w8 -H15000 -M32
pa_DBsplit_option = -x200 -s50
ovlp_DBsplit_option = -x200 -s50
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 3  --max_n_read 200 --n_core 6 
overlap_filtering_setting = --max_diff 100 --max_cov 80 --min_cov 2 --bestn 10 --n_core 24
pb-jchin commented 8 years ago

Yes. You need the -dal option in the ovlp_HPCdaligner_option parameter. You have way too many small las files for the filter to go through. The excessive number of shell processes is probably the culprit for the slowness. Try "-dal128" (in newer versions, "-B128") to reduce the number of merged files in the final overlapping stage. I typically check how many merge jobs there will be by examining the 1-preads_ovl/
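As a quick way to check whether a daligner option string from the config already carries the block-grouping flag discussed above, here is a minimal Python sketch. The helper name `block_grouping` is hypothetical and not part of FALCON; it only looks for the two spellings mentioned in this thread ("-dal<N>" and "-B<N>") and assumes the flag and its value are written without a space between them.

```python
import re

# Matches the block-grouping flag in either spelling mentioned in this thread:
# "-dal<N>" (older versions) or "-B<N>" (newer versions).
_GROUPING_FLAG = re.compile(r"(?:^|\s)-(?:dal|B)(\d+)(?=\s|$)")


def block_grouping(option_string):
    """Return the integer given to -dal/-B, or None if neither flag is present."""
    match = _GROUPING_FLAG.search(option_string)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Option strings copied from the config posted in this issue.
    options = {
        "pa_HPCdaligner_option": "-v -dal4 -t16 -e.70 -l1000 -s1000",
        "ovlp_HPCdaligner_option": "-l4800 -k18 -h480 -w8 -H15000 -M32",
    }
    for name, value in options.items():
        print(name, "->", block_grouping(value))
```

For the config in this issue, the sketch reports a value for pa_HPCdaligner_option and nothing for ovlp_HPCdaligner_option, which is the gap pointed out above.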

pb-jchin commented 8 years ago

Another note: if you already have many small las files, you could merge them manually and use the merged las files as input. However, you have to make sure there are no redundant entries in the merged files.

tangerzhang commented 8 years ago

Thanks, Jason. I have resubmitted the job with -dal128. I will see the results. That would take too long.

tangerzhang commented 8 years ago

Hi Jason, I tried -B128 but still have the same problem. I think it might be a bug introduced when I updated to the latest FALCON release. My previous (successful) run, in which I used FALCON v0.4, generated a las.fofn file containing only preads.*.las. The content of las.fofn is attached below:


However, the failed run (latest FALCON release) generated a las.fofn file that contains all las files, including L1.*.las, L2.*.las and preads.*.las. Part of the file is attached below:


Is this a bug, or did I do something wrong? For now I can use only preads.*.las, but I would like to know what causes this problem so that I can avoid it in the future. Thanks!
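The difference between the two las.fofn files described above can be checked with a small Python sketch that counts entries by file-name prefix and keeps only the preads.*.las entries. The helper names and the example paths are hypothetical and not part of FALCON; the only assumption taken from this thread is that the file names follow the preads.*.las, L1.*.las and L2.*.las patterns.

```python
import fnmatch
import os
from collections import Counter


def read_fofn(fofn_path):
    """Read a file-of-file-names, skipping blank lines."""
    with open(fofn_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def count_by_prefix(paths):
    """Count entries by the part of the file name before the first dot."""
    return Counter(os.path.basename(p).split(".", 1)[0] for p in paths)


def keep_final_preads(paths):
    """Keep only entries whose file name matches preads.*.las."""
    return [
        p for p in paths
        if fnmatch.fnmatchcase(os.path.basename(p), "preads.*.las")
    ]


if __name__ == "__main__":
    # Hypothetical entries, for illustration only.
    sample = [
        "/work/1-preads_ovl/m_00001/preads.1.las",
        "/work/1-preads_ovl/m_00001/L1.1.1.las",
        "/work/1-preads_ovl/m_00001/L2.1.1.las",
        "/work/1-preads_ovl/m_00002/preads.2.las",
    ]
    print(count_by_prefix(sample))
    print(keep_final_preads(sample))
```

On a real run, `read_fofn("las.fofn")` would supply the list; a large count under the L1 or L2 prefixes would point to the behaviour described in this comment.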

pb-jchin commented 8 years ago

Yes, it is a bug. I have already submitted a PR; see

pb-cdunn commented 8 years ago

Could you tell us which commit you are using (git rev-parse HEAD)? Or did you simply download the latest release? I am about to issue a new release with the fix.

The good news is that you will not need to re-run everything. After updating FALCON (the tip of master is fine), simply:

rm -rf 2-*/
rm -rf 1-*/

And restart. Stage-0 should be fine.