PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Examples of up-to-date cfg files for FALCON? #658

Closed: apredeus closed this issue 6 years ago

apredeus commented 6 years ago

Hello all,

I am struggling to figure out what the recommended settings for FALCON are, and how to use it locally on a large SMP node (64-128 cores). I've found tutorials, but the logs suggest that many of their options are outdated. So, are there any examples of an up-to-date config for a local assembly run?

On a separate note, which assembly options should be tweaked when attempting to assemble a larger genome? The provided config files are for E. coli; surely the parameters would be different for a 500 Mb insect genome, for example?

Thank you for all suggestions.

apredeus commented 6 years ago

The config file I currently have for the test E. coli run is as follows. However, it fails with the following messages:

2018-07-25 17:12:59,484 - pypeflow.simple_pwatcher_bridge:94 - ERROR - Task Node(0-rawreads/cns-runs/cns_00007) failed with exit-code=1
2018-07-25 17:12:59,484 - pypeflow.simple_pwatcher_bridge:94 - ERROR - Task Node(0-rawreads/cns-runs/cns_00011) failed with exit-code=1
2018-07-25 17:12:59,484 - pypeflow.simple_pwatcher_bridge:94 - ERROR - Task Node(0-rawreads/cns-runs/cns_00013) failed with exit-code=1
2018-07-25 17:12:59,485 - pypeflow.simple_pwatcher_bridge:94 - ERROR - Task Node(0-rawreads/cns-runs/cns_00008) failed with exit-code=1
2018-07-25 17:12:59,485 - pypeflow.simple_pwatcher_bridge:94 - ERROR - Task Node(0-rawreads/cns-runs/cns_00020) failed with exit-code=1
2018-07-25 17:12:59,485 - pypeflow.simple_pwatcher_bridge:94 - ERROR - Task Node(0-rawreads/cns-runs/cns_00017) failed with exit-code=1
2018-07-25 17:12:59,485 - pypeflow.simple_pwatcher_bridge:94 - ERROR - Task Node(0-rawreads/cns-runs/cns_00014) failed with exit-code=1

ecoli.cfg:

[General]
# file of file names listing the initial subread fasta files
input_fofn = input.fofn

input_type = raw
#input_type = preads

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 12000

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 12000

# overlapping options for Daligner
pa_HPCdaligner_option =  -v -dal4 -M32 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -dal4 -M32 -h60 -e.96 -l500 -s1000

pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50

# error correction consensus options
falcon_sense_option = --output-multi --min-idt 0.70 --min-cov 4 --max-n-read 200 --n-core 6

# overlap filtering options
overlap_filtering_setting = --max-diff 100 --max-cov 100 --min-cov 20 --bestn 10

# For job-submission options, see https://github.com/PacificBiosciences/FALCON/wiki/Configuration
# These are old-style, but should still work, for now.

# Cluster queue setting
#sge_option_da = -pe smp 8 -q jobqueue
#sge_option_la = -pe smp 2 -q jobqueue
#sge_option_pda = -pe smp 8 -q jobqueue
#sge_option_pla = -pe smp 2 -q jobqueue
#sge_option_fc = -pe smp 24 -q jobqueue
#sge_option_cns = -pe smp 8 -q jobqueue

# concurrency setting
#pa_concurrent_jobs = 32
#cns_concurrent_jobs = 32
#ovlp_concurrent_jobs = 32

#pwatcher_type = fs_based # the default
#job_type = SGE # the default

[job.defaults]
pwatcher_type = blocking
submit = /bin/bash -c "${JOB_SCRIPT}" > "${JOB_STDOUT}" 2> "${JOB_STDERR}"
njobs = 32

[job.step.da]
NPROC = 8
# Daligner needs only 4 procs per job, but since we set `-M32`, we need 32GB per job. If
# your grid has roughly 4GB per processor, then we want to reserve 8 processors, in order to
# reserve 8*4GB == 32GB of RAM per job.

[job.step.la]
NPROC = 2

[job.step.pda]
NPROC = 8

[job.step.pla]
NPROC = 2

[job.step.cns]
NPROC = 6 # also to pass --n-core=6 to falcon_sense

[job.step.asm]
NPROC = 24 # also to pass --n-core=24 to ovlp_filter

pb-cdunn commented 6 years ago

If the warnings are fixed, then your config is basically fine. (If you point us to the "tutorials", we might be able to update those. But we can only update wikis and docs; we cannot touch the source code on GitHub anymore, because of management decisions.)

2018-07-25 17:12:59,484 - pypeflow.simple_pwatcher_bridge:94 - ERROR - Task Node(0-rawreads/cns-runs/cns_00007) failed with exit-code=1

When you see a message like that, simply go into that directory and look for stderr/stdout. E.g. ./0-rawreads/cns-runs/cns_00007/...

apredeus commented 6 years ago

Hello Christopher,

thank you for the suggestions. The configs I've looked at are here: https://pb-falcon.readthedocs.io/en/latest/parameters.html#parameters

conchoecia commented 6 years ago

I am having the same problem: there are many resources for FALCON, and I am confused about which instructions are up-to-date and authoritative. For example, I ran the tutorial here: https://pb-falcon.readthedocs.io/en/latest/tutorial.html#tutorial using a dockerized Falcon-unzip install based on these instructions: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries (which seems to be the latest, right?). However, I am getting the following error trace:

Traceback (most recent call last):
  File "/usr/local/bin/fc_run.py", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/falcon_kit/mains/run1.py", line 724, in main
    main1(argv[0], args.config, args.logger)
  File "/usr/local/lib/python2.7/site-packages/falcon_kit/mains/run1.py", line 60, in main1
    check_general_config(general_config, input_config_fn)
  File "/usr/local/lib/python2.7/site-packages/falcon_kit/mains/run1.py", line 40, in check_general_config
    raise Exception(msg)
Exception: Missing options.
We now require both "pa_daligner_option" (stage 0) and "ovlp_daligner_option" (stage 1),
which are automatically passed along to
  HPC.daligner
  HPC.TANmask
  HPC.REPmask

These can provide additional flags:
  pa_HPCdaligner_option
  pa_HPCTANmask_option
  ovlp_HPCdaligner_option
  pa_REPmask_code (-g/-c pairs for 3 iterations, e.g. '1,20;5,15;20,10')

... when I used the fc_run_ecoli_local.cfg available on the same tutorial page. I also wasn't able to find the options ovlp_daligner_option and pa_daligner_option in the readthedocs search. I am a first-time user of FALCON, so maybe I have just missed something obvious. Thanks!

pb-cdunn commented 6 years ago

The configs I've looked at are here: https://pb-falcon.readthedocs.io/en/latest/parameters.html#parameters

Well, it's a big company. I can't update those, and others are busy. Those mainly apply to Falcon as part of SMRT Link, our official release with any Sequel machine.

Maybe you can update the Wiki as a community if you figure something out.

pb-cdunn commented 6 years ago

@conchoecia,

https://github.com/PacificBiosciences/FALCON-integrate/issues/186#issuecomment-416057531

Does that help? If so, I will try to get that added to readthedocs.

conchoecia commented 6 years ago

Hi @pb-cdunn, Thanks for your comments. I ended up just installing falcon in a conda env like you suggested here: https://github.com/PacificBiosciences/FALCON_unzip/issues/136#issuecomment-416057708. I also have the dockerized Falcon-unzip I mentioned in the same thread.

For both versions of Falcon, I replaced the relevant lines mentioned here: https://github.com/PacificBiosciences/FALCON-integrate/issues/186#issuecomment-416057531. Now I am getting a new error, ending in:

  File "/home/dschultz/python/anaconda3/envs/falcon/lib/python2.7/site-packages/pypeflow/simple_pwatcher_bridge.py", line 361, in _refreshTargets
    raise Exception(msg)
Exception: Some tasks are recently_done but not satisfied: set([Node(0-rawreads/build)])

I'm attaching a config file that I based on this one, with the updates from https://github.com/PacificBiosciences/FALCON_unzip/issues/136#issuecomment-416057708. This config generated the same error in both the conda-installed version and the dockerized version.

all.log github2.cfg.txt

jasmynp commented 6 years ago

Using the latest binaries under 2018.08.08-21.41, I am now having an issue with daligner rejecting the -M parameter, but all documentation and examples still use this parameter. Did it get replaced recently?

`which HPC.TANmask`

The command help:

Usage: HPC.TANmask [-v] [-k<int(12)>] [-w<int(4)>] [-h<int(35)>] [-T<int(4)>] [-P<dir(/tmp)>] [-n<name(tan)>] [-e<double(.70)] [-l<int(500)>] [-s<int(100)] [-f] <reads:db|dam> [[-]

 Passed through to datander.
  -k: k-mer size (must be <= 32).
  -w: Look for k-mers in overlapping bands of size 2^-w.
  -h: A seed hit if the k-mers in band cover >= -h bps in the target read.

  -e: Look for alignments with -e percent similarity.
  -l: Look for alignments of length >= -l.
  -s: Use -s as the trace point spacing for encoding alignments.

  -T: Use -T threads.
  -P: Do first level sort and merge in directory -P.

 Passed through to TANmask.
  -l: minimum tandem mask interval to report.
  -n: use this name for the tandem mask track.
pb-cdunn commented 6 years ago

Exception: Some tasks are recently_done but not satisfied: set([Node(0-rawreads/build)])

@conchoecia , we need to see the stack-trace from the stderr file somewhere under the 0-rawreads/build/ directory.

pb-cdunn commented 6 years ago

@jasmynp, oops, I put -M on the wrong option. I've updated [my comment](https://github.com/PacificBiosciences/FALCON-integrate/issues/186#issuecomment-416057531).

The idea is that pa_daligner_option must contain only the options which are common to daligner/datander/REPmask.
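
For illustration, here is one way the options from the ecoli.cfg quoted above could be split under that rule. This is only a sketch based on the error message and this thread, not an official example: which flag goes where, and the omission of the old -dal4 batching flag, are my reading of the rule, so verify them against the usage output of your daligner/datander/HPC.daligner build.

# alignment options shared by daligner, datander, and REPmask
pa_daligner_option = -e.70 -l1000 -s1000
ovlp_daligner_option = -h60 -e.96 -l500 -s1000

# flags passed only via HPC.daligner (e.g. -v and the -M memory limit, which is
# not among the datander pass-through options shown in the HPC.TANmask help above)
pa_HPCdaligner_option = -v -M32
ovlp_HPCdaligner_option = -v -M32

# stage-0 repeat masking; these -g/-c pairs are just the example given in the error message
pa_REPmask_code = 1,20;5,15;20,10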

hollandorange commented 6 years ago

@pb-cdunn I still have problems setting the cfg parameters. I posted them in PacificBiosciences/FALCON-integrate#186 (comment); please have a look. Thanks~

conchoecia commented 6 years ago

@pb-cdunn - it appears that there is no stderr file in the 0-rawreads/build/ directory:

0-rawreads/build]$ ls
pwatcher.dir  run.sh  task.json  task.sh  template.sh  top.txt

pb-cdunn commented 6 years ago

@conchoecia, you are still using pwatcher_type = fs_based rather than = blocking, so your stderr is under the pwatcher.dir directory.

@hollandorange, you are now using = blocking, which is good, but you have old code.
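
For reference, a minimal sketch of the blocking setup, using the [job.defaults] section already quoted in the ecoli.cfg earlier in this thread (adjust the submit line and njobs to your machine):

[job.defaults]
pwatcher_type = blocking
submit = /bin/bash -c "${JOB_SCRIPT}" > "${JOB_STDOUT}" 2> "${JOB_STDERR}"
njobs = 32

With the default fs_based watcher, each task's stderr/stdout instead end up under the task's pwatcher.dir directory, which is why the file was not next to run.sh in the listing above.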

conchoecia commented 6 years ago

Hi @pb-cdunn. I am not sure what that means, but it is not something specified in any cfg file I have seen. This is using the version of Falcon installed via conda install pb-assembly.

gconcepcion commented 6 years ago

@conchoecia please try to run the example found here: https://github.com/gconcepcion/pb-assembly

The example that was part of the old readthedocs binary install package had an out-of-date fc_run.cfg for the latest version of Falcon. The version I linked to above should have an up-to-date fc_run.cfg.

pb-cdunn commented 6 years ago

@conchoecia, your stderr is under the pwatcher.dir directory inside the directory of the failed task.

conchoecia commented 6 years ago

Thanks for the pointer, @gconcepcion and the clarification, @pb-cdunn.

I tried running the 200kbp test run locally using the exact instructions here: https://github.com/gconcepcion/pb-assembly

git clone https://github.com/cdunn2001/git-sym.git
git clone https://github.com/pb-cdunn/FALCON-examples.git
cd FALCON-examples
../git-sym/git-sym update run/greg200k-sv2
cd run/greg200k-sv2
fc_run fc_run.cfg

This time the run proceeded much further than when I was using the readthedocs example cfg, but the run does not seem to have completed. The final few lines written to the terminal were:

[INFO]recently_satisfied:
set([Node(1-preads_ovl/db2falcon)])
[INFO]Num satisfied in this iteration: 1
[INFO]Num still unsatisfied: 1
[INFO]About to submit: Node(2-asm-falcon)
[INFO]Popen: 'bash -C anaconda3/envs/falcon/lib/python2.7/site-packages/pwatcher/mains/job_start.sh >| FALCON-examples/run/greg200k-sv2/2-asm-falcon/run-Pf4eea4f942f137.bash.stdout 2>| FALCON-examples/run/greg200k-sv2/2-asm-falcon/run-Pf4eea4f942f137.bash.stderr'
[INFO](slept for another 0.5s -- another 4 loop iterations)
[INFO](slept for another 2.5s -- another 5 loop iterations)
[INFO]recently_satisfied:
set([Node(2-asm-falcon)])
[INFO]Num satisfied in this iteration: 1
[INFO]Num still unsatisfied: 0
[WARNING]CD: '0-rawreads' <- 'FALCON-examples/run/greg200k-sv2'
[WARNING]CD: '0-rawreads' -> 'FALCON-examples/run/greg200k-sv2'

I checked greg200k-sv2/2-asm-falcon/run-Pf4eea4f942f137.bash.stderr as that seemed to be where the last process was, and these were the last few lines:

# Output the contig graph with associate contigs attached to each primary contig.
time python -m falcon_kit.mains.gen_gfa_v2 contig.gfa.json >| contig.gfa2
+ python -m falcon_kit.mains.gen_gfa_v2 contig.gfa.json
falcon-kit 1.2.2
pypeflow 2.0.4

real    0m0.491s
user    0m1.404s
sys     0m4.064s

#rm -f ./preads4falcon.fasta

touch falcon_asm_done
+ touch falcon_asm_done

date
+ date
2018-09-05 12:18:29,177 - root - DEBUG - Call '/bin/bash user_script.sh' returned 0.
2018-09-05 12:18:29,177 - root - WARNING - CD: 'FALCON-examples/run/greg200k-sv2/2-asm-falcon' -> 'FALCON-examples/run/greg200k-sv2/2-asm-falcon'
2018-09-05 12:18:29,177 - root - DEBUG - Checking existence of u'falcon_asm_done' with timeout=30
2018-09-05 12:18:29,177 - root - WARNING - CD: 'FALCON-examples/run/greg200k-sv2/2-asm-falcon' -> 'FALCON-examples/run/greg200k-sv2/2-asm-falcon'

real    0m7.679s
user    0m18.904s
sys     0m42.440s
touch FALCON-examples/run/greg200k-sv2/2-asm-falcon/run.sh.done
+ touch FALCON-examples/run/greg200k-sv2/2-asm-falcon/run.sh.done
+ finish
+ echo 'finish code: 0'

And these are the files now present in the greg200k-sv2/ dir: 0-rawreads 1-preads_ovl 2-asm-falcon all.log config.json data fc_run.cfg fc_unzip.cfg General_config.json input_bam.fofn input.fofn makefile README.md.

Any idea of what is going on? The error is not clear to me from the stderr file. It looks like the process completed normally. Thank you! Getting closer to being able to run Falcon locally...

pb-cdunn commented 6 years ago

Are you sure you have an error? It looks like it finished. 2-asm-falcon definitely succeeded ("finish code: 0"). Unless you want to run Unzip/Phasing, that's the final task, so I think you're done!

(We should drop the "warnings" for directory changes, and we should print a joyous message at the conclusion.)

gconcepcion commented 6 years ago

@conchoecia Looks to me like the assembly pipeline completed successfully.

Is there a p_ctg.fa file in the 2-asm-falcon directory that has a file size in the 200Kb range?

conchoecia commented 6 years ago

Yep, I had only run fc_run.py on the initial assembly, and had not yet run the unzip config. Everything worked fine! Thank you both for your help. I really was stuck!

If you don't mind, I may open a PR against the README.md of the pb-assembly repo to help clarify a few things for newcomers to Falcon like myself.

gconcepcion commented 6 years ago

@conchoecia feel free to submit a PR if you want to clarify the README.md

apredeus commented 5 years ago

Since this thread seems to be the most up-to-date source of configuration info, I might as well post my question here. If running locally (a 128-core node with 512 GB of RAM), what logic should one follow for the NPROC/MB/njobs settings? Am I right to understand that NPROC=8, MB=32000, and njobs=16 would allow up to 16 jobs, each running on 8 cores and using 32 GB of RAM?
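
To make that reading concrete, it would correspond to something like the settings below. This is only a sketch of the interpretation asked about (the MB option name is taken from the question itself), not a confirmed statement of how the local scheduler accounts for memory:

[job.defaults]
pwatcher_type = blocking
submit = /bin/bash -c "${JOB_SCRIPT}" > "${JOB_STDOUT}" 2> "${JOB_STDERR}"
njobs = 16     # at most 16 tasks run concurrently

[job.step.da]
NPROC = 8      # 8 cores per task
MB = 32000     # ~32 GB per task

# 16 jobs x 8 cores = 128 cores, and 16 jobs x 32 GB = 512 GB,
# i.e. the full node described above.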