PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

falcon_overlap on large datasets #5

Closed esolares closed 9 years ago

esolares commented 10 years ago

Hi,

I have a 9.4GB preads fasta file. I have tried running with the following parameters: --d_core 3 --n_core --min_len 8500 preads.fa > preads.ovl

where --n_core was set to 64, 32, 24, and 20. All of these runs failed with a memory error: MemoryError: out of memory

I have also tried running it on 16 cores, but after a few days the processes just stay idle with high memory usage for 72 hours. Have you guys experienced this? I am able to successfully execute falcon_overlap on much smaller data sets, but not on this one.

I am running this on a CentOS 6 box with 512GB RAM and 64 cores. The files are stored on an NFS SSD RAID array.

Thank you,

Edwin

pb-jchin commented 10 years ago

Hi, Edwin:

The code is designed for target chunks of about 2 to 6 SMRT Cells of data, namely around 6 to 18 raw "subreads.fasta" files (each SMRT Cell produces 3 subreads.fasta files now). If you have a big chunk, it will run into memory problems. Sometimes, when I have one big fasta file to start with, I simply split the reads according to the read IDs. I typically shoot for using no more than 32 GB of RAM.

The parallelization is done with Python's multiprocessing module using shared memory, in two stages. d_core controls how many cores are used for looking up the index. Since the index is big, even with shared memory there seems to be some leak, so d_core should not be big. n_core controls the detailed alignment, which typically uses very little memory. However, if you don't have a lot of memory (<48 GB), I would probably use d_core = 1 and n_core = (number of virtual cores - 2), if you can use the whole host.
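As a rough illustration of the core-count rule of thumb above, here is a minimal Python sketch. The function name `suggest_cores` is hypothetical, and the value used for d_core when more than 48 GB is available is an assumption (the comment only says d_core should not be big); this is not part of falcon_overlap.

```python
def suggest_cores(virtual_cores, ram_gb, low_memory_gb=48):
    """Return (d_core, n_core) following the rule of thumb in the comment.

    n_core: leave two virtual cores free when the whole host is available.
    d_core: 1 on low-memory hosts; 2 otherwise (ASSUMPTION: the thread only
    says d_core "should not be big", it gives no value for larger hosts).
    """
    n_core = max(1, virtual_cores - 2)
    d_core = 1 if ram_gb < low_memory_gb else 2
    return d_core, n_core


def as_options(d_core, n_core):
    """Format the two values as the command-line options used in this thread."""
    return "--d_core {} --n_core {}".format(d_core, n_core)


if __name__ == "__main__":
    print(as_options(*suggest_cores(virtual_cores=16, ram_gb=32)))
```

Note that this only picks process counts. It does not address the chunk size, which the thread identifies as the main driver of memory use.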

By the way, there is no plan to develop falcon_overlap further, as Gene Myers' DALIGNER is much more efficient. I will keep it for a while, since it can sometimes be good to have an alternative for consistency checking.

esolares commented 10 years ago

Hi,

Thank you for your reply. I used DALIGNER and merged all the las files, ending up with 31 las files, which I converted to preads.*.fa files and merged. I don't see a step in DALIGNER yet that outputs the contig fasta file, so I was told by Gene Myers to use your falcon_overlap.py to finish the assembly, but I'm running into the memory issues I noted above. The servers we have access to have 64 cores and 512GB of RAM, which is much more than 48GB. Should I merge the 31 preads files into 3 larger files, i.e. 1-10, 11-20, 21-31? Or into halves, 1-15 and 16-31? I have run all the DALIGNER steps currently available; all that is left is to merge them and generate contigs for a finished draft assembly.

Thank you, I greatly appreciate your help,

Edwin

pb-jchin commented 10 years ago

Hi, if you have a machine with 512 GB of RAM (which is much more than I typically use), I think the problem is definitely that the target (and query) sizes are too big. Can you check the size of the _t.fa and _q.fa files in the 0-fasta_files directory? Do you start with fasta files? I would suggest using the first field of the read ID, e.g., "m140329_03394842176....", as the key to separate the files (all reads with the same prefix go in one file). If each file is about 150 MB, then you can use up to 24 files as the chunk size. (This is my setting for a recent assembly I did on our system, which has less RAM.)
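The splitting by read ID prefix suggested above could look roughly like the following Python sketch. It assumes that the fields of a read ID are separated by "/" (so the key is everything before the first "/") and that one output file per key is wanted. The names `movie_key` and `split_fasta_by_movie` are hypothetical, not part of FALCON.

```python
import os


def movie_key(header_line):
    """Return the first '/'-separated field of the read ID in a FASTA header.

    ASSUMPTION: headers look like '>m140329_.../1234/0_5000'.
    """
    read_id = header_line[1:].split()[0]
    return read_id.split("/")[0]


def split_fasta_by_movie(in_path, out_dir):
    """Stream a FASTA file and write each record to '<out_dir>/<key>.fa'.

    The input is read line by line, so the whole file is never held in memory.
    Returns the sorted list of keys that were seen.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    current = None
    try:
        with open(in_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    key = movie_key(line)
                    if key not in handles:
                        out_path = os.path.join(out_dir, key + ".fa")
                        handles[key] = open(out_path, "w")
                    current = handles[key]
                # Sequence lines follow their header into the same file.
                if current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

One output handle stays open per key, which is fine for a few dozen prefixes. With thousands of distinct prefixes the operating system's open-file limit would become a concern.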

esolares commented 10 years ago

Hi, I have 42 unique fasta files with keys "mXXXXXX_XXX...", and they range from 141MB to 700MB each. So should I create a separate database for each fasta file? If so, how would I consolidate them?

Thank you,

Edwin

pb-jchin commented 10 years ago

You can start with just setting the t_chunk_size = 2, q_chunk_size = 4 to see if the computation goes through.
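For a rough sense of why small chunk sizes are a sensible starting point with files of 141MB to 700MB, the earlier figure of "about 150M per file, up to 24 files per chunk" can be turned into a size budget. The sketch below assumes that memory use scales with the total size of a chunk, which the thread does not state explicitly; `max_files_per_chunk` is a hypothetical helper.

```python
def max_files_per_chunk(file_size_mb, ref_file_mb=150, ref_files=24):
    """Estimate how many files of a given size fit in one chunk.

    The budget is the reference setting quoted in the thread:
    24 files of about 150 MB each, i.e. roughly 3600 MB per chunk.
    ASSUMPTION: memory use grows with the total chunk size.
    """
    budget_mb = ref_file_mb * ref_files
    return max(1, int(budget_mb // file_size_mb))


if __name__ == "__main__":
    for size in (141, 700):
        print(size, "MB per file ->", max_files_per_chunk(size), "files per chunk")
```

Under that assumption, 700 MB files would allow only about 5 files per chunk, so starting at 2 and 4 and increasing once the run completes is the cautious route.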

esolares commented 10 years ago

I'm sorry, but where would I put the t and q chunk size parameters? I tried passing them to falcon_overlap.py but received an error about unrecognized arguments. Is this for HGAP or HBAR?

Thank you.

pb-jchin commented 10 years ago

Oh... you need to drive the pipeline with https://github.com/PacificBiosciences/HBAR-DTK/blob/master/src/HBAR_WF3.py It reads a configuration file.

[General]

# list of files of the initial bas.h5 files
input_fofn = input2.fofn

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 6000

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 6000

# The read quality cutoff used for seed reads
RQ_threshold = 0.75

# SGE job option for distributed mapping
sge_option_dm = -pe smp 16 -q huasm

# SGE job option for m4 filtering
sge_option_qf = -pe smp 1 -q huasm

# SGE job option for pre-assembly
sge_option_pa = -pe smp 16 -q huasm

# SGE job option for falcon asm
sge_option_fca = -pe smp 24 -q huasm

# SGE job option for Quiver
sge_option_qv = -pe smp 16 -q huasm

# blasr for initial read-read mapping for each chunk (do not specify the "-out" option).
# One might need to tune the bestn parameter to match the number of distributed chunks to get more optimized results
blasr_opt = -nCandidates 32 -minMatch 12 -maxLCPLength 15 -bestn 32 -minPctIdentity 75.0 -maxScore -1000 -nproc 12

qrm_opt = --min_len 500 --n_core 18 --d_core 2 --n_candidates 256 --max_candidates 192

# This is used for running quiver
SEYMOUR_HOME = /mnt/secondary/Smrtpipe/builds/Assembly_Mainline_Nightly_Archive/build470-116466/

# The number of best alignment hits used for pre-assembly
# It should be about the same as the final PLR coverage, slightly higher might be OK.
bestn = 192

# target choices are "pre_assembly", "draft_assembly", "all"
# "mapping": initial mapping
# "pre_assembly" : generate pre_assembly for any long read assembler to use
# "draft_assembly": automatic submit CA assembly job when pre-assembly is done
# "all" : submit job for using Quiver to do final polish, not working yet
target = falcon_asm

# number of chunks for distributed mapping
preassembly_num_chunk = 16

# number of chunks for pre-assembly.
# One might want to use bigger chunk data sizes (smaller dist_map_num_chunk) to
# take the advantage of the suffix array index used by blasr
q_chunk_size = 24
t_chunk_size = 24

# "tmpdir" is for preassembly. A lot of small files are created and deleted during this process.
# It would be great to use ramdisk for this. Setting tmpdir to an NFS mount will probably have very bad performance.
tmpdir = /tmp

# "big_tmpdir" is for quiver, better in a big disk
big_tmpdir = /tmp

# various trimming parameters
min_cov = 8
max_cov = 192
trim_align = 75
trim_plr = 0

# number of processes used by blasr during the preassembly process
q_nproc = 16

concurrent_jobs = 24

esolares commented 10 years ago

OK, thanks. I will look into this more, but I thought DALIGNER skipped the blasr alignment part, that DBLA_to_falcon with falcon_sense output the fasta files, and that the overlap was then done with falcon_overlap.

pb-jchin commented 9 years ago

The code has been totally refactored. (v0.2.1 is merged into the mainline now, and falcon_overlap is removed in the latest code.) In the new code, both overlapping steps are done with Gene Myers' daligner code. I would like to close this issue for now. If there is a new issue related to the newer code, it should be tracked separately.

esolares commented 9 years ago

Thank you,

Could you just please point me in the right direction where I can find some info on this? docs? examples?

cschin commented 9 years ago

A document on the assembly process and some examples are still under development. Please check doc/falcon_manual.md

zine-el-aabidine commented 9 years ago

Hi, I'm trying to use Falcon locally (not on a cluster). I set job_type to local and commented out the sge options. I get this error: "ConfigParser.NoOptionError: No option 'sge_option_da' in section: 'General'".

Do you know what the problem with my installation is?

Thank you
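The error quoted above is what Python's config parser raises whenever code asks for an option that is absent from the file, and a commented-out line counts as absent. The sketch below reproduces this with Python 3's `configparser` (the Python 2 module in the traceback is named `ConfigParser`). The config text and the helper `read_option` are made up for illustration, and whether FALCON reads the sge options unconditionally is only inferred from the error message.

```python
import configparser

# Hypothetical config: job_type is local and the SGE option is commented out.
CONFIG_TEXT = """
[General]
job_type = local
# sge_option_da = -pe smp 8 -q jobqueue
"""


def read_option(text, option, section="General"):
    """Return the option's value, or None if the option is not present."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    try:
        return parser.get(section, option)
    except configparser.NoOptionError:
        # A commented-out line is not an option, so get() raises this error.
        return None


if __name__ == "__main__":
    print(read_option(CONFIG_TEXT, "job_type"))
    print(read_option(CONFIG_TEXT, "sge_option_da"))
```

If the script does read these options regardless of job_type, leaving the sge lines in the file (even with values that go unused in local mode) would avoid the exception.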

zine-el-aabidine commented 9 years ago

I uncommented the sge options, and now I get this message: "No target specified, assuming "assembly" as target"

Thank you