esolares closed this issue 9 years ago
Hi, Edwin:
The code is designed for a target chunk of about 2 to 6 SMRT Cells of data, namely around 6 to 18 raw "subreads.fasta" files (each SMRT Cell now produces 3 subreads.fasta files). If you have a big chunk, it will run into memory problems. Sometimes I have one big fasta file to start with; I simply split the reads according to the read ids. I typically aim to use no more than 32 GB of RAM.
The parallelization is done with Python's multiprocessing module with shared memory, in two stages. d_core controls how many cores are used for looking up the index. Since the index is big, and even with shared memory it seems to me there is some leak, d_core should not be big. n_core controls the detailed alignment; those processes typically use very little memory. If you don't have a lot of memory (<48 GB), I would probably use d_core = 1 and n_core = (number of virtual cores - 2), assuming you can use the whole host.
By the way, there is no plan to develop falcon_overlap further, as Gene Myers' DALIGNER is much more efficient. I will keep it for a while in case it is sometimes useful to have an alternative for consistency checking.
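The low-memory rule of thumb above can be written down as a small helper. This is only a sketch of that rule, not code from falcon_overlap: the function names are hypothetical, and the only flags used are the `--d_core` and `--n_core` options that appear elsewhere in this thread.

```python
import multiprocessing


def low_memory_core_settings(virtual_cores=None):
    """Return (d_core, n_core) following the low-memory rule of thumb.

    d_core is kept at 1 because the index lookup is the memory-heavy stage;
    n_core gets the remaining virtual cores minus 2.
    """
    if virtual_cores is None:
        # Assumes the job may use the whole host.
        virtual_cores = multiprocessing.cpu_count()
    d_core = 1
    n_core = max(1, virtual_cores - 2)  # never drop below one worker
    return d_core, n_core


def core_options(d_core, n_core):
    """Format the two settings as command-line options (hypothetical helper)."""
    return "--n_core {} --d_core {}".format(n_core, d_core)


if __name__ == "__main__":
    d_core, n_core = low_memory_core_settings()
    print(core_options(d_core, n_core))
```

On a host with 16 virtual cores this gives d_core = 1 and n_core = 14. For hosts with more memory the thread only says that d_core should stay small, so the helper does not try to guess a value for that case.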
Hi,
Thank you for your reply. I used DALIGNER and merged all the las files, ending up with 31 las files, which I converted to preads.*.fa files and merged. I don't see a DALIGNER step yet that outputs the contig fasta file, so Gene Myers told me to use your falcon_overlap.py to finish the assembly, but I'm running into the memory issues I noted above. The servers we have access to have 64 cores and 512 GB of RAM, which is much more than 48 GB. Should I merge the 31 preads files into 3 larger files, i.e. 1-10, 11-20, 21-31? Or into halves, 1-15 and 16-31? I have run all the DALIGNER steps currently available; all that is left is to merge them and generate contigs for a finished draft assembly.
Thank you, I greatly appreciate your help,
Edwin
Hi, if you have a machine with 512 GB of RAM (which is much more than I typically use), I think the problem is definitely that the target (and query) sizes are too big. Can you check the sizes of the _t.fa and _q.fa files in the 0-fasta_files directory? Do you start with fasta files? I suggest using the first field of the read id, e.g., "m140329_03394842176....", as the key to separate the reads into files (all reads with the same prefix go in one file). If each file is about 150 MB, then you can use up to 24 files as the chunk size. (This is my setting for a recent assembly I did on our system, which has less RAM.)
Hi, I have 42 unique fasta files with keys "mXXXXXX_XXX...", and each ranges from 141 MB to 700 MB. So should I then create a separate database for each fasta file? If so, how would I consolidate them?
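Splitting one large fasta file by that first field could look roughly like the sketch below. It assumes the read ids have the form `movie/zmw/start_end`, so that the first field is everything before the first `/`; the function name and the output file naming are made up for illustration and are not part of the pipeline.

```python
import os


def split_fasta_by_movie(fasta_path, out_dir):
    """Write every read to <out_dir>/<movie>.fasta, keyed on the id prefix.

    Assumption: the key is the header text before the first '/'.
    Returns the sorted list of keys that were seen.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}    # one open output file per key
    current = None  # output file for the record being copied
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    header = line[1:].strip()
                    movie = header.split("/", 1)[0] or "unnamed"
                    if movie not in handles:
                        out_path = os.path.join(out_dir, movie + ".fasta")
                        handles[movie] = open(out_path, "w")
                    current = handles[movie]
                if current is not None:
                    # Header and sequence lines go to the same output file.
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

The sketch keeps one output file open per key, which is fine for a few dozen movies but would need rethinking if the number of distinct prefixes were very large.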
Thank you,
Edwin
You can start by just setting t_chunk_size = 2 and q_chunk_size = 4 to see if the computation goes through.
I'm sorry, but where would I put the t and q chunk size parameters? I tried in falcon_overlap.py but received an error about unrecognized arguments. Is this for hgap or hbar?
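As a rough cross-check on chunk sizes, the "about 150 MB per file, up to 24 files" figure quoted earlier in this thread can be turned into a data budget per chunk and compared with larger files. This is only back-of-the-envelope arithmetic under the assumption that memory use scales with the amount of sequence in a chunk, which the thread does not confirm; the function name is hypothetical.

```python
def files_per_chunk(file_sizes_mb, reference_file_mb=150, reference_chunk_files=24):
    """Estimate how many files fit in one chunk for a given set of file sizes.

    Assumption: the safe budget is reference_file_mb * reference_chunk_files
    (150 MB * 24 = 3600 MB), sized against the largest file to be conservative.
    """
    budget_mb = reference_file_mb * reference_chunk_files
    largest_mb = max(file_sizes_mb)
    return max(1, budget_mb // largest_mb)


# Files between 141 MB and 700 MB, as reported in this thread.
print(files_per_chunk([141, 700]))
```

By this estimate, files of up to 700 MB allow about 5 files per chunk, so starting from the small values suggested above and increasing them only if the run succeeds stays on the safe side.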
Thank you.
Oh... you need to drive the pipeline with https://github.com/PacificBiosciences/HBAR-DTK/blob/master/src/HBAR_WF3.py which reads a configuration file like this:
[General]
input_fofn = input2.fofn
length_cutoff = 6000
length_cutoff_pr = 6000
RQ_threshold = 0.75
sge_option_dm = -pe smp 16 -q huasm
sge_option_qf = -pe smp 1 -q huasm
sge_option_pa = -pe smp 16 -q huasm
sge_option_fca = -pe smp 24 -q huasm
sge_option_qv = -pe smp 16 -q huasm
blasr_opt = -nCandidates 32 -minMatch 12 -maxLCPLength 15 -bestn 32 -minPctIdentity 75.0 -maxScore -1000 -nproc 12
qrm_opt = --min_len 500 --n_core 18 --d_core 2 --n_candidates 256 --max_candidates 192
SEYMOUR_HOME = /mnt/secondary/Smrtpipe/builds/Assembly_Mainline_Nightly_Archive/build470-116466/
bestn = 192
target = falcon_asm
preassembly_num_chunk = 16
q_chunk_size = 24
t_chunk_size = 24
tmpdir = /tmp
big_tmpdir = /tmp
min_cov = 8
max_cov = 192
trim_align = 75
trim_plr = 0
q_nproc = 16
concurrent_jobs = 24
OK, thanks. I will look into this more, but I thought DALIGNER skipped the blasr alignment part, that DBLA_to_falcon with falcon_sense output the fasta files, and that the overlap was then done with falcon_overlap.
The code has been totally refactored. (v0.2.1 is merged to the mainline now, and falcon_overlap is removed in the latest code.) In the new code, both overlapping steps are done with Gene Myers' daligner code. I would like to close this issue for now. If there is a new issue related to the newer code, it should be tracked separately.
Thank you,
Could you please just point me in the right direction to find some info on this? Docs? Examples?
A document on the assembly and some examples are still under development. Please check doc/falcon_manual.md
Hi, I'm trying to use Falcon locally (not on a cluster). I set the job_type to local and I commented out the sge options. I get this error: "ConfigParser.NoOptionError: No option 'sge_option_da' in section: 'General'".
Do you know what the problem with my installation is?
Thank you
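The error above is what Python's config parser raises when code asks for an option that is absent from the section, and a commented-out line counts as absent. The sketch below reproduces this with Python 3's `configparser` (the module is named `ConfigParser` in Python 2, as in the traceback). It does not reproduce Falcon's own code: the config text, the value on the commented-out `sge_option_da` line, and the `read_option` helper are hypothetical.

```python
import configparser

# Hypothetical minimal config with the SGE option commented out.
CONFIG_TEXT = """
[General]
job_type = local
# sge_option_da = -pe smp 8 -q jobqueue
target = assembly
"""


def read_option(config, name, section="General", default=None):
    """Return an option's value, or `default` when the option is missing."""
    try:
        return config.get(section, name)
    except configparser.NoOptionError:
        return default


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# A direct config.get("General", "sge_option_da") would raise NoOptionError here.
print(read_option(config, "sge_option_da", default="<missing>"))
print(read_option(config, "job_type"))
```

If the workflow script looks the option up unconditionally, leaving the `sge_option_*` lines in the file even when job_type is local avoids the exception.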
I uncommented the sge options, and now I get this message: "No target specified, assuming "assembly" as target"
Thank you
Hi,
I have a 9.4 GB preads fasta file. I have tried running with the following parameters: --d_core 3 --n_core --min_len 8500 preads.fa > preads.ovl
where --n_core was set to 64, 32, 24, and 20, and all runs failed with a memory error: MemoryError: out of memory
I have tried running it on 16 cores, but after a few days the processes just stay idle with high memory usage for 72 hours. Have you experienced this? I am able to run falcon_overlap successfully on much smaller data sets, but not on this one.
I am running this on a CentOS 6 box with 512 GB of RAM and 64 cores. The files are stored on an NFS SSD RAID array.
Thank you,
Edwin