About get_rdata.py - GitHub Issues

wyim-pgl commented 10 years ago

Hi!

I am trying to use FALCON for our species.

How can I define group_ID?

I've got 14 queries. Do I need to run get_rdata.py 14 times?

Thank you.

Won

l 1-dist_map-falcon/

drwxr-xr-x 16 wyim 32K Jun 23 12:53 ./
drwxr-xr-x 10 wyim 32K Jun 23 17:03 ../
-rw-r--r-- 1 wyim 0 Jun 16 18:41 gather_target_done
drwxr-xr-x 2 wyim 32K Jun 21 02:31 q00001_md/
drwxr-xr-x 2 wyim 32K Jun 21 00:10 q00002_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:22 q00003_md/
drwxr-xr-x 2 wyim 32K Jun 20 23:56 q00004_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:27 q00005_md/
drwxr-xr-x 2 wyim 32K Jun 20 23:48 q00006_md/
drwxr-xr-x 2 wyim 32K Jun 21 00:03 q00007_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:35 q00008_md/
drwxr-xr-x 2 wyim 32K Jun 20 23:52 q00009_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:39 q00010_md/
drwxr-xr-x 2 wyim 32K Jun 21 00:07 q00011_md/
drwxr-xr-x 2 wyim 32K Jun 20 23:59 q00012_md/
drwxr-xr-x 2 wyim 32K Jun 21 00:16 q00013_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:24 q00014_md/
-rw-r--r-- 1 wyim 1.4K Jun 23 12:53 queries.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00001.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00002.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00003.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00004.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00005.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00006.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00007.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00008.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00009.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00010.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00011.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00012.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00013.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00014.fofn
-rw-r--r-- 1 wyim 0 Jun 16 18:41 split_fofn_done
-rw-r--r-- 1 wyim 500 Jun 16 18:41 target_00001.fofn
-rw-r--r-- 1 wyim 500 Jun 16 18:41 target_00002.fofn
-rw-r--r-- 1 wyim 400 Jun 16 18:41 target_00003.fofn
-rw-r--r-- 1 wyim 1.4K Jun 23 12:52 target.fofn

pb-jchin commented 10 years ago

No. If you use the latest HBAR_WF2.py or HBAR_WF3.py to drive the assembly work, there is a parameter, preassembly_num_chunk, that controls how many pre-assembly consensus jobs will be submitted. The get_rdata.py script should collect all data from the 1-dist_map-falcon directory and do its own partitioning.

wyim-pgl commented 10 years ago

Thank you for your answer.

I ran HBAR_WF3.py, but there is nothing in the 3-asm-falcon/ folder.

So I tried to run:

for i in {0..15}; do
    get_rdata.py ./0-fasta_files/queries.fofn ./0-fasta_files/targets.fofn ./2-preads-falcon/m4_files.fofn 72 ${i} 16 8 64 50 50 | falcon_wrap.py > p-reads-${i}.fasta
done

I used preassembly_num_chunk = 8, but it generated p-reads-1.fasta to p-reads-15.fasta.

I also ran it with different group numbers:

for i in {16..24}; do
    get_rdata.py ./0-fasta_files/queries.fofn ./0-fasta_files/targets.fofn ./2-preads-falcon/m4_files.fofn 72 ${i} 16 8 64 50 50 | falcon_wrap.py > p-reads-${i}.fasta
done

But only 0-byte files were generated.

That's why I asked about this.

Thank you.

Won

pb-jchin commented 10 years ago

Hi, Won:

Do you see pa*.sh in the directory 2-preads-falcon? The 6th positional argument controls the number of partitions that get_rdata.py uses. In your case, you set it to 16, so the valid group IDs are 0-15. Do you get data in p-reads-1.fasta to p-reads-15.fasta?
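To illustrate why group IDs outside 0-15 give empty output when the partition count is 16, here is a minimal sketch. It assumes reads are assigned to groups by index modulo the partition count; the thread does not say how get_rdata.py actually partitions, so `reads_in_group` and the toy read names are hypothetical and only show the general idea.

```python
def reads_in_group(read_ids, group_id, num_chunk):
    """Return the reads one group would handle under modulo partitioning.

    Assumption: read i belongs to group (i % num_chunk). This is an
    illustrative scheme, not necessarily what get_rdata.py does.
    """
    return [r for i, r in enumerate(read_ids) if i % num_chunk == group_id]


# Toy data: 40 hypothetical read names split into 16 partitions.
read_ids = [f"read_{i:03d}" for i in range(40)]
num_chunk = 16

# Ask for group IDs 0..24, as in the two shell loops in this thread.
sizes = {g: len(reads_in_group(read_ids, g, num_chunk)) for g in range(25)}
non_empty = [g for g, n in sizes.items() if n > 0]
empty = [g for g, n in sizes.items() if n == 0]

print("groups with reads:", non_empty)
print("groups without reads:", empty)
```

Under this assumption, only group IDs 0 through num_chunk - 1 can ever receive reads, which matches the 0-byte files seen for 16 through 24.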

wyim-pgl commented 10 years ago

Hi Jason,

The only thing I can see in that folder is m4 file list.

[wyim@fnode2 kl]$ ll 2-preads-falcon/
total 96K
drwxr-xr-x 2 wyim 32K Jun 23 17:29 ./
drwxr-xr-x 10 wyim 32K Jun 23 19:51 ../
-rw-r--r-- 1 wyim 854 Jun 21 02:39 m4_files.fofn

I've got data in p-reads-1.fasta to p-reads-15.fasta.

Thank you.

Won

wyim-pgl commented 10 years ago

Dear Chin,

I ran the FALCON assembler, but it generated contigs that are too small.

My preads.fasta is 419 Mb and preads.ovlp is 315 Mb.

But unitigs.fa is only 35 Mb.

Do I need some other parameter?

I ran it with the default settings.

Won

pb-jchin commented 10 years ago

If the genome size is about 20 Mb, 419 Mb is about 21x. Typically, unitigs.fa is about twice the final assembly size. What is the expected genome size, and what is the assembly N50?

wyim-pgl commented 10 years ago

Our expected genome size is 500 Mb. I haven't checked the N50 yet.

pb-jchin commented 10 years ago

Then what happened is that your coverage is < 1x, so there is only sparse overlapping, and the code filters out contigs with less than a certain amount of support. When the initial set is < 1x, this is what is expected.
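The coverage figures in this exchange follow from dividing the total pre-assembled read bases by the genome size. A minimal sketch of that arithmetic, assuming the 419 Mb size of preads.fasta is close to the number of bases it holds (headers and newlines make the true base count slightly smaller):

```python
def coverage(total_bases, genome_size):
    """Fold coverage: total sequenced bases divided by genome size."""
    return total_bases / genome_size


# Assumption: the preads.fasta file size approximates its base count.
PREADS_BASES = 419e6

# The two genome sizes discussed in the thread.
cov_if_20mb = coverage(PREADS_BASES, 20e6)
cov_if_500mb = coverage(PREADS_BASES, 500e6)

print(f"20 Mb genome:  {cov_if_20mb:.2f}x")
print(f"500 Mb genome: {cov_if_500mb:.2f}x")
```

For a 500 Mb genome this comes out to roughly 0.84x, which is the sub-1x situation described above.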

wyim-pgl commented 10 years ago

We assume our genome is tetraploid, and we have 20 cells of P4C2 and 20 cells of P5C3.

Is there any coverage cutoff option in HBAR_WF3.py?

I used bestn = 36 in HBAR.cfg and 24 for BLASR.

wyim-pgl commented 10 years ago

And using SMRT Analysis, it generated 380 Mb.

I think the coverage is enough.

pb-jchin commented 10 years ago

In general, you will need at least 6x (somewhere around 12x to 20x might be best) of pre-assembled reads to generate an assembly from shotgun data. The exact number needed for good results depends on many factors, e.g. read length distribution and genome complexity. The question is not a software issue in nature, so I will close this issue for now. You can check out some of the tutorials on how to get a good assembly: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads

While the code used may be different, the fundamental statistical issue in getting an assembly is mathematical.

This page has some useful information too. https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/De-Novo-Assembly
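As a rough check against the 6x minimum and the 12x to 20x range mentioned in this comment, the sketch below computes how many pre-assembled read bases each target would require for the 500 Mb genome from this thread. It assumes the 419 Mb preads.fasta size approximates the current base count; `bases_needed` is a hypothetical helper, not part of FALCON.

```python
def bases_needed(target_coverage, genome_size):
    """Bases of pre-assembled reads required to reach a target fold coverage."""
    return target_coverage * genome_size


GENOME_SIZE = 500e6    # expected genome size from the thread
CURRENT_BASES = 419e6  # assumption: preads.fasta size ~ number of bases

for target in (6, 12, 20):
    needed = bases_needed(target, GENOME_SIZE)
    shortfall = max(0.0, needed - CURRENT_BASES)
    print(f"{target:>2}x needs {needed / 1e9:.1f} Gb "
          f"(short by {shortfall / 1e9:.2f} Gb)")
```

Even the 6x minimum corresponds to 3 Gb of pre-assembled reads for a genome of this size, several times what the current preads.fasta contains.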

PacificBiosciences / FALCON

About get_rdata.py #2