E.coli output - Githubissues

juroscito commented 9 years ago

Hi, I'm trying out FALCON with the E. coli test data, but I seem to get a different output (i.e. a different assembly size) compared to the numbers you published on Twitter a few days ago (4,631,625b). I tried both config files from the 'examples' folder (fc_run_ecoli.cfg and fc_run_ecoli_2.cfg) and get different numbers for both (4,631,535b and 4,631,559b respectively).
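For anyone who wants to check such numbers on their own run, one simple approach is to sum the contig lengths in the output FASTA file. The snippet below is a minimal sketch of that idea, not part of FALCON; the function names `fasta_lengths` and `assembly_size` are made up for illustration.

```python
def fasta_lengths(lines):
    """Return {record_name: sequence_length} for an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence lines may be wrapped, so accumulate across lines.
            lengths[name] += len(line)
    return lengths


def assembly_size(path):
    """Total number of bases over all records in the FASTA file at `path`."""
    with open(path) as handle:
        return sum(fasta_lengths(handle).values())
```

Calling `assembly_size` on the contig FASTA of each run (for example the `2-asm-falcon/p_ctg.fa` file mentioned further down in this thread) gives the totals to compare.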

I'm running Falcon on a CentOS 6.3 machine, and currently going for the job_type = local instead of SGE. The only message I get on the terminal, throughout the entire process, is 'No target specified, assuming "assembly" as target'. I also couldn't find any relevant .log files.

Anyways, even though the difference is minor, I find it still important to understand where this variability comes from. Who knows if and how it will scale up for larger genomes...

With which .cfg file did you assemble the E. coli data (those 3 fasta files from Dropbox)? Also, do you think that different versions of Python, DAZZ_DB and DALIGNER could account for this difference?

Following this, it would be really great if you could: 1) add a section to the manual with the expected results of the E. coli test; 2) add a more detailed description of which files one should generally expect as output (for example, the a_ctg files), and what they represent.

Many thanks, Juliana

pb-jchin commented 9 years ago

Juliana, thanks for checking it out. While the code is written to be deterministic, the details of the final assembly do depend on how the code is run. For example, the execution order of the jobs and the detailed parameters will affect the details of the final results. Do you get the same result if you run it twice locally? (I think this is the case in my own test, but I will have to do it again to confirm.)

Technically, the code is deterministic. However, if the input data is processed in a different order in a distributed computing environment, or split in different ways, you won't get exactly the same results. For full reproducibility, one should consider comparing the results after running Quiver. The difference should be smaller than what the consensus algorithm can handle.
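To test whether two runs really produced the same contigs, it can help to compare the sequences themselves rather than only the total size. The following is a rough sketch under stated assumptions: it treats two assemblies as identical when they contain the same set of sequences, ignoring contig names, contig order, line wrapping and letter case. It does not account for contigs reported on the opposite strand. The function names are hypothetical and not part of FALCON.

```python
import hashlib


def read_sequences(lines):
    """Return the list of sequences (upper-cased) from an iterable of FASTA lines."""
    seqs = []
    chunks = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if chunks is not None:
                seqs.append("".join(chunks))
            chunks = []
        elif line and chunks is not None:
            chunks.append(line.upper())
    if chunks is not None:
        seqs.append("".join(chunks))
    return seqs


def assembly_fingerprint(lines):
    """Digest that depends only on the multiset of sequences, not names or order."""
    digests = sorted(
        hashlib.sha256(seq.encode("ascii")).hexdigest()
        for seq in read_sequences(lines)
    )
    return hashlib.sha256("\n".join(digests).encode("ascii")).hexdigest()
```

If the fingerprints of two local runs match, the runs gave the same contigs; if they differ, a base-level comparison (or a comparison after consensus polishing, as suggested above) is the next step.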

I hope this helps.

I like your suggestion to add the E. coli test results to the manual. I will do it. (I will leave this issue open.)

AnWD commented 9 years ago

Hi jchin, what do all the output files mean, and what are they for? The introduction in the manual doesn't seem to be enough. Also, when I ran FALCON on about 5G of fasta files from human liver cancer with fc_run_arab.cfg, I got only 1G of output files, and raw_reads.bps alone is 900M. The last line of the log said 'No target specified, assuming "assembly" as target'. I wonder whether FALCON actually finished. Thank you.

pb-jchin commented 9 years ago

Hi AnWD, this repository's issue tracker mainly focuses on software-related issues. Fundamental bioinformatics questions (e.g. experimental design for cancer sequencing) are not easy to discuss here. Many things depend on your specific scientific goal. If you are a PacBio customer, you might be able to participate in some of the bioinformatics training provided by PacBio. I think that will help you more than discussing it here. You can also consider SEQanswers or other Chinese or English discussion forums.

dgordon562 commented 9 years ago

Like AnWD, I'm not clear from reading falcon_manual.md what the final output of the assembly is, i.e. the fasta file of the contigs. Is it 2-asm-falcon/p_ctg.fa or is it 2-asm-falcon/a_ctg.fa?

thanks!

pb-jchin commented 9 years ago

Until we have time to write it up, this slide deck explains the idea behind the "primary contigs / p_ctg.fa" and "associated contigs / a_ctg.fa": https://speakerdeck.com/jchin/string-graph-assembly-for-diploid-genomes-with-long-reads

This is a somewhat "new" concept, a step toward a full diploid assembler. If you have a haploid genome, you can mostly stay with p_ctg.fa. However, in that case, if you have sequences in a_ctg.fa, they might be caused by some ambiguity in the assembly graph. They can be used to indicate where the assembly might need special attention.

pb-cdunn commented 9 years ago

@juroscito, we will add more details to the documentation over time. Feel free to update the wiki.

For repeatability, try FALCON-integrate.

cd FALCON-integrate
make init
make virtualenv
make -j install
make test
cd FALCON-examples
make run-ecoli

pb-cdunn commented 9 years ago

Also, the "synth0" test in FALCON-examples is fully repeatable. It uses error-free fake reads from a small fake circular genome. It actually diffs the result against the fake genome. We'll create a fake diploid example eventually.
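Comparing an assembled contig against a circular reference has one subtlety: the contig may start at a different position on the circle, or be reported on the opposite strand. The sketch below shows one way to do such a comparison for exact, error-free sequences. It is an illustration only and is not necessarily how the synth0 test performs its diff; the function names are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_rotation(a, b):
    """True if string `a` is a rotation of string `b`."""
    # Every rotation of b appears as a substring of b concatenated with itself.
    return len(a) == len(b) and a in (b + b)


def same_circular_sequence(assembled, reference):
    """True if the two circular sequences match up to rotation and strand."""
    assembled = assembled.upper()
    reference = reference.upper()
    return (
        is_rotation(assembled, reference)
        or is_rotation(reverse_complement(assembled), reference)
    )
```

This only works when the assembly is expected to be base-for-base identical to the reference, as with error-free synthetic reads. For real data, an alignment-based comparison would be needed instead.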

Jason wants this to remain open to remind him to update the docs for E.coli.

PacificBiosciences / FALCON

E.coli output #17