PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Recommendations for increasing assembly length #11

Closed johnomics closed 9 years ago

johnomics commented 9 years ago

Following up on https://twitter.com/johnomics/status/557816525923811328

I'm trying to use FALCON to assemble a small amount of PacBio data from the butterfly Heliconius melpomene. The genome size estimate is 292 Mb (from flow cytometry). We have ~20x coverage with P4/C2, corrected with PBcR using a mixture of Illumina and 454 data (PacBio data is available here: http://www.ebi.ac.uk/ena/data/view/ERP005954). All the sequence has come from a partially inbred strain; roughly two thirds of the genome is still heterozygous. I am using the PacBio data to scaffold our existing genome assembly, so I want to maximise length over base-pair quality for the PacBio assembly.

I have tried a range of FALCON assemblies, summarised below. The results are very impressive, especially considering our limited data, and I'm willing to just stick with what we've got. However, the default Celera assembly produced by PBcR is 407 Mb long (22k scaffolds, N50 32 kb). I assume a lot of this is haplotype sequence, but even so it suggests that FALCON may be rejecting some parts of the genome during assembly. So I'm wondering if there's a way to tweak FALCON to include more sequence. Here's what I've tried so far, just including the parameters I've changed:

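| Version | length_cutoff | length_cutoff_pr | max_diff | max_cov | min_cov | ovlp Daligner -l | ovlp Daligner -s | Mean Read Length | Read Bases (bn) | Assembly Size (Mb) | Scaffolds | N50 (kb) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3000 | 1200 | 20 | 30 | 2 | 500 | 1000 | 3,514 | 3.75 | 239.7 | 6,751 | 78 |
| 2 | 500 | 500 | 20 | 30 | 2 | 500 | 1000 | 2,644 | 4.17 | 239.9 | 6,843 | 78 |
| 3 | 500 | 500 | 40 | 60 | 1 | 500 | 1000 | 2,644 | 4.17 | 249.5 | 7,127 | 77 |
| 4 | 500 | 500 | 40 | 60 | 1 | 350 | 500 | 2,644 | 4.17 | 248.8 | 7,027 | 80 |
| 5 | 500 | 500 | 50 | 100 | 1 | 350 | 500 | 2,644 | 4.17 | 250.9 | 7,073 | 80 |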
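
To put these runs in terms of the 292 Mb genome size estimate, here is a rough sketch that converts read bases to fold coverage and assembly size to a fraction of the estimate. It assumes that "Read Bases (bn)" means billions of bases; the function names are illustrative only and are not part of FALCON.

```python
GENOME_SIZE_MB = 292.0  # flow cytometry estimate quoted in this post

# (version, read bases in billions, assembly size in Mb), copied from the table.
# Assumption: "bn" in the table header means billions of bases.
runs = [
    (1, 3.75, 239.7),
    (2, 4.17, 239.9),
    (3, 4.17, 249.5),
    (4, 4.17, 248.8),
    (5, 4.17, 250.9),
]


def coverage(read_bases_bn, genome_mb=GENOME_SIZE_MB):
    """Fold coverage of the reads that passed the length cutoff."""
    return read_bases_bn * 1000.0 / genome_mb


def fraction_of_genome(assembly_mb, genome_mb=GENOME_SIZE_MB):
    """Assembly size as a fraction of the genome size estimate."""
    return assembly_mb / genome_mb


for version, read_bases_bn, assembly_mb in runs:
    print(
        f"v{version}: {coverage(read_bases_bn):.1f}x input, "
        f"{fraction_of_genome(assembly_mb):.1%} of genome estimate"
    )
```

On these numbers, the input to each run is roughly 13-14x and the assemblies cover about 82-86% of the estimated genome size.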

Any suggestions for further improvements that might push assembly length up to 290-300 Mb? Or is this likely to be the best we can do?

johnomics commented 9 years ago

Looks like https://github.com/PacificBiosciences/FALCON/commit/8c870d65e2edc2beb76921a44cb5555db3b9d95d might help. Will retry and report back.

johnomics commented 9 years ago

| Version | length_cutoff | length_cutoff_pr | max_diff | max_cov | min_cov | ovlp Daligner -l | ovlp Daligner -s | Mean Read Length | Read Bases (bn) | Assembly Size (Mb) | Scaffolds | N50 (kb) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 500 | 500 | 50 | 100 | 1 | 350 | 500 | 2,644 | 4.17 | 296.5 | 9,649 | 83 |

Exactly the length I was hoping for, in fewer scaffolds than our existing draft assembly. Excellent. Thanks for the bug fix!