johnomics closed this issue 9 years ago
Looks like https://github.com/PacificBiosciences/FALCON/commit/8c870d65e2edc2beb76921a44cb5555db3b9d95d might help. Will retry and report back.
Version | length_cutoff | length_cutoff_pr | max_diff | max_cov | min_cov | ovlp Daligner -l | ovlp Daligner -s | Mean Read Length (bp) | Read Bases (Gb) | Assembly Size (Mb) | Scaffolds | N50 (kb) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 500 | 500 | 50 | 100 | 1 | 350 | 500 | 2,644 | 4.17 | 296.5 | 9,649 | 83 |
Exactly the length I was hoping for, in fewer scaffolds than our existing draft assembly. Excellent. Thanks for the bug fix!
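For anyone comparing runs in the same way, here is a minimal Python sketch of the two summary figures discussed in this thread: N50 and the ratio of assembly size to the flow-cytometry estimate. The `n50` helper and the toy length list are illustrative assumptions, not output from FALCON; only the 296.5 Mb and 292 Mb figures come from the thread.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    N50 is the length L such that sequences of length >= L
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Toy lengths for illustration only (hypothetical, not from this assembly).
toy_lengths = [100_000, 80_000, 60_000, 40_000, 20_000]
toy_n50 = n50(toy_lengths)

# Figures quoted in the thread: assembly size and genome size estimate, in Mb.
assembly_mb = 296.5
estimate_mb = 292.0
ratio = assembly_mb / estimate_mb

print(f"toy N50: {toy_n50} bp")
print(f"assembly / estimate: {ratio:.3f}")
```

In practice the lengths would be read from the assembly FASTA rather than typed in by hand.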
Following up on https://twitter.com/johnomics/status/557816525923811328
I'm trying to use FALCON to assemble a small amount of PacBio data from the butterfly Heliconius melpomene. The genome size estimate is 292 Mb (from flow cytometry). We have ~20x coverage with P4/C2 chemistry, corrected with PBcR using a mixture of Illumina and 454 data (the PacBio data is available here: http://www.ebi.ac.uk/ena/data/view/ERP005954). All the sequence has come from a partially inbred strain; roughly two thirds of the genome is still heterozygous. I am using the PacBio data to scaffold our existing genome assembly, so I want to prioritise assembly length over base-pair quality for the PacBio assembly.
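As a rough sanity check on the numbers above, the sketch below works out how many bases ~20x coverage of a 292 Mb genome implies. The `coverage` helper is a hypothetical name introduced for illustration; the genome size and coverage figures are the ones quoted in this post.

```python
def coverage(total_bases, genome_size_bp):
    """Average depth of coverage: total sequenced bases / genome size."""
    return total_bases / genome_size_bp


genome_size_bp = 292e6   # flow cytometry estimate quoted above
target_coverage = 20     # approximate coverage quoted above

# Bases needed to reach the quoted coverage on the quoted genome size.
expected_bases = genome_size_bp * target_coverage

print(f"expected bases: {expected_bases / 1e9:.2f} Gb")
print(f"round trip: {coverage(expected_bases, genome_size_bp):.1f}x")
```

The same helper can be applied to the post-cutoff read totals of any run to see how much coverage actually goes into the assembly.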
I have tried a range of FALCON assemblies, summarised below. The results are very impressive, especially considering our limited data, and I'm willing to just stick with what we've got. However, the default Celera assembly produced by PBcR is 407 Mb long (22k scaffolds, N50 32 kb). I assume a lot of this is haplotype sequence, but even so it suggests that FALCON may be rejecting some parts of the genome during assembly. So I'm wondering if there's a way to tweak FALCON to include more sequence. Here's what I've tried so far, just including the parameters I've changed:
Any suggestions for further improvements that might push assembly length up to 290-300 Mb? Or is this likely to be the best we can do?