Recommendations for increasing assembly length

johnomics commented 9 years ago

Following up on https://twitter.com/johnomics/status/557816525923811328

I'm trying to use FALCON to assemble a small amount of PacBio data from the butterfly Heliconius melpomene. The genome size estimate is 292 Mb (from flow cytometry). We have ~20x coverage with P4/C2 chemistry, corrected with PBcR using a mixture of Illumina and 454 data (the PacBio data is available here: http://www.ebi.ac.uk/ena/data/view/ERP005954). All the sequence comes from a partially inbred strain; roughly two thirds of the genome is still heterozygous. I am using the PacBio data to scaffold our existing genome assembly, so for the PacBio assembly I want to maximise length rather than base-pair quality.

I have tried a range of FALCON assemblies, summarised below. The results are very impressive, especially considering our limited data, and I'm willing to just stick with what we've got. However, the default Celera assembly produced by PBcR is 407 Mb long (22k scaffolds, N50 32 kb). I assume a lot of this is haplotype sequence, but even so it suggests that FALCON may be rejecting some parts of the genome during assembly. So I'm wondering if there's a way to tweak FALCON to include more sequence. Here's what I've tried so far, just including the parameters I've changed:

Version	length_cutoff	length_cutoff_pr	max_diff	max_cov	min_cov	ovlp Daligner -l	ovlp Daligner -s	Mean Read Length (bp)	Read Bases (Gb)	Assembly Size (Mb)	Scaffolds	N50 (kb)
1	3000	1200	20	30	2	500	1000	3,514	3.75	239.7	6,751	78
2	500	500	20	30	2	500	1000	2,644	4.17	239.9	6,843	78
3	500	500	40	60	1	500	1000	2,644	4.17	249.5	7,127	77
4	500	500	40	60	1	350	500	2,644	4.17	248.8	7,027	80
5	500	500	50	100	1	350	500	2,644	4.17	250.9	7,073	80

Any suggestions for further improvements that might push assembly length up to 290-300 Mb? Or is this likely to be the best we can do?
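
To put the numbers in the table above in context, here is a minimal Python sketch that divides each run's read bases and assembly size by the 292 Mb flow-cytometry estimate. The figures are copied from the table; the names `GENOME_MB`, `RUNS`, and `summarise` are made up for illustration and are not part of FALCON.

```python
# Figures copied from the table above. The 292 Mb genome size is the
# flow-cytometry estimate quoted in the post. All names are illustrative.
GENOME_MB = 292.0

# version: (read bases in Gb, assembly size in Mb)
RUNS = {
    1: (3.75, 239.7),
    2: (4.17, 239.9),
    3: (4.17, 249.5),
    4: (4.17, 248.8),
    5: (4.17, 250.9),
}


def summarise(runs, genome_mb=GENOME_MB):
    """Return {version: (read coverage, fraction of genome estimate assembled)}."""
    summary = {}
    for version, (read_gb, assembly_mb) in runs.items():
        coverage = read_gb * 1000.0 / genome_mb  # Gb -> Mb, then per genome copy
        fraction = assembly_mb / genome_mb
        summary[version] = (coverage, fraction)
    return summary


if __name__ == "__main__":
    for version, (coverage, fraction) in summarise(RUNS).items():
        print(f"v{version}: {coverage:.1f}x reads, {fraction:.1%} of estimate")
```

On these figures, every run in the table recovers between roughly 82% and 86% of the estimated genome size, which is the gap the question is about.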

johnomics commented 9 years ago

Looks like https://github.com/PacificBiosciences/FALCON/commit/8c870d65e2edc2beb76921a44cb5555db3b9d95d might help. Will retry and report back.

johnomics commented 9 years ago

Version	length_cutoff	length_cutoff_pr	max_diff	max_cov	min_cov	ovlp Daligner -l	ovlp Daligner -s	Mean Read Length (bp)	Read Bases (Gb)	Assembly Size (Mb)	Scaffolds	N50 (kb)
6	500	500	50	100	1	350	500	2,644	4.17	296.5	9,649	83

Exactly the length I was hoping for, in fewer scaffolds than our existing draft assembly. Excellent. Thanks for the bug fix!
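
For anyone comparing the N50 column across runs, this is a minimal sketch of the usual N50 definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of all assembled bases. The function name `n50` and the example lengths are illustrative only; this is not how FALCON or PBcR report the statistic internally.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    # Only reached when the input is empty.
    raise ValueError("n50 is undefined for an empty list of lengths")


if __name__ == "__main__":
    # Made-up scaffold lengths in kb, purely for illustration.
    example = [80, 70, 50, 40, 30, 20]
    print(n50(example))
```

Because N50 depends on the total assembled length as well as on the size distribution, an assembly that grows by adding many short scaffolds can show a flat or lower N50 even when the long scaffolds are unchanged.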

PacificBiosciences / FALCON

Recommendations for increasing assembly length #11