Closed gabyrech closed 8 months ago
Rather ironically, the older pipeline used Canu, which was later replaced with Flye. My impression is that this decision was made to ensure the entire pipeline is compatible with ARM64 processors. Canu currently either lacks ARM64 support or requires a complex workaround.
Hi, thank you for taking the time to look into this. When you say it fails for short plasmids, approximately what length is that? As mentioned by @mcrone, we previously used the Canu assembler but switched so that we could support ARM64, and our internal tests showed Flye returned the same assemblies. We will investigate your feedback and see whether it would make sense to provide an additional Canu option.
Thanks @sarahjeeeze and @mcrone! I see, that makes sense. It fails for ~1.8 kb plasmids. I can provide the FASTQ files if that helps.
Thanks, yes, it would be great if you would not mind sharing the data that exemplifies the errors you are seeing and the solution with Canu. We are planning to investigate this, but it may not be for a few weeks.
Sure, I can share the fastq files.
Hi, sorry for missing this. Did you manage to share these with us in the end?
Hi Sarah! Yes, I did! I sent them by email on Nov 20. Please, let me know if you can't find them.
Sorry for the delay, this is still on our to-do list.
Are there any updates concerning this? I'm also having issues with the Flye assembly failing in the workflow. The following dataset (plasmid size 4.8m) also fails to assemble when running the clone validation workflow: https://figshare.com/articles/dataset/Ecoli_K12_MG1655_R10_3_HAC_11823087
(This is the example dataset for nanopore reads referenced in the Canu quick start guide)
Hi, thanks for letting us know. I am currently adding Canu as an option; it will be in the next release (6th March). @gabyrech, regarding your comment about removing the reconciliation step and subsampling: we do this in order to use trycycler, so we are planning to leave that in alongside the added Canu option for now. Let me know if you have any concerns.
Hi @sarahjeeeze, that's great! Thank you very much for considering (and implementing) this option! I know it might take quite a lot of work.
Regarding the subsampling and reconciliation steps, my major concern was reproducibility. Because there is some stochasticity and nondeterminism associated with these processes, you might get similar but slightly different consensus sequences when running the workflow with exactly the same parameters and input data. This is, however, just speculation for now; I haven't checked the real impact.
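To make the reproducibility concern concrete, here is a minimal sketch of how random read subsampling behaves with and without a fixed seed. It is not the workflow's actual code: `subsample_reads`, its `seed` parameter, and the read IDs are all hypothetical and only illustrate the general idea.

```python
import random


def subsample_reads(read_ids, n, seed=None):
    """Draw n read IDs without replacement (hypothetical helper).

    With seed=None the generator is seeded from system entropy,
    so two calls will usually return different subsets.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(read_ids, n))


# Made-up read IDs, for illustration only.
reads = [f"read_{i}" for i in range(1000)]

# A fixed seed gives the identical subset on every call.
seeded_a = subsample_reads(reads, 50, seed=42)
seeded_b = subsample_reads(reads, 50, seed=42)
print("seeded subsets identical:", seeded_a == seeded_b)

# Without a seed the subsets will almost certainly differ.
unseeded_a = subsample_reads(reads, 50)
unseeded_b = subsample_reads(reads, 50)
print("unseeded subsets identical:", unseeded_a == unseeded_b)
```

If the subsets differ between runs, every downstream step (assembly, reconciliation, polishing) starts from different input, which is where slightly different consensus sequences could come from.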
The trycycler paper may be worth a read. It found that using trycycler to find a consensus between multiple assemblies led to assemblies that were consistently more accurate: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02483-z We have found it to work well with our test sets. However, if you do come across a set that seems to output inconsistent assemblies, we would be interested in it, as we are looking into broadening our test set for the workflow.
Closing as this is now released in v1.2.0
Is your feature related to a problem?
I've found the current assembly strategy "Downsampling --> Subsampling --> Flye x3 --> Deconcatenation --> Reconciliation --> Polishing" is struggling at different levels:
1) Flye sometimes fails to reconstruct the assembly, especially for short plasmids.
2) In a barcoded run, samples can yield different numbers of reads (and therefore different coverage). The current approach allows only a single assm_coverage value for all samples, so some samples with slightly lower coverage end with "Failed due to insufficient reads" or "Failed to Subset reads".
3) Reproducibility: from Downsampling to Reconciliation, most steps may challenge reproducibility (i.e. running exactly the same pipeline twice with the same input can result in different assemblies).
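Point 2 can be illustrated with a small sketch that estimates fold coverage per barcode and reports which samples fall below one global threshold. This is only an illustration under stated assumptions: the function names, the read lengths, and the simple "total bases / expected size" estimate are hypothetical and are not taken from the workflow's code.

```python
def estimate_coverage(read_lengths, approx_size):
    """Fold coverage as total sequenced bases divided by expected plasmid size."""
    return sum(read_lengths) / approx_size


def samples_below_threshold(samples, approx_size, required_coverage):
    """Return {barcode: coverage} for samples under a single global threshold."""
    failing = {}
    for barcode, lengths in samples.items():
        coverage = estimate_coverage(lengths, approx_size)
        if coverage < required_coverage:
            failing[barcode] = coverage
    return failing


# Made-up read lengths for a 1.8 kb plasmid, for illustration only.
samples = {
    "barcode01": [1800] * 200,  # well covered
    "barcode02": [1800] * 55,   # slightly under the threshold
}

failing = samples_below_threshold(samples, approx_size=1800, required_coverage=60)
print(failing)
```

With one threshold for the whole run, `barcode02` is rejected even though it is only marginally short of the target; a per-sample value would let such samples through.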
Describe the solution you'd like
From a couple of tests on my side, Canu was able to reconstruct plasmid sequences for which the current strategy failed. I just used Canu's default parameters with the complete set of raw reads for the sample as input. Besides successfully reconstructing the plasmid sequences, this approach has other advantages:
It is worth mentioning this recent benchmarking study, in which the authors found Canu to outperform Flye for plasmid assembly, especially for short plasmids: Johnson J, Soehnlen M, Blankenship HM. Long read genome assemblers struggle with small plasmids. Microb Genom. 2023 May;9(5):mgen001024. doi: 10.1099/mgen.0.001024. PMID: 37224062; PMCID: PMC10272865.
Describe alternatives you've considered
Maybe have a pipeline parameter that lets you choose between the Flye workflow with all the current steps and Canu.
Another alternative could be to have the workflow first try the current Flye strategy and, if it fails, fall back to Canu.
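The fallback idea could look roughly like the sketch below. It is a generic pattern, not the workflow's implementation: `assemble_with_fallback`, `fake_flye`, and `fake_canu` are hypothetical stand-ins, and a real version would call the actual assemblers instead of these placeholder functions.

```python
def assemble_with_fallback(reads_path, assemblers):
    """Try each (name, function) pair in order; return the first success.

    Each function takes a reads path and returns an assembly path,
    or raises / returns None on failure. Returns None if all fail.
    """
    for name, run in assemblers:
        try:
            result = run(reads_path)
        except Exception as exc:
            print(f"{name} failed: {exc}")
            continue
        if result is not None:
            return name, result
        print(f"{name} produced no assembly")
    return None


# Hypothetical stand-ins for the real assembler calls.
def fake_flye(reads_path):
    raise RuntimeError("no contigs assembled")


def fake_canu(reads_path):
    return "canu_out/assembly.fasta"


outcome = assemble_with_fallback(
    "sample.fastq", [("flye", fake_flye), ("canu", fake_canu)]
)
print(outcome)
```

Keeping the assemblers as an ordered list means the same function also covers the simpler "choose one assembler by parameter" option: pass a list with a single entry.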
Additional context
I understand this is a big request, and adapting the current code to add this feature might take considerable effort, but I think it might be worth it, since it could make the workflow more robust by resolving those cases in which Flye is currently struggling (at least most of them). Thanks! Gabriel