epi2me-labs / wf-clone-validation


Considering adding an alternative assembly strategy using Canu assembler #35

Closed gabyrech closed 6 months ago

gabyrech commented 11 months ago

Is your feature related to a problem?

I've found the current assembly strategy "Downsampling --> Subsampling --> Flye x3 --> Deconcatenation --> Reconciliation --> Polishing" is struggling at different levels:

1) Flye sometimes fails to reconstruct the assembly, especially for short plasmids.

2) In a barcoded run, samples can end up with different numbers of reads (and therefore different coverage). The current approach allows a single assm_coverage value for all samples, so some samples with slightly lower coverage end up as "Failed due to insufficient reads" or "Failed to Subset reads".

3) Reproducibility: from Downsampling to Reconciliation, most steps may challenge reproducibility (i.e. running exactly the same pipeline twice with the same input can result in different assemblies).
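To illustrate point 2, here is a minimal sketch of how one shared coverage threshold plays out across barcodes with uneven yields. The function names, the per-barcode yields, the 5 kb construct size and the 60x target are all illustrative assumptions, not the workflow's actual code or defaults.

```python
def estimated_coverage(total_bases: int, approx_size: int) -> float:
    """Mean depth = sequenced bases / expected construct length."""
    if approx_size <= 0:
        raise ValueError("approx_size must be positive")
    return total_bases / approx_size


def samples_below_target(yields: dict[str, int], approx_size: int, target: float) -> list[str]:
    """Return the barcodes whose estimated coverage is below the shared target."""
    return sorted(
        name
        for name, bases in yields.items()
        if estimated_coverage(bases, approx_size) < target
    )


# Hypothetical per-barcode yields (in bases) for a 5 kb plasmid.
yields = {"barcode01": 900_000, "barcode02": 280_000, "barcode03": 310_000}
low = samples_below_target(yields, approx_size=5_000, target=60)
print(low)
```

With these made-up numbers, barcode02 sits at 56x and would be the only sample rejected, even though it is only marginally below the other barcodes. A per-sample threshold would avoid that kind of near-miss failure.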

Describe the solution you'd like

From a couple of tests on my side, Canu was able to reconstruct plasmid sequences on which the current strategy failed. I just used default parameters in Canu with the complete set of raw reads of the sample as input. Besides successfully reconstructing the plasmid sequences, this approach has other advantages:

It is worth mentioning this recent benchmarking study, in which the authors found Canu to outperform Flye for plasmid assembly, especially for short plasmids: Johnson J, Soehnlen M, Blankenship HM. Long read genome assemblers struggle with small plasmids. Microb Genom. 2023 May;9(5):mgen001024. doi: 10.1099/mgen.0.001024. PMID: 37224062; PMCID: PMC10272865.

Describe alternatives you've considered

Maybe having a pipeline parameter to choose between the Flye workflow with all the current steps and Canu.

Another alternative could be having the workflow try first the current Flye workflow and if it fails, try with Canu.
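The "try Flye first, then Canu" idea could be sketched roughly as follows. This is only an outline of the control flow under stated assumptions: `assemble_with_fallback`, `fake_flye` and `fake_canu` are hypothetical names, and the stand-in functions do not run the real tools.

```python
from typing import Callable, Optional, Sequence

# An assembler takes a reads path and returns FASTA text, or None if it produced nothing.
Assembler = Callable[[str], Optional[str]]


def assemble_with_fallback(
    reads_path: str, assemblers: Sequence[tuple[str, Assembler]]
) -> tuple[str, str]:
    """Try each assembler in order; return (name, assembly) for the first success."""
    errors = []
    for name, run in assemblers:
        try:
            result = run(reads_path)
        except Exception as exc:  # a failed assembler must not stop the fallback chain
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
        errors.append(f"{name}: no assembly produced")
    raise RuntimeError("all assemblers failed: " + "; ".join(errors))


# Stand-ins for illustration only; a real pipeline would launch the tools here.
def fake_flye(path: str) -> Optional[str]:
    raise RuntimeError("no contigs")


def fake_canu(path: str) -> Optional[str]:
    return ">contig1\nACGT\n"


used, assembly = assemble_with_fallback(
    "reads.fastq", [("flye", fake_flye), ("canu", fake_canu)]
)
print(used)
```

Recording which assembler produced the result (here the `used` value) would matter in practice, since the report should say which strategy each sample went through.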

Additional context

I understand this is a big request, and it might take a big effort to adapt the current code to add this feature, but I think it might be worth the effort, since it could add robustness to the workflow by resolving those cases in which Flye is currently struggling (at least most of them). Thanks! Gabriel

mcrone commented 11 months ago

Rather ironically the older pipeline used Canu and this was then updated to Flye. My impression is that this decision was taken in order to ensure compatibility of the entire pipeline with ARM64 processors. Currently Canu does not have support for ARM64 or it requires a complex workaround.

sarahjeeeze commented 11 months ago

Hi, thank you for taking the time to look into this. When you say it fails for short plasmids, approximately what length is that? As mentioned by @mcrone, we did previously use the Canu assembler but switched so that we could support ARM64, and our internal tests showed it to return the same assemblies. We will investigate your feedback and see if it would make sense to provide an additional Canu option.

gabyrech commented 11 months ago

Thanks @sarahjeeeze and @mcrone! I see, that makes sense. It fails for ~1.8 kb plasmids. I can provide the FASTQ files if that helps.

sarahjeeeze commented 10 months ago

Thanks, yes, it would be great if you would not mind sharing the data you have that exemplifies the errors you are seeing and the solution with Canu. We are planning to investigate this, but it may not be for a few weeks.

gabyrech commented 10 months ago

Sure, I can share the fastq files.

sarahjeeeze commented 9 months ago

Hi, sorry for missing this. Did you manage to share these with us in the end?

gabyrech commented 9 months ago

Hi Sarah! Yes, I did! I sent them by email on Nov 20. Please, let me know if you can't find them.

sarahjeeeze commented 8 months ago

Sorry for the delay, this is still on our to-do list.

alexandergfuji commented 7 months ago

Are there any updates concerning this? I'm also having issues with the Flye assembly failing in the workflow. The following dataset (plasmid size 4.8m) also fails to assemble when running the clone validation workflow: https://figshare.com/articles/dataset/Ecoli_K12_MG1655_R10_3_HAC_11823087

(This is the example dataset for nanopore reads referenced in the Canu quick start guide)

sarahjeeeze commented 7 months ago

Hi, thanks for letting us know. I am currently adding Canu as an option; it will be in the next release (6th March). @gabyrech, regarding your comment about removing the reconciliation step and subsampling: we do this in order to use Trycycler, so we are planning to leave that in alongside the added Canu tool option for now. Let me know if you have any concerns.

gabyrech commented 7 months ago

Hi @sarahjeeeze, that's great! Thank you very much for considering (and implementing) this option! I know it might take quite a lot of work.

Regarding the subsampling and reconciliation steps, my major concern was in terms of reproducibility. Because there is some stochasticity and nondeterminism associated with these processes, you might get similar but slightly different consensus sequences when running the workflow with the exact same parameters and input data. This is, however, just speculation for now; I haven't checked the real impact.
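As a small illustration of the subsampling part of that concern, here is a minimal sketch showing that a random subsample becomes repeatable once the random number generator is seeded. The `subsample` function and the read IDs are hypothetical; whether and how the workflow seeds its own subsampling is not stated in this thread, and a fixed seed would not remove nondeterminism inside the assemblers themselves.

```python
import random


def subsample(read_ids: list[str], n: int, seed: int) -> list[str]:
    """Draw n read IDs without replacement using a private, seeded RNG."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    # Sort first so the result does not depend on the input order of the reads.
    return rng.sample(sorted(read_ids), n)


reads = [f"read{i}" for i in range(100)]
first = subsample(reads, 10, seed=42)
second = subsample(reads, 10, seed=42)
print(first == second)
```

The same seed and the same input give the same subset on every run, so any remaining run-to-run differences would have to come from later steps.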

sarahjeeeze commented 7 months ago

The Trycycler paper may be worth a read; it found that using Trycycler to find a consensus between multiple assemblies led to assemblies that were consistently more accurate. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02483-z We have found it to work well with our test sets. However, if you do come across a set which seems to output inconsistent assemblies, we would be interested in it, as we are looking into broadening our test set for the workflow.

sarahjeeeze commented 6 months ago

Closing as this is now released in v1.2.0