Open potapovneb opened 5 months ago
Hi @potapovneb, many thanks for reaching out.
Sometimes, the computed value (or the failsafe 3000) is not suitable for various reasons.
Could you explain some of these reasons? Thanks!
Hi @julibeg,
Flye computes the minimum overlap from the observed read length distribution; I believe it uses the N90 value. This works fine for most samples. In my case, I extract a small subset of reads based on their length (say, any read from 4900 nt to 5100 nt) in order to build a plasmid assembly for a specific peak in the read length distribution (5000 nt in this example). In cases like this, where there is very little variation in read length, the N90 value computed by Flye is too high and the assembly fails. Manually overriding --min-overlap would be useful.
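To illustrate the point, here is a minimal Python sketch of the standard N90 statistic applied to a length-filtered subset. The function name `n_stat` and the evenly spaced example lengths are made up for illustration; this is not Flye's code, only the textbook definition of N90.

```python
def n_stat(lengths, fraction=0.9):
    """Return the Nx statistic: the length L such that reads of
    length >= L together hold at least `fraction` of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return min(lengths)


# Hypothetical subset: reads filtered to 4900-5100 nt, one read every 10 nt.
subset = list(range(4900, 5101, 10))
n90 = n_stat(subset)
print(n90)  # 4920
```

With so little spread in read length, N90 lands within about 100 nt of the full plasmid length, so an overlap threshold derived from it leaves almost no room for reads to overlap.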
Hi @potapovneb, makes sense; thanks for the further information!
We will consider exposing Flye's --min-overlap parameter in a future release.
This is also critical for us, and it triggers a bug. It seems that Flye calculates the minimum overlap based on N90 but also rounds UP to the nearest kb. This probably works for genomes or large constructs, but for smaller plasmids it often produces a minimum overlap that is larger than the entire template, especially if the library doesn't have many smaller reads (i.e. it is mostly linearised circular plasmid). We often see failed assemblies for (what I suspect is) this reason.
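A small Python sketch of the arithmetic behind this suspicion. The round-up-to-1-kb rule is the behaviour guessed at above, not something confirmed from Flye's source, and the plasmid sizes and N90 values are invented for illustration.

```python
import math


def round_up_to_kb(value, step=1000):
    """Round `value` up to the next multiple of `step` (assumed behaviour)."""
    return math.ceil(value / step) * step


# (plasmid length, N90 of its reads): hypothetical numbers for a library
# made up almost entirely of full-length linearised plasmid reads.
examples = [(2700, 2650), (5000, 4920)]

for plasmid_len, read_n90 in examples:
    overlap = round_up_to_kb(read_n90)
    # An overlap threshold >= the template length can never be met.
    print(plasmid_len, overlap, overlap >= plasmid_len)
```

In both invented cases the rounded value (3000 and 5000) is at least as long as the plasmid itself, which would leave no read pair able to satisfy the threshold.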
This is valuable input, thank you! Have you seen similar issues when running the workflow with Canu?
I haven't rigorously tested both solutions, but on the occasion where we see the failed assemblies, they are almost always resolved by Canu.
Thanks for letting us know. We are aware that Canu sometimes assembles smaller plasmids fine where Flye fails, but Canu does not work on Mac ARM, which is why we offer both and keep Flye as the default. Once Canu supports ARM, which is in the pipeline, we will consider changing the default to Canu. In the meantime we will look into exposing the min overlap.
We would much prefer to use Flye instead of Canu, because Canu (or something else in the pipeline) regularly appears to introduce small (<200bp) errors in the assemblies, which we suspect is related to read trimming. But the current behaviour of rounding up to the nearest 1kb (if that's what's happening) prevents us from using the Flye option.
The min overlap for Flye is 1000; it complains if you go lower, with the error --min-overlap: value should be in the range [1000, 10000]
If you look at the Flye repo, I think it's explained why somewhere. But for Canu mode, if you set the trim_length parameter of the workflow to 0, do you still get the 200bp errors?
We saw another set of Canu assemblies that were ~200bp too short, and the --trim_length parameter seems to have solved the issue.
We'd still prefer to use flye though, since flye seems to do a better job in general. But, the min-overlap problem causes too many failures.
Is your feature related to a problem?
When --assembly_tool is set to flye, the Flye assembler estimates the minimum read overlap automatically. Sometimes, the computed value (or the failsafe 3000) is not suitable for various reasons.
Describe the solution you'd like
It would be great to pass something like --flye-min-overlap to wf-clone-validation. This flag could be set to 'auto' or to an actual value specified by the user.
Describe alternatives you've considered
Manually editing main.nf in wf-clone-validation to a required value.
Additional context
No response
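A minimal Python sketch of how the requested option could be handled, accepting either 'auto' or an explicit value. The helper names `parse_min_overlap` and `flye_overlap_args` are hypothetical and not part of wf-clone-validation; the [1000, 10000] range is taken from the Flye error message quoted in this thread, and --min-overlap is the Flye option discussed above.

```python
# Range quoted in the Flye error message reported in this thread.
FLYE_MIN, FLYE_MAX = 1000, 10000


def parse_min_overlap(value):
    """Return None for 'auto' (let Flye choose) or a validated integer."""
    if str(value).strip().lower() == "auto":
        return None
    overlap = int(value)  # raises ValueError for non-numeric input
    if not FLYE_MIN <= overlap <= FLYE_MAX:
        raise ValueError(
            f"min overlap should be in the range [{FLYE_MIN}, {FLYE_MAX}], "
            f"got {overlap}"
        )
    return overlap


def flye_overlap_args(value):
    """Build the extra Flye arguments for the given option value."""
    overlap = parse_min_overlap(value)
    return [] if overlap is None else ["--min-overlap", str(overlap)]


print(flye_overlap_args("auto"))
print(flye_overlap_args("2000"))
```

Rejecting out-of-range values before Flye is launched would surface the problem as a clear parameter error rather than a failed assembly step.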