epi2me-labs / wf-clone-validation

Adding flye's --min-overlap command line option as a param to wf-clone-validation #52

Open potapovneb opened 5 months ago

potapovneb commented 5 months ago

Is your feature related to a problem?

When --assembly_tool is set to flye, the flye assembler estimates the minimum read overlap automatically. Sometimes, the computed value (or the failsafe 3000) is not suitable for various reasons.

Describe the solution you'd like

It would be great to pass something like --flye-min-overlap to wf-clone-validation. This flag could be set to 'auto' or to an actual value specified by the user.

Describe alternatives you've considered

Manually editing main.nf in wf-clone-validation to a required value.

Additional context

No response

julibeg commented 5 months ago

Hi @potapovneb, many thanks for reaching out.

"Sometimes, the computed value (or the failsafe 3000) is not suitable for various reasons."

Could you explain some of these reasons? Thanks!

potapovneb commented 5 months ago

Hi @julibeg,

Flye computes the minimum overlap value based on the observed read length distribution. I believe it takes the N90 value as the minimum overlap. It works fine for most samples. In my case, I extract a small subset of the reads based on their read length (let's say any read from 4900 nt to 5100 nt). This is done to build a plasmid assembly for a specific peak in the read length distribution (let's say 5000 nt in this case). In cases like this (when there is very little variation in read lengths), the N90 value computed by flye is too high and the assembly fails. Manually overriding --min-overlap would be useful.
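To make the concern above concrete, here is a minimal Python sketch of an N90 calculation on a size-selected read set. It is not Flye's code: the function name `nxx` and the example read lengths are made up for illustration, and whether Flye uses exactly this statistic is the commenter's belief rather than something confirmed in this thread.

```python
def nxx(lengths, fraction=0.9):
    """Return the length L such that reads of length >= L together
    contain at least `fraction` of all bases (N90 when fraction=0.9)."""
    if not lengths:
        raise ValueError("no read lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until the target share of bases is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return min(lengths)


# Hypothetical size-selected subset: reads between 4900 and 5100 nt.
reads = list(range(4900, 5101, 10))
n90 = nxx(reads)
print(n90)  # 4920, i.e. almost the full length of a ~5000 nt plasmid
```

With so little spread in read length, the N90 sits just below the read length itself, so an overlap threshold derived from it leaves almost no room for reads to overlap.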

julibeg commented 4 months ago

Hi @potapovneb, makes sense; thanks for the further information! We will consider exposing flye's --min-overlap parameter in a future release.

scottcoutts commented 3 months ago

This is also critical for us and it triggers a bug: It seems like perhaps Flye calculates the minimum overlap based on N90, but also rounds UP to the nearest kb. This probably works for genomes or large constructs, but for smaller plasmids, this often causes a minimum overlap size that is larger than the entire template, especially if the library doesn't have many smaller reads (i.e. mostly linearised circular plasmid). We often see failed assemblies for (what I suspect is) this reason.
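The suspected effect can be sketched in a few lines of Python. This only illustrates the behaviour described in the comment above (N90 rounded up to the next kb); it is an assumption, not a confirmed description of what Flye does, and the helper name `round_up_to_kb` and all numbers are hypothetical.

```python
import math


def round_up_to_kb(value, step=1000):
    """Round `value` up to the next multiple of `step` (1 kb by default)."""
    return math.ceil(value / step) * step


# Hypothetical plasmids with mostly full-length (linearised) reads:
# (template length, N90 of the reads), both in nt.
examples = [(5000, 4920), (2700, 2650)]

for template_length, n90 in examples:
    min_overlap = round_up_to_kb(n90)
    # An overlap threshold at or above the template length cannot be met
    # by reads that are no longer than the template itself.
    print(template_length, n90, min_overlap, min_overlap >= template_length)
```

In both hypothetical cases the rounded value is at least as long as the template, which would match the failed assemblies reported here.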

julibeg commented 3 months ago

This is valuable input, thank you! Have you seen similar issues when running the workflow with Canu?

scottcoutts commented 3 months ago

I haven't rigorously tested both solutions, but on the occasion where we see the failed assemblies, they are almost always resolved by Canu.

sarahjeeeze commented 3 months ago

Thanks for letting us know. We know that Canu sometimes assembles fine where Flye fails for smaller plasmids, but Canu does not work on Mac ARM, which is why we offer both and have Flye as the default. Once Canu supports ARM, which is in the pipeline, we will consider changing the default to Canu. In the meantime we will look into exposing min overlap.

micromongenomics commented 1 month ago

We would much prefer to use Flye instead of Canu, because Canu (or something else in the pipeline) appears to regularly make small (<200bp) errors in the assemblies due to what we suspect is something to do with read trimming. But the current behaviour with rounding up to the nearest 1kb (if that's what's happening) prevents us from using the Flye option.

sarahjeeeze commented 1 month ago

The minimum overlap for flye is 1000; it complains if you go lower, with the error "--min-overlap: value should be in the range [1000, 10000]". If you look at the flye repo, I think it's explained why somewhere. But for Canu mode, if you set the trim_length parameter of the workflow to 0, do you still get the 200bp errors?
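For reference, one way the option requested in this issue could be validated is sketched below in Python. This is not the workflow's implementation: the function names are hypothetical, the 'auto' keyword comes from the original request, and the range is taken from the error message quoted above. Only `--min-overlap` itself is an existing flye option.

```python
# Range taken from the flye error message quoted in this thread.
MIN_OVERLAP_RANGE = (1000, 10000)


def parse_min_overlap(value):
    """Return None for 'auto', otherwise a validated integer overlap."""
    text = str(value).strip().lower()
    if text == "auto":
        return None
    try:
        overlap = int(text)
    except ValueError:
        raise ValueError(f"expected 'auto' or an integer, got {value!r}")
    low, high = MIN_OVERLAP_RANGE
    if not low <= overlap <= high:
        raise ValueError(f"min overlap must be in the range [{low}, {high}]")
    return overlap


def flye_overlap_args(value):
    """Build the extra flye arguments; empty when flye should decide itself."""
    overlap = parse_min_overlap(value)
    return [] if overlap is None else ["--min-overlap", str(overlap)]


print(flye_overlap_args("auto"))
print(flye_overlap_args(2000))
```

Rejecting out-of-range values up front would surface the problem before flye is launched, although it would not help with templates shorter than the 1000 nt lower bound.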

scottcoutts commented 1 month ago

We saw another set of Canu assemblies that were ~200bp too short, and the --trim_length parameter seems to have solved the issue.

We'd still prefer to use flye though, since flye seems to do a better job in general. But, the min-overlap problem causes too many failures.