Inconsistent output assemblies with the same input reads

zihengluo commented 2 years ago

What happened?

Hi, I ran the workflow six times (three times with v0.2.1 and three times with v0.2.4) on the same input data, and the six output assemblies all differ. What could be causing this inconsistency?

Here is the command I used: nextflow run epi2me-labs/wf-clone-validation -profile conda --fastq {input_path} --db_directory wf-clone-validation-db --out_dir {output_path}

Here is an overview of the input data:

[Screenshots: overview of the input data]

I mapped the six output assemblies against the reference sequence and found that none of them fully recovers the reference; each run loses different regions. [Screenshot: assemblies mapped against the reference]

However, the read mapping results show that the reads fully cover the reference. I used minimap2 with secondary alignments disabled. [Screenshot: reads mapped against the reference]
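The check described above (which parts of the reference are covered by an assembly or by the reads) can be sketched in a few lines of Python. This is only an illustration, assuming the alignments were written in minimap2's PAF format; the function names and the example records are hypothetical and not part of the workflow.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]


def covered_intervals(paf_lines: Iterable[str], target: str) -> List[Interval]:
    """Merge the target intervals of all PAF alignments that hit `target`."""
    spans = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        # PAF has 12 mandatory columns; column 6 is the target name.
        if len(fields) < 12 or fields[5] != target:
            continue
        # Columns 8 and 9: target start (0-based) and target end.
        spans.append((int(fields[7]), int(fields[8])))
    spans.sort()
    merged: List[Interval] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uncovered_gaps(merged: List[Interval], target_len: int) -> List[Interval]:
    """Return the regions of the target that no alignment covers."""
    gaps = []
    pos = 0
    for start, end in merged:
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < target_len:
        gaps.append((pos, target_len))
    return gaps


# Hypothetical PAF records: one assembly aligned to a 7000 bp reference.
example = [
    "asm\t5000\t0\t3000\t+\tref\t7000\t0\t3000\t3000\t3000\t60",
    "asm\t5000\t3000\t5000\t+\tref\t7000\t4500\t6500\t2000\t2000\t60",
]
print(uncovered_gaps(covered_intervals(example, "ref"), 7000))
```

Running the same function on each run's assembly-to-reference alignments would list, per run, which reference regions were lost.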

Operating System

ubuntu 20.04

Workflow Execution

Command line

Workflow Execution - EPI2ME Labs Versions

No response

Workflow Execution - Execution Profile

Conda

Workflow Version

0.2.1 & 0.2.4

Relevant log output

No error appeared

sarahjeeeze commented 2 years ago

Hi, that's interesting. Does your insert contain any repetitive sections or homopolymers? And what happens if you increase the coverage with the param --assm_coverage (the default is 60)? Nothing has changed between the versions that would impact the assembly.

sarahjeeeze commented 2 years ago

Oh, also maybe change the param --approx_size to match your data. It's currently set to 7000; maybe try 10000.
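As a rough sketch, the two suggested overrides could be added to the original command like this. The helper `build_command` and the input paths are hypothetical; `--approx_size 10000` comes from the suggestion above, while the `--assm_coverage` value of 100 is only an illustrative number above the default of 60.

```python
import shlex
from typing import List, Optional


def build_command(
    fastq: str,
    db_directory: str,
    out_dir: str,
    approx_size: Optional[int] = None,
    assm_coverage: Optional[int] = None,
    profile: str = "conda",
) -> List[str]:
    """Assemble the nextflow command line, adding overrides only when given."""
    cmd = [
        "nextflow", "run", "epi2me-labs/wf-clone-validation",
        "-profile", profile,
        "--fastq", fastq,
        "--db_directory", db_directory,
        "--out_dir", out_dir,
    ]
    if approx_size is not None:
        cmd += ["--approx_size", str(approx_size)]
    if assm_coverage is not None:
        cmd += ["--assm_coverage", str(assm_coverage)]
    return cmd


# Hypothetical paths; 100 is an illustrative coverage value, not a recommendation.
cmd = build_command(
    "reads/", "wf-clone-validation-db", "output/",
    approx_size=10000, assm_coverage=100,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)`.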

thyagoleal commented 1 year ago

I'm having the same issue. I noticed that the number of reads differs significantly between the different versions (on the histogram), so I think some tools might have changed their default values. Maybe a cutoff is being passed somewhere by default. I dunno.

I saw these differences between the latest version, 0.2.12, and 0.2.8.

sarahjeeeze commented 1 year ago

Thanks for pointing this out. We have changed the cutoff for the data we report in the raw data QC tab. We will look into either restoring it or updating the changelog to explain the change.

The actual assembly method has stayed the same, so it shouldn't impact the output assemblies. Are you working with multiple barcodes? If so, are they all approximately the same size? How different are your output assemblies between versions?
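One quick way to answer "how different" is to test whether two assemblies are the same sequence apart from start position and strand. The sketch below assumes the constructs are circular plasmids, so a rotated or reverse-complemented assembly still counts as identical; the function names are hypothetical and this is not something the workflow itself provides.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def same_circular_sequence(a: str, b: str) -> bool:
    """True if b equals a up to rotation and/or reverse complement."""
    a, b = a.upper(), b.upper()
    if len(a) != len(b):
        return False
    # Every rotation of a circular sequence is a substring of the sequence doubled.
    doubled = a + a
    return b in doubled or reverse_complement(b) in doubled


print(same_circular_sequence("ACGTTG", "TTGACG"))  # rotation of the same circle
print(same_circular_sequence("ACGTTG", "ACGTTA"))  # a real difference
```

Assemblies that fail this check differ in content, which would then be worth inspecting by alignment.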

thyagoleal commented 1 year ago

The assemblies are the same; only the plot differed.

sarahjeeeze commented 1 year ago

Hi, we have since updated the workflow to restore the raw data QC plots, which now show just the raw, unfiltered data.

sarahjeeeze commented 1 year ago

Closing, assuming the issue is resolved.

epi2me-labs / wf-clone-validation