epi2me-labs / wf-clone-validation

Failed to resolve homopolymeric plasmid DNA sequences or poly A tail in plasmids #36

Closed chatla01 closed 6 months ago

chatla01 commented 7 months ago

Operating System

macOS

Other Linux

Ubuntu 20.04

Workflow Version

v0.3.1

Workflow Execution

EPI2ME cloud agent

EPI2ME Version

v0.3.1

CLI command run

I am running EPI2ME Labs wf-clone-validation v0.3.1 on a Mac as well as on a GridION. Homopolymeric regions longer than 20 bp and poly-A tails in plasmids are not resolved. I used the RBK114.24 kit with 10.4.1 flow cells, and the Dorado basecaller in super-accuracy mode with modified bases on. I feed the fastq_pass folder to wf-clone-validation v0.3.1 on the GridION and/or a MacBook Pro with an M1; in either case it does not resolve homopolymeric regions longer than 20 bp or poly-A tails, and when the workflow is run multiple times, a different number of poly-A bases is captured each time. I have attached some screenshots. Any suggestions to overcome this problem?

[Three screenshots attached]

Workflow Execution - CLI Execution Profile

standard (default)

What happened?

Homopolymeric regions longer than 20 bp and poly-A tails in plasmids are not resolved.

Screenshot 2023-11-13 at 11 56 32 AM

Relevant log output

!! Only displaying parameters that differ from the pipeline defaults !!
--------------------------------------------------------------------------------
If you use epi2me-labs/wf-clone-validation for your analysis please cite:
* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x
--------------------------------------------------------------------------------
This is epi2me-labs/wf-clone-validation vmain.
--------------------------------------------------------------------------------
Checking fastq input.
[82/7b3dab] Submitted process > pipeline:getParams
[9d/c6fb23] Submitted process > pipeline:medakaVersion
[8f/1d7816] Submitted process > pipeline:lookup_medaka_model (1)
[73/f00ffc] Submitted process > pipeline:getVersions
[23/fa8de2] Submitted process > fastcat (3)
[30/459d79] Submitted process > fastcat (2)
[c0/139de9] Submitted process > fastcat (4)
[12/279968] Submitted process > fastcat (1)
[f4/513910] Submitted process > pipeline:checkIfEnoughReads (1)
[05/d5b055] Submitted process > pipeline:checkIfEnoughReads (2)
[01/47a8fc] Submitted process > pipeline:checkIfEnoughReads (3)
[5e/60440a] Submitted process > pipeline:checkIfEnoughReads (4)
[62/6f917d] Submitted process > pipeline:assembleCore (1)
[57/152716] Submitted process > pipeline:assembleCore (2)
[f3/fed8eb] Submitted process > pipeline:assembleCore (3)
[df/14e69b] Submitted process > pipeline:assembleCore (4)
[99/36b2b0] Submitted process > pipeline:downsampledStats (1)
[01/545cc6] Submitted process > pipeline:downsampledStats (2)
[ec/c6ded7] Submitted process > pipeline:downsampledStats (3)
[25/989ecc] Submitted process > pipeline:downsampledStats (4)
[c4/4f9bc2] Submitted process > pipeline:medakaPolishAssembly (1)
[9b/b34cab] Submitted process > pipeline:medakaPolishAssembly (2)
[07/d2af86] Submitted process > pipeline:medakaPolishAssembly (3)
[6a/27ce30] Submitted process > pipeline:medakaPolishAssembly (4)
[63/1747ee] Submitted process > output (1)
[4b/943d0d] Submitted process > pipeline:findPrimers (1)
[d2/becda3] Submitted process > output (2)
[31/034199] Submitted process > pipeline:findPrimers (2)
[55/cd2312] Submitted process > pipeline:findPrimers (3)
[a4/c505f9] Submitted process > output (3)
[2f/db1fde] Submitted process > output (4)
[76/022dd0] Submitted process > pipeline:findPrimers (4)
[5b/526386] Submitted process > pipeline:runPlannotate (1)
[51/1836fe] Submitted process > pipeline:inserts
[2c/c7800a] Submitted process > pipeline:report (1)
[f9/8ce240] Submitted process > output (10)
[00/53a01e] Submitted process > output (5)
[42/9e2e1c] Submitted process > output (6)
[94/1d8978] Submitted process > output (8)
[42/fe04db] Submitted process > output (7)
[5e/5236b6] Submitted process > output (9)

Application activity log entry

N/A

cjw85 commented 7 months ago

Hi @chatla01,

The workflow creates the output sequence through the joint use of flye as an assembler and medaka as a consensus polisher.

Both of these tools work only from information stored in fastq (not primary instrument data). Their ability to resolve homopolymers is therefore tied to the basecaller output, the evidence presented in the fastq files.
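To look at that evidence directly, one option is to tally how long the poly-A run is in each read. The following is a minimal sketch, not part of the workflow: the helper names (`read_fastq_sequences`, `longest_run`, `run_length_counts`) and the three in-memory reads are made up for illustration, and it assumes plain 4-line FASTQ records.

```python
import io
import re
from collections import Counter


def read_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def longest_run(seq, base="A"):
    """Return the length of the longest uninterrupted run of `base` (0 if none)."""
    runs = re.findall(f"{re.escape(base)}+", seq.upper())
    return max((len(r) for r in runs), default=0)


def run_length_counts(handle, base="A"):
    """Count how many reads have each longest-run length for `base`."""
    return Counter(longest_run(s, base) for s in read_fastq_sequences(handle))


# Hypothetical reads: two carry a 22 bp poly-A run, one carries 19 bp.
example = io.StringIO(
    "@r1\nGGC" + "A" * 22 + "TCG\n+\n" + "I" * 28 + "\n"
    "@r2\nGGC" + "A" * 19 + "TCG\n+\n" + "I" * 25 + "\n"
    "@r3\nGGC" + "A" * 22 + "TCG\n+\n" + "I" * 28 + "\n"
)
counts = run_length_counts(example)
for length, n_reads in sorted(counts.items()):
    print(f"longest poly-A run of {length} bp: {n_reads} read(s)")
```

For real data, pass an open file handle instead of the `io.StringIO` object. Reads from the reverse strand would show the same region as a poly-T run, so `base="T"` may also be worth checking. A wide spread of run lengths across reads suggests the basecalls themselves disagree about the homopolymer length.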

In the case of the workflow, you can obtain differing results from multiple runs because the code samples the input data in a random fashion. Typically this does not affect the result, as information from random samples is generally consistent. However, there is more variation in the output of the basecaller around long homopolymers, so random samples can behave differently.
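A toy illustration of this effect, under loud assumptions: the per-read homopolymer lengths below are simulated rather than taken from real data, and the median of a random subsample stands in for the consensus call (the workflow itself uses flye and medaka, not a median). The function name `subsample_consensus_length` is hypothetical.

```python
import random
from statistics import median


def subsample_consensus_length(lengths, n, seed=None):
    """Draw n reads at random and return the median homopolymer length.

    The median is only a stand-in for a real consensus step.
    """
    rng = random.Random(seed)
    return median(rng.sample(lengths, n))


# Simulated evidence: a short homopolymer that every read calls as 12 bp,
# and a long one where the calls are spread between 18 and 26 bp.
sim = random.Random(0)
consistent = [12] * 200
noisy = [sim.choice(range(18, 27)) for _ in range(200)]

# Repeat the subsampling with ten different seeds, as ten runs would.
consistent_calls = {subsample_consensus_length(consistent, 50, seed=s) for s in range(10)}
noisy_calls = {subsample_consensus_length(noisy, 50, seed=s) for s in range(10)}

print("distinct results, consistent evidence:", sorted(consistent_calls))
print("distinct results, noisy evidence:", sorted(noisy_calls))
```

When every read agrees, any subsample gives the same answer. When the reads disagree, the answer depends on which reads happen to be drawn, which matches the run-to-run variation in poly-A length described in the report.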

You may wish to open a discussion on dorado to discuss improvements to the basecalling.

chatla01 commented 7 months ago

Thank you @Chris. I will look into dorado discussion.