[Bug]: Workflow ends with Error status due to missing basecaller config

onkeld commented 1 year ago

What happened?

Workflow was set up in Epi2Me Labs Desktop app Version 4.0.0 on Windows 10.

The samples were run on a Flongle flow cell with pore version 9.4.1 (FLO-FLG001). We selected the corresponding basecaller configuration, "dna_r9.4.1_e8_sup@v3.3", from the Dorado models list. As we are only trying to call SNPs, we did not provide a Remora basecalling model.

The Workflow almost immediately stops with an Error.

Apparently, the model "dna_r9.4.1_e8_sup@v3.3" is listed as an available model, but it still can't be opened because it is supposedly an unknown model.

Is there something wrong with our Epi2Me Labs installation or does dorado have problems downloading model files?

Any suggestions would be welcome. If you require further information, let us know.

Operating System

Windows 10

Workflow Execution

EPI2ME Labs desktop application

Workflow Execution - EPI2ME Labs Versions

4.0.0

Workflow Execution - CLI Execution Profile

None

Workflow Version

1.1.0

Relevant log output

Error executing process > 'basecalling:wf_dorado:dorado (2)'
Caused by:
  Process `basecalling:wf_dorado:dorado (2)` terminated with an error exit status (1)
Command executed:
  echo '***'
  echo 'Available models:'
  list-models | sed 's,^,- ,' | sed "s,${DRD_MODELS_PATH}/,,"
  echo '***'
  echo 'You selected:'
  echo "Basecalling model: dna_r9.4.1_e8_sup@v3.3"
  echo "Remora model     : null"
  echo '***'
  echo 'A file open error below indicates that you have entered an unknown model name.'
  echo 'It is possible the model you selected worked previously but has been updated to a new version.'
  echo 'Resubmit this workflow with an appropriate model from the model list above.'
  echo '***'

  dorado basecaller         ${DRD_MODELS_PATH}/dna_r9.4.1_e8_sup@v3.3 .                           --device cuda:all | samtools view -b -o 1.ubam -
Command exit status:
  1
Command output:
  ***
  Available models:
  - dna_r10.4.1_e8.2_260bps_fast@v4.0.0
  - dna_r10.4.1_e8.2_260bps_hac@v4.0.0
  - dna_r10.4.1_e8.2_260bps_sup@v4.0.0
  - dna_r10.4.1_e8.2_400bps_fast@v4.0.0
  - dna_r10.4.1_e8.2_400bps_fast@v4.0.0_5mCG_5hmCG@v2
  - dna_r10.4.1_e8.2_400bps_hac@v4.0.0
  - dna_r10.4.1_e8.2_400bps_hac@v4.0.0_5mCG_5hmCG@v2
  - dna_r10.4.1_e8.2_400bps_sup@v4.0.0
  - dna_r10.4.1_e8.2_400bps_sup@v4.0.0_5mCG_5hmCG@v2
  - dna_r10.4.2_e8.2_4khz_stereo@v1.0
  - dna_r9.4.1_e8_fast@v3.4
  - dna_r9.4.1_e8_fast@v3.4_5mCG@v0
  - dna_r9.4.1_e8_hac@v3.3
  - dna_r9.4.1_e8_hac@v3.4_5mCG@v0
  - dna_r9.4.1_e8_sup@v3.3
  - dna_r9.4.1_e8_sup@v3.4_5mCG@v0
  - rna003_120bps_sup@v3
  ***
  You selected:
  Basecalling model: dna_r9.4.1_e8_sup@v3.3
  Remora model     : null
  ***
  A file open error below indicates that you have entered an unknown model name.
  It is possible the model you selected worked previously but has been updated to a new version.
  Resubmit this workflow with an appropriate model from the model list above.
  ***
Command error:
  [2023-01-19 08:28:57.705] [info] > Creating basecall pipeline
  [main_samview] fail to read the header from "-".
Work dir:
  /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/instances/wf-human-variation_95c76bfa-7e4e-4dc0-ba63-59c939af11e7/work/84/d07806a3f690340099f689701db2e1
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
WARN: Killing running tasks (3)

SamStudio8 commented 1 year ago

Hi @onkeld, sorry for the confusion. That error message is printed generically to help users understand the case where dorado encounters a file open error, which is not the case here. The 'fail to read the header from "-"' message is usually seen if the process has run out of memory, or if dorado has not output a header because the input data is invalid.
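One way to see which half of the pipe is actually failing is to run the two commands as separate steps rather than piping one into the other. The following is only a sketch under assumptions: run_staged is a hypothetical helper (not part of the workflow), and the two arguments are placeholders for the basecaller command and the samtools command from the log above.

```shell
# Sketch: run a producer and a consumer as two separate steps instead of a
# pipe, so the failing stage and its stderr are visible.
# run_staged is a hypothetical helper, not something the workflow provides.
run_staged() {
    producer=$1   # command line whose stdout would normally feed the pipe
    consumer=$2   # command line that reads the staged data on stdin
    stage=$(mktemp) || return 2
    err=$(mktemp) || return 2

    # Stage 1: keep stdout and stderr in separate files.
    if ! sh -c "$producer" >"$stage" 2>"$err"; then
        echo "producer failed; stderr follows:" >&2
        cat "$err" >&2
        rm -f "$stage" "$err"
        return 1
    fi

    # An empty stage file means the consumer would receive no header at all.
    if [ ! -s "$stage" ]; then
        echo "producer exited 0 but wrote nothing to stdout" >&2
        rm -f "$stage" "$err"
        return 1
    fi

    # Stage 2: only now hand the data to the consumer.
    sh -c "$consumer" <"$stage"
    status=$?
    rm -f "$stage" "$err"
    return "$status"
}
```

Called with the dorado command as the first argument and the samtools command as the second, this would report a basecaller failure directly instead of surfacing it as a samtools header error.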

onkeld commented 1 year ago

I just reran the workflow, checking the points mentioned. I don't see how I could have run out of memory for Dorado, as my Docker has been provided with 24 GB of memory in the configuration and only reported a memory use of up to 15 GB during the run.

Could the input data be invalid because I provided .fast5 files instead of .pod5? The Workflow Setting "dorado_ext" was set to fast5.

Any suggestions as to how I could troubleshoot further?

SamStudio8 commented 1 year ago

Hi @onkeld, thanks for checking. Does the output file at /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/instances/wf-human-variation_95c76bfa-7e4e-4dc0-ba63-59c939af11e7/work/84/d07806a3f690340099f689701db2e1/1.ubam exist and have any data?
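A quick way to answer that from a WSL shell is sketched below. check_output is a hypothetical helper name; it only distinguishes a missing file, an empty file, and a file with data.

```shell
# Sketch: report whether a task output exists and contains data.
# check_output is a hypothetical helper, not part of the workflow.
check_output() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "missing: $f"
        return 1
    elif [ ! -s "$f" ]; then
        echo "empty: $f"
        return 1
    fi
    # wc -c gives the size in bytes; tr strips padding some platforms add.
    size=$(wc -c <"$f" | tr -d ' ')
    echo "ok: $f ($size bytes)"
}
```

Run it as check_output followed by the path to 1.ubam in the work dir. The same work dir should also hold Nextflow's hidden .command.err, .command.out and .exitcode files, which are worth reading alongside this check.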

onkeld commented 1 year ago

I have in the meantime tried several other things. When manually converting my .fast5 files to .pod5 format and providing those as input via the "fast5 dir" in the Epi2Me Labs desktop app (slightly confusing to have a "fast5 dir" also accepting a directory with pod5, but whatever...), I still get the same results. The directory you mentioned does not contain any .ubam files for the current run. I only find symlinks to all my .pod5 inputs and symlinks to "dorado_model" and "remora_model", which both link to the same location: /mnt/c/Users/daz87im/epi2melabs/workflows/epi2me-labs/wf-human-variation/data/OPTIONAL_FILE

The same is true when providing .fast5 inputs. Just finding .fast5 and dorado_model and remora_model symlinks in the directory with the model symlinks pointing to the same file named "OPTIONAL_FILE".
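For reference, this is roughly how the symlink targets can be listed. It is a small sketch, assuming a readlink command is available in the WSL shell; list_links is a hypothetical helper name.

```shell
# Sketch: print every symlink in a directory together with its target.
# list_links is a hypothetical helper; hidden files are not listed.
list_links() {
    dir=$1
    for p in "$dir"/*; do
        [ -L "$p" ] || continue   # skip anything that is not a symlink
        printf '%s -> %s\n' "$(basename "$p")" "$(readlink "$p")"
    done
}
```

Pointing list_links at the task work dir prints one "name -> target" line per staged input.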

onkeld commented 1 year ago

In my endeavours to further debug this issue, I tried to run the workflow with Demo Data from the Windows Client. This also fails. When starting the workflow with demo data, I immediately detect increased network traffic. I expect the workflow to be downloading the demo dataset from the internet. Shortly after, the workflow stops with an error "untar failed" pointing to the demo data directory.

I'm not really surprised that untar is failing here, as this directory is completely empty - there's nothing to untar.

I'm not sure what is happening here - either the download of the demo data does not work or the data gets deleted immediately after when the untar command fails.

My WSL-Ubuntu does provide the current version of the tar and gz packages, and also my Windows should be able to handle .tar.gz files - any suggestions as to what is happening there?

Is there a way to manually download the demo data for this workflow?

SamStudio8 commented 1 year ago

Hi @onkeld, sorry to hear that you've had a problem with downloading the demo data through our EPI2ME Labs client. I've raised this internally. In the meantime, you can download and untar the demo data as per the instructions in the quickstart section of the README. Please let me know how you get on running the demo data!

SamStudio8 commented 1 year ago

Ah @onkeld, the demo data for wf-human-variation does not require basecalling, you may wish to try the wf-basecalling workflow and its demo data to diagnose this. The demo data for wf-basecalling can be downloaded from https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-basecalling/wf-basecalling-demo.tar.gz.
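If you fetch the archive by hand, it may be worth checking that it is complete before extracting it, since a truncated download would also make untar fail. A minimal sketch follows; extract_checked is a hypothetical helper, and it assumes a tar with gzip support is on the PATH.

```shell
# Sketch: verify a .tar.gz is non-empty and readable before extracting it.
# extract_checked is a hypothetical helper, not part of EPI2ME Labs.
extract_checked() {
    archive=$1
    dest=$2
    if [ ! -s "$archive" ]; then
        echo "archive missing or empty: $archive" >&2
        return 1
    fi
    # Listing the contents fails if the file is truncated or not a gzip tar.
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "archive is not a readable .tar.gz: $archive" >&2
        return 1
    fi
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}
```

After downloading wf-basecalling-demo.tar.gz from the URL above (for example with wget, if it is installed), running extract_checked wf-basecalling-demo.tar.gz . would unpack it into the current directory.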

onkeld commented 1 year ago

There must be something going on with my installation in general... When trying to launch any workflow using the Epi2Me Labs Windows Launcher with the "Use Demo Data" button, I get the same problem: After a sufficient time of high network activity, the workflow stops with the "untar failed" error message. The mentioned folder on my Windows Drive C is empty in all cases.

When I manually untar a set of demo data and feed it to the workflow through the Windows Launcher using the normal "Run this Workflow" option, I end up with the same error message as before:

N E X T F L O W  ~  version 22.04.5
Launching `/mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/workflows/epi2me-labs/wf-basecalling/main.nf` [frosty_shaw] DSL2 - revision: 823773aa8f
||||||||||   _____ ____ ___ ____  __  __ _____      _       _
||||||||||  | ____|  _ \_ _|___ \|  \/  | ____|    | | __ _| |__  ___
|||||       |  _| | |_) | |  __) | |\/| |  _| _____| |/ _` | '_ \/ __|
|||||       | |___|  __/| | / __/| |  | | |__|_____| | (_| | |_) \__ \
||||||||||  |_____|_|  |___|_____|_|  |_|_____|    |_|\__,_|_.__/|___/
||||||||||  wf-basecalling v0.2.0
--------------------------------------------------------------------------------
Core Nextflow options
  runName        : frosty_shaw
  containerEngine: docker
  launchDir      : /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/instances/wf-basecalling_266cf119-0d01-401e-8572-d5b3ad2fc34b
  workDir        : /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/instances/wf-basecalling_266cf119-0d01-401e-8572-d5b3ad2fc34b/work
  projectDir     : /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/workflows/epi2me-labs/wf-basecalling
  userName       : daniel
  profile        : standard
  configFiles    : /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/workflows/epi2me-labs/wf-basecalling/nextflow.config
Input Options
  input          : /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/demo/epi2me-labs/wf-basecalling/wf-basecalling-demo/input
  ref            : /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/demo/epi2me-labs/wf-basecalling/wf-basecalling-demo/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
Output Options
  out_dir        : /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/instances/wf-basecalling_266cf119-0d01-401e-8572-d5b3ad2fc34b/output
Basecalling options
  basecaller_cfg : dna_r10.4.1_e8.2_400bps_sup@v4.0.0
  dorado_ext     : pod5
!! Only displaying parameters that differ from the pipeline defaults !!
--------------------------------------------------------------------------------
If you use epi2me-labs/wf-basecalling for your analysis please cite:
* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x
--------------------------------------------------------------------------------
This is epi2me-labs/wf-basecalling v0.2.0.
--------------------------------------------------------------------------------
[78/cba9d9] Submitted process > getVersions
[4b/fded8a] Submitted process > wf_dorado:make_mmi
[31/f4790d] Submitted process > getParams
[17/c9974e] Submitted process > wf_dorado:dorado (1)
Error executing process > 'wf_dorado:dorado (1)'
Caused by:
  Process `wf_dorado:dorado (1)` terminated with an error exit status (1)
Command executed:
  echo '***'
  echo 'Available models:'
  list-models | sed 's,^,- ,' | sed "s,${DRD_MODELS_PATH}/,,"
  echo '***'
  echo 'You selected:'
  echo "Basecalling model: dna_r10.4.1_e8.2_400bps_sup@v4.0.0"
  echo "Remora model     : null"
  echo '***'
  echo 'A file open error below indicates that you have entered an unknown model name.'
  echo 'It is possible the model you selected worked previously but has been updated to a new version.'
  echo 'Resubmit this workflow with an appropriate model from the model list above.'
  echo '***'

  dorado basecaller         ${DRD_MODELS_PATH}/dna_r10.4.1_e8.2_400bps_sup@v4.0.0 .                           --device cuda:all | samtools view -b -o 0.ubam -
Command exit status:
  1
Command output:
  ***
  Available models:
  - dna_r10.4.1_e8.2_260bps_fast@v4.0.0
  - dna_r10.4.1_e8.2_260bps_hac@v4.0.0
  - dna_r10.4.1_e8.2_260bps_sup@v4.0.0
  - dna_r10.4.1_e8.2_400bps_fast@v4.0.0
  - dna_r10.4.1_e8.2_400bps_fast@v4.0.0_5mCG_5hmCG@v2
  - dna_r10.4.1_e8.2_400bps_hac@v4.0.0
  - dna_r10.4.1_e8.2_400bps_hac@v4.0.0_5mCG_5hmCG@v2
  - dna_r10.4.1_e8.2_400bps_sup@v4.0.0
  - dna_r10.4.1_e8.2_400bps_sup@v4.0.0_5mCG_5hmCG@v2
  - dna_r10.4.2_e8.2_4khz_stereo@v1.0
  - dna_r9.4.1_e8_fast@v3.4
  - dna_r9.4.1_e8_fast@v3.4_5mCG@v0
  - dna_r9.4.1_e8_hac@v3.3
  - dna_r9.4.1_e8_hac@v3.4_5mCG@v0
  - dna_r9.4.1_e8_sup@v3.3
  - dna_r9.4.1_e8_sup@v3.4_5mCG@v0
  - rna003_120bps_sup@v3
  ***
  You selected:
  Basecalling model: dna_r10.4.1_e8.2_400bps_sup@v4.0.0
  Remora model     : null
  ***
  A file open error below indicates that you have entered an unknown model name.
  It is possible the model you selected worked previously but has been updated to a new version.
  Resubmit this workflow with an appropriate model from the model list above.
  ***
Command error:
  [2023-01-24 13:17:11.536] [info] > Creating basecall pipeline
  [main_samview] fail to read the header from "-".
Work dir:
  /mnt/wsl/docker-desktop-bind-mounts/Ubuntu/7efa7bbc4df9a6c738411768d750f13aa3b26ee05b5799ad35110a3f2dc015b9/epi2melabs/instances/wf-basecalling_266cf119-0d01-401e-8572-d5b3ad2fc34b/work/17/c9974eb163cc92375c0ffc08d09bf3
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
WARN: Killing running tasks (1)

This is with using the manually extracted Demo Data.

Could this be a problem with my CUDA installation maybe? I know I did install CUDA drivers some time ago, but I never tested using them from a Docker image.

onkeld commented 1 year ago

Apparently, CUDA is not the problem - I can run a CUDA benchmark Docker container from within WSL without problems. Same results when trying the container from Docker Desktop in Windows.

The Problem must be somewhere between Epi2Me Labs Desktop Client, the workflows and the nextflow installation inside the docker containers in my opinion.

SamStudio8 commented 1 year ago

Thanks for all that, @onkeld. As you've tried this on the demo data, this at least shows it's not an input data/dorado problem. I'll try to reproduce this problem internally.

SamStudio8 commented 1 year ago

@onkeld Can you describe the hardware you are using to run these basecalling workflows, in particular what GPU are you using?

onkeld commented 1 year ago

@SamStudio8 I'm running a pretty beefy desktop PC with an Nvidia GeForce GTX 1050 Ti graphics card right now. CUDA drivers are installed in version 11.8. Should that be the problem, I'll have to wait for our high performance cluster to come out of its maintenance cycle...

Thanks for your help so far.

neurogenetics1 commented 1 year ago

I am getting exactly the same error. I also downloaded the model and set the argument --basecaller_model_path to the directory I downloaded the model to, but the error persists. Please let me know if you find a solution.

--->my command:

nextflow run -profile biowulf epi2me-labs/wf-basecalling --basecaller_cfg "dna_r10.4.1_e8.2_400bps_hac@v4.0.0" --input /nanopore/AZ140021/20220307_1824_2D_PAI72310_c23d05c2/fast5_pass/ --ref /nanopore/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa --dorado_ext fast5 --out_dir /nanopore/AZ140021/20220307_1824_2D_PAI72310_c23d05c2/output2023 --basecaller_basemod_threads 4

--->this is the error message I had:

Command output:

Available models:

dna_r10.4.1_e8.2_260bps_fast@v4.0.0
dna_r10.4.1_e8.2_260bps_hac@v4.0.0
dna_r10.4.1_e8.2_260bps_sup@v4.0.0
dna_r10.4.1_e8.2_400bps_fast@v4.0.0
dna_r10.4.1_e8.2_400bps_fast@v4.0.0_5mCG_5hmCG@v2
dna_r10.4.1_e8.2_400bps_hac@v4.0.0
dna_r10.4.1_e8.2_400bps_hac@v4.0.0_5mCG_5hmCG@v2
dna_r10.4.1_e8.2_400bps_sup@v4.0.0
dna_r10.4.1_e8.2_400bps_sup@v4.0.0_5mCG_5hmCG@v2
dna_r10.4.2_e8.2_4khz_stereo@v1.0
dna_r9.4.1_e8_fast@v3.4
dna_r9.4.1_e8_fast@v3.4_5mCG@v0
dna_r9.4.1_e8_hac@v3.3
dna_r9.4.1_e8_hac@v3.4_5mCG@v0
dna_r9.4.1_e8_sup@v3.3
dna_r9.4.1_e8_sup@v3.4_5mCG@v0
rna003_120bps_sup@v3

You selected:
Basecalling model: dna_r10.4.1_e8.2_400bps_sup@v4.0.0
Remora model : dna_r10.4.1_e8.2_400bps_sup@v4.0.0_5mCG_5hmCG@v2

A file open error below indicates that you have entered an unknown model name.
It is possible the model you selected worked previously but has been updated to a new version.
Resubmit this workflow with an appropriate model from the model list above.

onkeld commented 1 year ago

OK, I just found a potential problem: according to the dorado repository, my graphics card is below the minimum specifications of dorado. When trying to run dorado standalone on my machine, I get the error message "no kernel image is available for execution on the device".

This could explain why dorado is not able to process my data. It doesn't explain the cryptic error message from the nextflow workflow though...
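For anyone else hitting this, here is a rough sketch of the check. cc_at_least is a hypothetical helper that compares two CUDA compute capabilities. The GTX 1050 Ti reports 6.1; the 7.0 threshold in the usage note is only a placeholder assumption, so take the real minimum from the dorado README.

```shell
# Sketch: succeed if compute capability HAVE is at least NEED.
# cc_at_least is a hypothetical helper; both arguments are "major.minor".
cc_at_least() {
    have=$1
    need=$2
    # Compare major first, then minor, as numbers rather than strings.
    awk -v h="$have" -v n="$need" 'BEGIN {
        split(h, a, "."); split(n, b, ".")
        if (a[1] + 0 > b[1] + 0) exit 0
        if (a[1] + 0 < b[1] + 0) exit 1
        exit !(a[2] + 0 >= b[2] + 0)
    }'
}
```

With a recent Nvidia driver, the capability can usually be read with nvidia-smi --query-gpu=compute_cap --format=csv,noheader and passed in as the first argument, e.g. cc_at_least 6.1 7.0, which fails for my card under that placeholder threshold.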

SamStudio8 commented 1 year ago

Hi Daniel, sorry, I thought I'd replied to you yesterday about this! Indeed, your GPU predates the minimum architecture supported by Dorado. I had thought that Dorado printed an error to stderr for this case, but perhaps it writes to stdout, causing the unhelpful samtools error.

The additional error text from the workflow about the model selection is baked in to try to help users understand a specific type of error case, but it seems to just be causing confusion, which I will address in a future update.

Sam

SamStudio8 commented 1 year ago

@neurogenetics1 Please can you open an issue on wf-basecalling with a full error report? There should be some additional information from the stderr that will help diagnose your problem.

onkeld commented 1 year ago

Well, that explains a lot... Thanks for your help!

I consider this ticket closed for now and will try running stuff on our high performance cluster once the maintenance window is over. Should the error still appear there, I'll reopen, as the GPU machines in the cluster are definitely above specs...

Thanks again!

SamStudio8 commented 1 year ago

@onkeld Hi Daniel, on your untar issue: we recently released EPI2ME Labs v4.1.0 that should resolve (or at least improve diagnosis of) this problem. The update can be downloaded via https://labs.epi2me.io/downloads/.

onkeld commented 1 year ago

@SamStudio8 the untar issue is resolved in the latest release of Epi2Me Labs and the workflow. Thanks again!

epi2me-labs / wf-human-variation