epi2me-labs / wf-human-variation


Could not select device driver with capabilities gpu #22

Closed · szakallasn3 closed this issue 1 year ago

szakallasn3 commented 1 year ago

What happened?

Hello everyone!

I started EPI2ME Labs' human variation workflow for the first time and ran into problems with Dorado basecalling and Clair3. The error messages were the following:

Error executing process > 'basecalling:wf_dorado:dorado (84)'
Process 'basecalling:wf_dorado:dorado (84)' terminated with an error exit status (125)

and

Error executing process > 'lookup_clair3_model (1)'

This is a little confusing to me, because I followed the recommended steps from https://github.com/epi2me-labs/wf-human-variation and checked and used the available models for the Dorado basecaller and Clair3. The workflow was started in a Nextflow environment.

The terminal command line was: nextflow run epi2me-labs/wf-human-variation -r v1.0.1 -w clair3 -profile standard --snp --sv --methyl --fast_dir 'path_to_fast_dir' --basecaller_cfg 'dna_r10.4.1_e8.2_400bps_sup@v3.5.2' --remora_cfg 'r1041_e82_400bps_sup@g632' --ref 'path_to_reference_genome' --out_dir 'path_to_out_dir'

Some screenshots are attached to this issue; I hope they also help in solving this.

[Screenshot from 2023-02-23 10-10-25] [Screenshot from 2023-02-23 10-10-43]

If anyone has faced this or a similar problem before and has a solution or any idea about it, please let me know. Many thanks for your help!

Operating System

Ubuntu 18.04

Workflow Execution

Command line

Workflow Execution - EPI2ME Labs Versions

No response

Workflow Execution - CLI Execution Profile

None

Workflow Version

v1.0.1

Relevant log output

Error executing process > 'basecalling:wf_dorado:dorado (84)'

Caused by:
  Process `basecalling:wf_dorado:dorado (84)` terminated with an error exit status (125)

AND

Error executing process > 'lookup_clair3_model (1)'

Caused by:
  Process `lookup_clair3_model (1)` terminated with an error exit status (65)

SamStudio8 commented 1 year ago

Hi @szakallasn3, what GPU does the device you are running the workflow on have? The error seems to imply that Docker is not able to schedule the basecalling tasks to a GPU.
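
A quick way to check whether Docker can schedule a GPU at all is to run a throwaway CUDA container; this is a minimal sketch, and the CUDA image tag is only an example:

  docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

If this fails with "could not select device driver with capabilities: [[gpu]]", Docker cannot reach the GPU, and the workflow's basecalling tasks will fail the same way.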

szakallasn3 commented 1 year ago

Thanks for your quick reply and for the tip. I have now specified the GPU with the --cuda_device option, as I did previously during Guppy basecalling; however, the following error remains:

Error executing process > 'lookup_clair3_model (1)'

Caused by:
  Process `lookup_clair3_model (1)` terminated with an error exit status (65)

Command executed:

  clair3_model=$(resolve_clair3_model.py lookup_table 'dna_r10.4.1_e8.2_400bps_hac@v4.0.0')
  cp -r ${CLAIR_MODELS_PATH}/${clair3_model} model

Command exit status: 65

Command output: (empty)

Command error:

[CRITICAL ERROR] Unknown basecaller configuration.

The input basecaller configuration 'dna_r10.4.1_e8.2_400bps_hac@v4.0.0' does not have a suitable Clair3 model because the basecaller configuration has not been recognised.

Check your --basecaller_cfg has been provided correctly.

I checked, and the --basecaller_cfg is provided correctly.

Do you have any suggestions?

SamStudio8 commented 1 year ago

@szakallasn3 Please update to wf-human-variation v1.1.0 where that model was added to the Clair3 lookup. If you're using Nextflow to manage your workflows you can update with nextflow pull epi2me-labs/wf-human-variation.
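
For example, assuming Nextflow is on your PATH (the second line is a placeholder showing where your existing options go):

  nextflow pull epi2me-labs/wf-human-variation
  nextflow run epi2me-labs/wf-human-variation -r v1.1.0 [your existing options]

Using -r pins the run to the updated release, so the new Clair3 lookup table is picked up.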

szakallasn3 commented 1 year ago

I made the update; however, the error messages remained: the basecaller and Clair3 model problems shown in the screenshots.

SamStudio8 commented 1 year ago

I made the update; however, the error messages remained: the basecaller and Clair3 model problems shown in the screenshots.

Hi @szakallasn3, would you mind sharing the latest stdout (the one with the big EPI2ME Labs logo), just so I can confirm that the right version is loaded and check your parameters?

szakallasn3 commented 1 year ago

Sure:

[Screenshot from 2023-02-24 07-42-20]

[Screenshot from 2023-02-24 07-41-22]

I also attached the error message.

Thanks for your help!

SamStudio8 commented 1 year ago

@szakallasn3 This error still indicates that Docker is not able to run containers with a GPU. Your device is likely missing the nvidia-container-toolkit. Please follow the instructions here to install the nvidia-container-toolkit; you will need to follow the steps both to install the toolkit and to configure Docker to use it (a sketch is given below).

Once you have followed those steps in the linked documentation, you should be able to run this workflow.
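
For reference, a minimal sketch of those steps on Ubuntu with apt (NVIDIA's package repository setup is omitted here; follow the linked documentation for the authoritative commands):

  # Install the toolkit (assumes NVIDIA's apt repository is already configured).
  sudo apt-get update
  sudo apt-get install -y nvidia-container-toolkit

  # Register the NVIDIA runtime with Docker, then restart the daemon.
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker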

szakallasn3 commented 1 year ago

Many thanks for your help. I did what you suggested; however, I'm now facing the following problem:

CUDA out of memory. Tried to allocate 20.00 MiB (GPU 1; 31.75 GiB total capacity; 0 bytes already allocated; 6.62 MiB free; 0 bytes reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

and I have enough memory. As suggested in the error message, I changed max_split_size_mb to avoid fragmentation, but after running the wf-human-variation command again I got the same error message. Do you have any suggestions? I have read several GitHub issues and Stack Overflow posts where this problem was discussed, but unfortunately none of the tips worked for me.
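
For reference, max_split_size_mb is usually set through the PYTORCH_CUDA_ALLOC_CONF environment variable before launching the workflow; the value below is only an example:

  export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128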

SamStudio8 commented 1 year ago

@szakallasn3 Glad that you can get Docker with GPU support started now! Your new error implies that your GPU's memory is being used up by something else. You can use the nvidia-smi command to check what tasks are running on your GPU and how much free memory it has.
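
For example (both commands are standard nvidia-smi usage):

  nvidia-smi
  # or a focused view of per-GPU memory:
  nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv

The log above reported only 6.62 MiB free on GPU 1 out of 31.75 GiB, which suggests another process is occupying that GPU.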

SamStudio8 commented 1 year ago

I'm closing this old issue, but please re-open it if you are still running into trouble.