epi2me-labs / wf-basecalling


CUDA device requested but no devices found. #12

Closed: SamStudio8 closed this issue 1 year ago

SamStudio8 commented 1 year ago

Ask away!

Hello, I am encountering the same issue. I am trying to use dorado through epi2me and Docker. Other epi2me workflows work for me, as does basecalling through guppy with CUDA from the command line. I am on a ThinkPad T580 running Ubuntu 20.04. Below are parts of the epi2me log file and a screenshot of Docker invoking nvidia-smi. Any help is greatly appreciated. In any case, the workflow would benefit from a command to test and troubleshoot the setup. Thanks a lot!

This is epi2me-labs/wf-basecalling v0.7.2.

[34/d9c940] Submitted process > getParams
[d8/26b1e8] Submitted process > getVersions
[ea/7aa6db] Submitted process > wf_dorado:dorado (3)
[ec/d81b14] Submitted process > wf_dorado:dorado (1)
ERROR ~ Error executing process > 'wf_dorado:dorado (3)'

Caused by:
  Process wf_dorado:dorado (3) terminated with an error exit status (1)

Command executed:
  set +e
  source /opt/nvidia/entrypoint.d/*-gpu-driver-check.sh # runtime driver check msg
  set -e
  dorado basecaller ${DRD_MODELS_PATH}/dna_r9.4.1_e8_hac@v3.3 . --device cuda:all | samtools view --no-PG -b -o 2.ubam -

Command exit status:
  1

Command output:
  (empty)

Command error:
  [2023-07-08 14:43:32.251] [info] > Creating basecall pipeline
  [W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (function operator())
  [2023-07-08 14:43:32.258] [error] CUDA device requested but no devices found.
  [main_samview] fail to read the header from "-".

Work dir:
  /home/pbgl/epi2melabs/instances/wf-basecalling_bc8d8970-d522-48c1-83f4-507f9f3ccab1/work/ea/7aa6dbaac4484b161e5b896a7de3f5

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

 -- Check '/home/pbgl/epi2melabs/instances/wf-basecalling_bc8d8970-d522-48c1-83f4-507f9f3ccab1/nextflow.log' file for details
WARN: Killing running tasks (1)

Screenshot from 2023-07-08 16-53-02

Originally posted by @warthmann in https://github.com/nanoporetech/dorado/issues/251#issuecomment-1627371346

SamStudio8 commented 1 year ago

@warthmann I suspect your CUDA version (11.6) is too old. Does your nvidia-smi example work for nvidia/cuda:11.8.0-base-ubuntu20.04?
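The container test suggested above can be scripted. The following is a minimal sketch, assuming Docker with NVIDIA GPU support is installed on the host; the function names `build_gpu_test_command` and `run_gpu_test` are hypothetical helpers and not part of the workflow.

```python
import shlex
import subprocess

# Image tag taken from the comment above; change it to test another CUDA version.
DEFAULT_IMAGE = "nvidia/cuda:11.8.0-base-ubuntu20.04"


def build_gpu_test_command(image: str = DEFAULT_IMAGE) -> list:
    """Return the argv that runs nvidia-smi inside a GPU-enabled container."""
    return ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"]


def run_gpu_test(image: str = DEFAULT_IMAGE) -> bool:
    """Run the test container and report whether nvidia-smi exited with status 0."""
    cmd = build_gpu_test_command(image)
    print("Running:", shlex.join(cmd))
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # Raised when the docker executable is not on PATH.
        print("docker executable not found on PATH")
        return False
    print(result.stdout or result.stderr)
    return result.returncode == 0


if __name__ == "__main__":
    raise SystemExit(0 if run_gpu_test() else 1)
```

A zero exit status only shows that Docker can reach the GPU and driver from inside that image; it does not by itself show that the GPU is supported by Dorado.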

warthmann commented 1 year ago

I changed the command to cuda:11.8.0 and got the attached output. (Note: on my system I am working with CUDA 11.4.)

Screenshot from 2023-07-10 14-23-09

warthmann commented 1 year ago

Trying to run the workflow again resulted in the same error:

Command error:
  [2023-07-10 12:35:50.042] [info] > Creating basecall pipeline
  [W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (function operator())
  [2023-07-10 12:35:50.073] [error] CUDA device requested but no devices found.
  [main_samview] fail to read the header from "-".

SamStudio8 commented 1 year ago

Just wondering, does this also work without root? It is possible that root and pbgl are speaking to different Docker engines.

SamStudio8 commented 1 year ago

Sorry for the runaround. I just realised that if the workflow gets as far as actually attempting to run Dorado, then the workflow and Docker are happy with your GPU setup. What GPU are you trying to run this workflow on? I ask because it might not be compatible with Dorado: a GPU with Volta-generation architecture (or newer) is required.
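One way to check the GPU generation is through its CUDA compute capability, since Volta corresponds to compute capability 7.0. The sketch below assumes an nvidia-smi recent enough to support the `compute_cap` query field (older drivers may reject it); `parse_gpu_list`, `is_volta_or_newer` and `query_gpus` are hypothetical names, and the 7.0 threshold is derived from the Volta requirement stated in the comment above rather than from Dorado's own documentation.

```python
import subprocess

# Volta corresponds to CUDA compute capability 7.0 (assumed threshold, see lead-in).
MIN_COMPUTE_CAP = (7, 0)


def parse_gpu_list(csv_text):
    """Parse 'name, compute_cap' CSV lines into (name, (major, minor)) tuples."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so that a comma in the GPU name is tolerated.
        name, _, cap = line.rpartition(",")
        major, _, minor = cap.strip().partition(".")
        gpus.append((name.strip(), (int(major), int(minor or 0))))
    return gpus


def is_volta_or_newer(cap):
    """True if the (major, minor) compute capability is at least 7.0."""
    return cap >= MIN_COMPUTE_CAP


def query_gpus():
    """Ask nvidia-smi for the name and compute capability of every GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_list(out)


if __name__ == "__main__":
    for name, cap in query_gpus():
        verdict = "Volta or newer" if is_volta_or_newer(cap) else "older than Volta"
        print(f"{name}: compute capability {cap[0]}.{cap[1]} ({verdict})")
```

For example, a Pascal-generation card reporting compute capability 6.1 would be flagged as older than Volta.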

warthmann commented 1 year ago

No Volta: NVIDIA GeForce MX150

SamStudio8 commented 1 year ago

Ah! Indeed, I'm afraid that GPU won't be compatible with Dorado. I'm going to close this issue but will raise a ticket internally to see if we can have the workflow check the generation of the card to provide a better error than the one that you got.
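A clearer error could, for instance, be produced by scanning the captured stderr for the messages seen in this thread and attaching a hint to each. This is only a sketch of the idea, not how the workflow implements it: `KNOWN_PATTERNS` and `explain_dorado_stderr` are hypothetical, and the hint wording reflects the diagnosis reached in this thread plus one reading of the CUDA error text.

```python
import re

# (pattern, hint) pairs. The hints are illustrative assumptions, not official guidance.
KNOWN_PATTERNS = [
    (
        re.compile(r"Error 804: forward compatibility was attempted on non supported HW"),
        "CUDA forward compatibility is not available on this GPU; the host driver "
        "may be older than the CUDA version used inside the container.",
    ),
    (
        re.compile(r"CUDA device requested but no devices found"),
        "Dorado could not use any GPU. Check that the GPU is of the Volta "
        "generation or newer.",
    ),
]


def explain_dorado_stderr(stderr_text):
    """Return a hint for every known error pattern found in the stderr text."""
    return [hint for pattern, hint in KNOWN_PATTERNS if pattern.search(stderr_text)]


if __name__ == "__main__":
    sample = (
        "[W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error "
        "from cudaGetDeviceCount(). Error 804: forward compatibility was attempted "
        "on non supported HW (function operator())\n"
        "[error] CUDA device requested but no devices found.\n"
    )
    for hint in explain_dorado_stderr(sample):
        print("Hint:", hint)
```

Matching on log text is brittle if the messages change between releases, so a check of the card's generation before launching Dorado would likely be the more robust option.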