epi2me-labs / wf-human-variation

[Bug]: Could not load symbol cublasGetSmCountTarget from libcublas.so.11. Error: /home/epi2melabs/dorado/lib/libcublas.so.11: undefined symbol: cublasGetSmCountTarget #37

Closed rainwala closed 1 year ago

rainwala commented 1 year ago

What happened?

I was running wf-human-variation on Ubuntu 20.04 with 4 A100s, using what I think is the Docker profile. I ran the following command, and it crashed:

sudo nextflow run epi2me-labs/wf-human-variation -w workspace -profile standard --snp --sv --methyl --fast5_dir /fast5 --basecaller_cfg 'dna_r10.4.1_e8.2_400bps_hac@v4.0.0' --remora_cfg 'dna_r10.4.1_e8.2_400bps_hac@v4.0.0_5mCG_5hmCG@v2' --ref /Homo_sapiens.GRCh38.106_no_alt_chrs.fa --out_dir patient31_drug/ --threads 16 --cuda_device="cuda:0,1,2,3"

Operating System

ubuntu 20.04

Workflow Execution

Command line

Workflow Execution - EPI2ME Labs Versions

No response

Workflow Execution - CLI Execution Profile

Docker

Workflow Version

1.4.0

Relevant log output

ERROR ~ Error executing process > 'basecalling:wf_dorado:dorado (128)'

Caused by:
  Process `basecalling:wf_dorado:dorado (128)` terminated with an error exit status (1)

Command executed:

  echo '***'
  echo 'Available models:'
  list-models | sed 's,^,- ,' | sed "s,${DRD_MODELS_PATH}/,,"
  echo '***'
  echo 'You selected:'
  echo "Basecalling model: dna_r10.4.1_e8.2_400bps_hac@v4.0.0"
  echo "Remora model     : dna_r10.4.1_e8.2_400bps_hac@v4.0.0_5mCG_5hmCG@v2"
  echo '***'
  echo 'A file open error below indicates that you have entered an unknown model name.'
  echo 'It is possible the model you selected worked previously but has been updated to a new version.'
  echo 'Resubmit this workflow with an appropriate model from the model list above.'
  echo '***'

  dorado basecaller         ${DRD_MODELS_PATH}/dna_r10.4.1_e8.2_400bps_hac@v4.0.0 .         --modified-bases-models ${DRD_MODELS_PATH}/dna_r10.4.1_e8.2_400bps_hac@v4.0.0_5mCG_5hmCG@v2                  --device cuda:0,1,2,3 | samtools view -b -o 127.ubam
 -

Command exit status:
  1

Command output:
  ***
  Available models:
  - dna_r10.4.1_e8.2_260bps_fast@v4.0.0
  - dna_r10.4.1_e8.2_260bps_hac@v4.0.0
  - dna_r10.4.1_e8.2_260bps_sup@v4.0.0
  - dna_r10.4.1_e8.2_400bps_fast@v4.0.0
  - dna_r10.4.1_e8.2_400bps_fast@v4.0.0_5mCG_5hmCG@v2
  - dna_r10.4.1_e8.2_400bps_hac@v4.0.0
  - dna_r10.4.1_e8.2_400bps_hac@v4.0.0_5mCG_5hmCG@v2
  - dna_r10.4.1_e8.2_400bps_sup@v4.0.0
  - dna_r10.4.1_e8.2_400bps_sup@v4.0.0_5mCG_5hmCG@v2
  - dna_r10.4.2_e8.2_4khz_stereo@v1.0
  - dna_r9.4.1_e8_fast@v3.4
  - dna_r9.4.1_e8_fast@v3.4_5mCG@v0
  - dna_r9.4.1_e8_hac@v3.3
  - dna_r9.4.1_e8_hac@v3.4_5mCG@v0
  - dna_r9.4.1_e8_sup@v3.3
  - dna_r9.4.1_e8_sup@v3.4_5mCG@v0
  - rna003_120bps_sup@v3
  ***
  You selected:
  Basecalling model: dna_r10.4.1_e8.2_400bps_hac@v4.0.0
  Remora model     : dna_r10.4.1_e8.2_400bps_hac@v4.0.0_5mCG_5hmCG@v2
  ***
  A file open error below indicates that you have entered an unknown model name.
  It is possible the model you selected worked previously but has been updated to a new version.
  Resubmit this workflow with an appropriate model from the model list above.
  ***

Command error:
  - dna_r9.4.1_e8_sup@v3.4_5mCG@v0
  - rna003_120bps_sup@v3
  ***
  You selected:
  Basecalling model: dna_r10.4.1_e8.2_400bps_hac@v4.0.0
  Remora model     : dna_r10.4.1_e8.2_400bps_hac@v4.0.0_5mCG_5hmCG@v2
  ***
  A file open error below indicates that you have entered an unknown model name.
  It is possible the model you selected worked previously but has been updated to a new version.
  Resubmit this workflow with an appropriate model from the model list above.
  ***
  [2023-04-29 21:11:47.528] [info] > Creating basecall pipeline
  Could not load symbol cublasGetSmCountTarget from libcublas.so.11. Error: /home/epi2melabs/dorado/lib/libcublas.so.11: undefined symbol: cublasGetSmCountTarget
  terminate called after throwing an instance of 'c10::CuDNNError'
    what():  cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
  Exception raised from _cudnn_rnn at ../aten/src/ATen/native/cudnn/RNN.cpp:1076 (most recent call first):
  frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f63c60ff20e in /home/epi2melabs/dorado/lib/libc10.so)
  frame #1: <unknown function> + 0xb006c (0x7f63e034206c in /home/epi2melabs/dorado/lib/libtorch_cuda_cpp.so)
  frame #2: <unknown function> + 0x2d8c217 (0x7f6398d16217 in /home/epi2melabs/dorado/lib/libtorch_cuda_cu.so)
  frame #3: <unknown function> + 0x2d8c2c3 (0x7f6398d162c3 in /home/epi2melabs/dorado/lib/libtorch_cuda_cu.so)
  frame #4: at::_ops::_cudnn_rnn::call(at::Tensor const&, c10::ArrayRef<at::Tensor>, long, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&) + 0x29f (0x7f63c8233acf in /home/epi2melabs/dorado/lib/libtorch_cpu.so)
  frame #5: <unknown function> + 0x15181a (0x7f63e03e381a in /home/epi2melabs/dorado/lib/libtorch_cuda_cpp.so)
  frame #6: at::native::lstm(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::ArrayRef<at::Tensor>, bool, long, double, bool, bool, bool) + 0x18a (0x7f63c7da495a in /home/epi2melabs/dorado/lib/libtorch_cpu.so)
  frame #7: <unknown function> + 0x273ed6d (0x7f63c897fd6d in /home/epi2melabs/dorado/lib/libtorch_cpu.so)
  frame #8: at::_ops::lstm_input::call(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::ArrayRef<at::Tensor>, bool, long, double, bool, bool, bool) + 0x264 (0x7f63c83af994 in /home/epi2melabs/dorado/lib/libtorch_cpu.so)
  frame #9: torch::nn::LSTMImpl::forward_helper(at::Tensor const&, at::Tensor const&, at::Tensor const&, long, c10::optional<std::tuple<at::Tensor, at::Tensor> >) + 0x612 (0x7f63cab31032 in /home/epi2melabs/dorado/lib/libtorch_cpu.so)
  frame #10: torch::nn::LSTMImpl::forward(at::Tensor const&, c10::optional<std::tuple<at::Tensor, at::Tensor> >) + 0xbc (0x7f63cab311dc in /home/epi2melabs/dorado/lib/libtorch_cpu.so)
  frame #11: dorado() [0x593e6e]
  frame #12: dorado() [0x592b92]
  frame #13: dorado() [0x599d4a]
  frame #14: dorado() [0x599796]
  frame #15: dorado() [0x5994a1]
  frame #16: dorado() [0x59906f]
  frame #17: dorado() [0x594ffa]
  frame #18: dorado() [0x594428]
  frame #19: dorado() [0x58d66d]
  frame #20: dorado() [0x5b3609]
  frame #21: dorado() [0x5b3510]
  frame #22: dorado() [0x5bc1c3]
  frame #23: dorado() [0x5bc030]
  frame #24: dorado() [0x5bbf17]
  frame #25: dorado() [0x5bbdb8]
  frame #26: dorado() [0x5bbcfc]
  frame #27: <unknown function> + 0x145a0 (0x7f63ef27d5a0 in /home/epi2melabs/dorado/lib/libtorch_cuda.so)
  frame #28: <unknown function> + 0x8609 (0x7f63957fc609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  frame #29: clone + 0x43 (0x7f63953c9133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

  [E::sam_parse1] SEQ and QUAL are of different length
  [W::sam_read1_sam] Parse error at line 2353
  samtools view: error reading file "-"

Work dir:
  /home/prom/epi2me-labs_wf-human-variation_nextflow/workspace/ab/1193f70ffc090f1e5204741ea4967e

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
rainwala commented 1 year ago

I found something that could be related online: https://github.com/nanoporetech/dorado/issues/69
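
For anyone who wants to check whether the bundled library really lacks the symbol named in the error, here is a minimal sketch using Python's ctypes. The helper name `has_symbol` is my own and not part of dorado or the workflow. The path and symbol are copied from the log above. It assumes Python is available in the same environment as the library (for example, inside the basecaller container), which I have not verified.

```python
import ctypes
from typing import Optional


def has_symbol(lib_path: Optional[str], symbol: str) -> bool:
    """Return True if the shared library at lib_path exports `symbol`.

    Passing lib_path=None inspects the symbols already loaded into the
    current process (POSIX only). `has_symbol` is a hypothetical helper
    written for this thread, not an API of dorado or the workflow.
    """
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        # The library is missing or could not be loaded at all.
        return False
    # ctypes looks the name up on attribute access and raises
    # AttributeError when the symbol is absent, so hasattr() is enough.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Path and symbol taken verbatim from the error message in the log.
    path = "/home/epi2melabs/dorado/lib/libcublas.so.11"
    print(has_symbol(path, "cublasGetSmCountTarget"))
```

A result of False only says the symbol is not exported, or that the library could not be loaded from that path. It does not by itself show that this is what caused the cuDNN crash further down the log.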

rainwala commented 1 year ago

Update: I ran dorado v0.2.4 successfully on my data and my setup, with the --modified_bases flag on. Tellingly, the dorado versions page says this version was made specifically for: "https://github.com/nanoporetech/dorado/commit/92ef398874e9f4d09c7a57e5d979d4e704a12a74 - Fix out of bound access when modbase calling"

I think that to get around this issue in this workflow, it may be necessary to change the dorado version in the nextflow.config file, which currently has: basecaller_container = "dorado:sha097d9c8abc39b8266e3ee58f531f5ef8944a02c3"

SamStudio8 commented 1 year ago

@rainwala Thanks for confirming the newer Dorado fixes your issue. We are working to update the basecalling component of wf-human-variation to have compatibility with the downstream models used by the workflow.

rainwala commented 1 year ago

That sounds good. FYI the current wf-human-variation workflow works for me on my setup when using r9.4.1 basecalling models with Dorado.