CUDA error: unspecified launch failure

Hello!

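I am getting the CUDA error shown below. It is raised during global model evaluation at the start of round 20, after round 19 completes normally. Can you help me, please?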

Command to run:

python main.py -data AG_News -m lstm -algo FedAvg -gr 200 -did 0

Logs:

-------------Round number: 19-------------

Evaluate global model
Averaged Train Loss: 1.5298
Averaged Test Accurancy: 0.2496
Averaged Test AUC: 0.8336
Std Test Accurancy: 0.3488
Std Test AUC: 0.0983
------------------------- time cost -------------------------
101.69626379013062

-------------Round number: 20-------------

Evaluate global model
Traceback (most recent call last):
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/main.py", line 541, in <module>
    run(args)
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/main.py", line 373, in run
    server.train()
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/servers/serveravg.py", line 48, in train
    self.evaluate()
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/servers/serverbase.py", line 246, in evaluate
    stats_train = self.train_metrics()
                  ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/servers/serverbase.py", line 235, in train_metrics
    cl, ns = c.train_metrics()
             ^^^^^^^^^^^^^^^^^
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/clients/clientbase.py", line 153, in train_metrics
    output = self.model(x)
             ^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/fl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/trainmodel/models.py", line 34, in forward
    out = self.base(x)
          ^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/fl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/trainmodel/models.py", line 440, in forward
    out, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/fl/lib/python3.11/site-packages/torch/nn/utils/rnn.py", line 337, in pad_packed_sequence
    return padded_output.index_select(batch_dim, unsorted_indices), lengths[unsorted_indices.cpu()]
                                                                            ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions
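Since the error message suggests passing CUDA_LAUNCH_BLOCKING=1, here is a minimal sketch of how the same command could be rerun with that variable set, so that the traceback points at the kernel that actually failed. The helper name `blocking_env` is hypothetical and not part of PFLlib; the command string is the one from this report, and the sketch assumes it is launched from the directory that contains main.py.

```python
import os
import shlex

# The command from this report, kept as a single string.
COMMAND = "python main.py -data AG_News -m lstm -algo FedAvg -gr 200 -did 0"


def blocking_env(base=None):
    """Copy an environment mapping and request synchronous CUDA launches.

    Hypothetical helper for illustration; it does not modify `base`.
    """
    env = dict(os.environ if base is None else base)
    # Variable named in the PyTorch error message above.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


argv = shlex.split(COMMAND)  # argument list suitable for subprocess.run
env = blocking_env()

# To actually rerun the experiment (assumption: current directory holds main.py):
#   import subprocess
#   subprocess.run(argv, env=env)
```

With synchronous launches the run is slower, but the Python frame reported in the traceback should correspond to the failing call rather than to a later API call.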

I checked the torch version and it seems OK:

(fl) myuser@machine:/mnt/hdd_0/myuser/experiments/PFLlib$ more env_cuda_latest.yaml
name: fl
channels:
  - pytorch
  - nvidia
  - defaults
dependencies:
  - pip=22
  - pandas
  - scikit-learn
  - scipy
  - ujson
  - h5py
  - seaborn
  - matplotlib
  - click
  - pip:
    - torch==2.0.1
    - torchaudio
    - torchtext
    - torchvision
    - calmsize
    - memory-profiler
    - portalocker
    - cvxpy
    - higher
    - diffusers
    - accelerate
    - transformers

(fl) myuser@machine:/mnt/hdd_0/myuser/experiments/PFLlib$ pip list|grep torch
torch 2.0.1
torchaudio 2.0.2
torchdata 0.6.1
torchtext 0.15.2
torchvision 0.15.2
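The same version check can also be done from inside the interpreter that runs main.py, in case the shell's pip points at a different environment. This is only a sketch using the standard library; the function name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(prefix="torch"):
    """Return {name: version} for installed distributions whose name starts with `prefix`."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing metadata.
        if name and name.lower().startswith(prefix):
            found[name] = dist.version
    return dict(sorted(found.items()))


# Print one "name version" pair per line, similar to `pip list | grep torch`.
for name, version in installed_versions().items():
    print(name, version)
```

Running it with the python from the fl environment should list the same packages as the pip output above if both refer to the same installation.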

TsingZ0 / PFLlib

CUDA error: unspecified launch failure #196