TsingZ0 / PFLlib

37 traditional FL (tFL) or personalized FL (pFL) algorithms, 3 scenarios, and 20 datasets.
GNU General Public License v2.0

CUDA error: unspecified launch failure #196

Closed gestefane closed 1 month ago

gestefane commented 2 months ago

Hello!

I am running into the error below. Can you help me, please?

Command to run:

python main.py -data AG_News -m lstm -algo FedAvg -gr 200 -did 0

Logs:

-------------Round number: 19-------------

Evaluate global model
Averaged Train Loss: 1.5298
Averaged Test Accurancy: 0.2496
Averaged Test AUC: 0.8336
Std Test Accurancy: 0.3488
Std Test AUC: 0.0983
------------------------- time cost ------------------------- 101.69626379013062

-------------Round number: 20-------------

Evaluate global model
Traceback (most recent call last):
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/main.py", line 541, in <module>
    run(args)
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/main.py", line 373, in run
    server.train()
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/servers/serveravg.py", line 48, in train
    self.evaluate()
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/servers/serverbase.py", line 246, in evaluate
    stats_train = self.train_metrics()
                  ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/servers/serverbase.py", line 235, in train_metrics
    cl, ns = c.train_metrics()
             ^^^^^^^^^^^^^^^^^
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/clients/clientbase.py", line 153, in train_metrics
    output = self.model(x)
             ^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/fl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/trainmodel/models.py", line 34, in forward
    out = self.base(x)
          ^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/fl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/hdd_0/myuser/experiments/PFLlib/system/flcore/trainmodel/models.py", line 440, in forward
    out, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/fl/lib/python3.11/site-packages/torch/nn/utils/rnn.py", line 337, in pad_packed_sequence
    return padded_output.index_select(batch_dim, unsorted_indices), lengths[unsorted_indices.cpu()]
                                                                            ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions
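The error text recommends CUDA_LAUNCH_BLOCKING=1 because CUDA kernels are launched asynchronously, so the frame the traceback points at (pad_packed_sequence here) may not be the kernel that actually failed. Below is a minimal sketch of rerunning the same command with that variable set. The helper names blocking_env and rerun are hypothetical and not part of PFLlib, and the command list simply repeats the one from this report.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # With this set, a failing kernel raises at its own call site
    # instead of at some later, unrelated CUDA API call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(cmd, cwd=None):
    """Run cmd with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    return subprocess.run(cmd, env=blocking_env(), cwd=cwd, check=False).returncode


# Command from this report; run it from the directory that contains main.py.
COMMAND = [
    sys.executable, "main.py",
    "-data", "AG_News", "-m", "lstm",
    "-algo", "FedAvg", "-gr", "200", "-did", "0",
]

# Example call (commented out so importing this file has no side effects):
# exit_code = rerun(COMMAND)
```

Synchronous launches make the run slower, so this is only useful for locating the failing call, not for regular training.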

I checked the torch version and it seems OK:

(fl) myuser@machine:/mnt/hdd_0/myuser/experiments/PFLlib$ more env_cuda_latest.yaml
name: fl
channels:

(fl) myuser@machine:/mnt/hdd_0/myuser/experiments/PFLlib$ pip list|grep torch
torch 2.0.1
torchaudio 2.0.2
torchdata 0.6.1
torchtext 0.15.2
torchvision 0.15.2

TsingZ0 commented 1 month ago

The issue may arise from an incompatibility due to changes in the environment, such as a GPU driver update. The code operates correctly on my machine. Please reinstall all the packages using the latest yaml file.
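Since pip list only shows package versions, a quick way to check the environment side of this is to ask PyTorch which CUDA version it was built against and whether it can currently see the GPU. The following is a rough sketch, not something from PFLlib. The function name cuda_report is hypothetical, and it assumes device index 0 to match the -did 0 flag in the command above.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the installed torch build and the visible GPU."""
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in this environment
        return {"torch": None}

    import torch

    report = {
        "torch": torch.__version__,
        # CUDA toolkit version the wheel was built with (None for CPU-only builds)
        "built_with_cuda": torch.version.cuda,
        # False usually points at a driver or installation problem
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        # Device 0 is an assumption taken from the -did 0 flag
        report["device_0"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If cuda_available is False, or the build's CUDA version is newer than what the installed driver supports, reinstalling the environment as suggested above is a reasonable next step.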