Open carbonscott opened 2 weeks ago
  File "/sdf/data/lcls/ds/prj/prjcwang31/results/proj-peaknet/train.distill.py", line 396, in <module>
    timestamp = init_logger(
  File "/sdf/home/c/cwang31/codes/peaknet/peaknet/utils_fsdp.py", line 692, in init_logger
    timestamp = broadcast_dict(dict(timestamp=timestamp), src = 0, device = device).get('timestamp')
  File "/sdf/home/c/cwang31/codes/peaknet/peaknet/utils_fsdp.py", line 255, in broadcast_dict
    tensor_size = torch.tensor([0], dtype=torch.long, device = device)
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
It's cumbersome to identify faulty GPUs in a multi-node system: the traceback above does not say which host or which device raised the error.
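
One possible way to narrow this down is to ask each node for its per-GPU uncorrectable ECC counters and print them together with the hostname. The following is only a minimal sketch, not something taken from peaknet: the helper names (`parse_ecc_report`, `gpus_with_ecc_errors`, `local_ecc_report`) are hypothetical, and it assumes `nvidia-smi` is on the `PATH` of every node and that ECC reporting is enabled on the GPUs.

```python
import shutil
import socket
import subprocess

# Standard nvidia-smi query fields; "volatile" counters are reset on driver reload.
QUERY = "index,uuid,ecc.errors.uncorrected.volatile.total"


def parse_ecc_report(csv_text):
    """Parse `nvidia-smi --format=csv,noheader` output into (index, uuid, count) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip lines that do not match the three queried fields
        index, uuid, count = fields
        # The count is non-numeric (e.g. "[N/A]") when ECC is disabled or unsupported.
        rows.append((int(index), uuid, int(count) if count.isdigit() else None))
    return rows


def gpus_with_ecc_errors(rows):
    """Keep only GPUs that report a non-zero uncorrectable ECC count."""
    return [row for row in rows if row[2]]


def local_ecc_report():
    """Query the GPUs on this node; returns [] if nvidia-smi is not available."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_ecc_report(result.stdout)


if __name__ == "__main__":
    host = socket.gethostname()
    for index, uuid, count in gpus_with_ecc_errors(local_ecc_report()):
        print(f"{host}: GPU {index} ({uuid}) uncorrectable ECC errors: {count}")
```

Running the script once per node with whatever launcher the cluster provides would print one line per suspect GPU, prefixed with the hostname. If the driver has been reloaded since the failure, the volatile counter may already read zero; querying `ecc.errors.uncorrected.aggregate.total` instead might be more useful in that case.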