carbonscott / exp-peaknet

Run peaknet experiments

Need better error handling to identify faulty GPUs when an ECC error occurs #7

Open carbonscott opened 2 weeks ago

carbonscott commented 2 weeks ago

It's cumbersome to identify which GPU is faulty in a multi-node system: the traceback for an ECC error does not say which node or device raised it.

carbonscott commented 1 week ago
  File "/sdf/data/lcls/ds/prj/prjcwang31/results/proj-peaknet/train.distill.py", line 396, in <module>
    timestamp = init_logger(
  File "/sdf/home/c/cwang31/codes/peaknet/peaknet/utils_fsdp.py", line 692, in init_logger
    timestamp = broadcast_dict(dict(timestamp=timestamp), src = 0, device = device).get('timestamp')
  File "/sdf/home/c/cwang31/codes/peaknet/peaknet/utils_fsdp.py", line 255, in broadcast_dict
    tensor_size = torch.tensor([0], dtype=torch.long, device = device)
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
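
One possible way to make the faulty GPU easier to spot is to wrap the first CUDA call on each rank so that any `RuntimeError` is logged together with the hostname, rank, and device before it propagates. The sketch below is only an illustration: `describe_rank` and `report_cuda_failure` are hypothetical helpers, not part of peaknet, and it assumes the job is launched with something like `torchrun`, which sets the `RANK` and `LOCAL_RANK` environment variables.

```python
import contextlib
import os
import socket
import sys


def describe_rank(device=None):
    """Return a one-line label for this process: host, ranks, and device.

    Hypothetical helper. RANK / LOCAL_RANK are assumed to be set by the
    launcher (e.g. torchrun); '?' is printed when they are missing.
    """
    return (
        f"host={socket.gethostname()} "
        f"rank={os.environ.get('RANK', '?')} "
        f"local_rank={os.environ.get('LOCAL_RANK', '?')} "
        f"cuda_visible_devices={os.environ.get('CUDA_VISIBLE_DEVICES', 'unset')} "
        f"device={device}"
    )


@contextlib.contextmanager
def report_cuda_failure(device=None, stream=sys.stderr):
    """Log where a RuntimeError was raised, then re-raise it unchanged.

    PyTorch surfaces CUDA failures such as the ECC error above as
    RuntimeError, so that is the exception type caught here.
    """
    try:
        yield
    except RuntimeError as err:
        print(
            f"[CUDA FAILURE] {describe_rank(device)} :: {err}",
            file=stream,
            flush=True,
        )
        raise


# Intended use around the call that failed in the traceback (needs torch,
# so it is left commented out here):
#
# with report_cuda_failure(device):
#     tensor_size = torch.tensor([0], dtype=torch.long, device=device)
```

Because CUDA errors can be reported asynchronously, the line that raises is not necessarily the one that triggered the fault, but the logged host and local rank should still narrow the search to one device on one node.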