NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

[BUG] CUDA error: an illegal memory access was encountered with Adam optimizer on H100 #1654

Open szhengac opened 1 year ago

szhengac commented 1 year ago

Describe the Bug

On an H100 SXM5, running the Adam optimizer kernel (FusedAdam) on its own fails with CUDA error: an illegal memory access was encountered for certain tensor sizes, such as 2359332864 elements. The GPU has 80GB of memory, while 2359332864 elements would use at most 35GB.
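The 35GB estimate can be checked with a short calculation. This is a rough sketch that assumes four fp32 buffers of the same size (the parameter, its gradient, and two Adam moment buffers); the actual allocation made by FusedAdam may differ.

```python
# Rough memory estimate for the reproduction tensor.
# Assumption: 4 fp32 buffers of equal size (param, grad, and two Adam
# moment buffers). The real optimizer may allocate more or less.
numel = 2359332864          # element count from the report
bytes_per_elem = 4          # torch.float is 32-bit
num_buffers = 4             # assumed buffer count, see above

total_bytes = numel * bytes_per_elem * num_buffers
total_gib = total_bytes / 2**30

print(f"{total_bytes} bytes, about {total_gib:.2f} GiB")
```

Under these assumptions the total stays well below the 80GB available, so running out of memory is an unlikely explanation for the error.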

Minimal Steps/Code to Reproduce the Bug

import torch
from apex.optimizers import FusedAdam

t = torch.zeros(2359332864, dtype=torch.float, device='cuda')
t.grad = torch.zeros_like(t)
params = [t]
optimizer = FusedAdam(params)
optimizer.step()
torch.cuda.synchronize()

Error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
**Expected Behavior**

**Environment**

```
OS: Ubuntu 20.04.6 LTS
GPU count and types: A single machine with 8 H100 SXM5
Interconnects: NVSwitch
Python version: Python 3.8.10
Container: nvcr.io/nvidia/pytorch:23.02-py3
```

eqy commented 1 year ago

A cursory glance at the Adam optimizer shows lots of int usage for indexing, and the tensor size is greater than INT_MAX (2**31 - 1), so this could be expected. CC @crcrpar
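To illustrate the point about INT_MAX, the snippet below compares the reported element count with the 32-bit signed limit and shows the value it wraps to when stored in a 32-bit signed integer. It is only an illustration of the arithmetic, not a reproduction of what the apex kernel actually computes.

```python
import ctypes

INT_MAX = 2**31 - 1
numel = 2359332864  # element count from the report

# The element count does not fit in a 32-bit signed int.
exceeds_int_max = numel > INT_MAX

# ctypes performs no overflow checking, so the value wraps around
# the same way a 32-bit signed int would.
wrapped = ctypes.c_int32(numel).value

print(exceeds_int_max)  # True
print(wrapped)          # -1935634432
```

An index that wraps to a negative value like this would point outside the tensor, which is consistent with an illegal memory access.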

00INDEX commented 1 year ago

@szhengac Hi, have you found a solution?

crcrpar commented 1 year ago

I guess the cause is the same as https://github.com/pytorch/pytorch/issues/101449, which I've been working on. Bear with me for a while: naively changing int to int64_t would sacrifice performance in the cases that don't require int64_t.
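One way to picture the trade-off is to pick the index width per call, keeping 32-bit indexing whenever the sizes allow it. The sketch below is hypothetical: pick_index_type is an illustrative helper, not an apex or PyTorch function, and a real kernel might need a more conservative threshold if its indices can run past the element count (for example when sizes are rounded up to a chunk boundary).

```python
INT_MAX = 2**31 - 1


def pick_index_type(numel: int) -> str:
    """Hypothetical helper: choose the narrowest safe index type.

    Returns "int32" when every element index fits in a 32-bit signed
    int, and "int64" otherwise. Assumes indices never exceed numel - 1.
    """
    return "int32" if numel <= INT_MAX else "int64"


# Small tensors keep the faster 32-bit path.
print(pick_index_type(1024))        # int32
# The tensor from the report needs 64-bit indexing.
print(pick_index_type(2359332864))  # int64
```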

Godricly commented 1 year ago

Can we also have an update on apex?

sxthunder commented 11 months ago

Is there any progress?

yangky11 commented 9 months ago

Any update on this?