Open szhengac opened 1 year ago
A cursory glance at the Adam optimizer shows lots of `int` usage for indexing, and the tensor size is greater than INT_MAX (2**31 - 1), so this could be expected. CC @crcrpar
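To make the overflow concrete, here is a minimal sketch in plain Python (not the actual kernel code) of what happens when the reported element count is stored in a signed 32-bit integer. The two's-complement wraparound is emulated by hand, since Python integers do not overflow.

```python
INT_MAX = 2**31 - 1
numel = 2359332864  # tensor size from the bug report

# The element count does not fit in a signed 32-bit int.
exceeds_int_max = numel > INT_MAX

# Emulate two's-complement wraparound of a 32-bit signed integer.
wrapped = ((numel + 2**31) % 2**32) - 2**31

print(exceeds_int_max)  # True
print(wrapped)          # -1935634432
```

A negative or otherwise wrapped index like this is a plausible source of an out-of-bounds access, although the exact failing expression in the kernel is not identified in this thread.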
@szhengac Hi, have you found a solution?
I guess the cause is the same as https://github.com/pytorch/pytorch/issues/101449, which I've been working on. Bear with me for a while: naively changing `int` to `int64_t` would sacrifice performance in the cases that don't require `int64_t`.
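One way to avoid paying for 64-bit indexing everywhere is to choose the index width per call, based on the largest tensor involved. The sketch below only illustrates that idea in Python; `pick_index_type` is a hypothetical helper, and this is not the fix that was implemented in PyTorch or apex.

```python
INT_MAX = 2**31 - 1

def pick_index_type(numels):
    """Hypothetical helper: choose the narrowest safe index type.

    Returns "int32" when every tensor's element count fits in a signed
    32-bit integer, and "int64" otherwise.
    """
    largest = max(numels, default=0)
    return "int32" if largest <= INT_MAX else "int64"

# Small tensors keep the faster 32-bit path.
small = pick_index_type([1024, 4096])
# The size from this issue forces the 64-bit path.
large = pick_index_type([1024, 2359332864])
print(small, large)
```

In a real kernel, the same decision would select between two template instantiations at launch time, so the common case keeps 32-bit index arithmetic.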
Can we also have an update on apex?
Is there any progress?
Any update on this?
Describe the Bug
On H100 SXM5, running the Adam optimizer kernel standalone results in `CUDA error: an illegal memory access was encountered` with certain tensor sizes, such as 2359332864. The GPU has 80GB, while 2359332864 elements would use 35GB at most.
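The 35GB figure can be reproduced with a quick calculation. This sketch assumes fp32 storage and four tensors of the same size (parameters, gradients, and Adam's two moment buffers); neither assumption is stated explicitly in the report.

```python
numel = 2359332864      # tensor size from the bug report
bytes_per_element = 4   # assumption: fp32
num_tensors = 4         # assumption: param, grad, exp_avg, exp_avg_sq

total_bytes = numel * bytes_per_element * num_tensors
total_gib = total_bytes / 2**30

print(f"{total_gib:.2f} GiB")  # about 35.16 GiB, well under 80GB
```

Under these assumptions the failure cannot be explained by running out of device memory.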
Minimal Steps/Code to Reproduce the Bug
Error:
**Expected Behavior**

**Environment**
```
OS: Ubuntu 20.04.6 LTS
GPU count and types: A single machine with 8 H100 SXM5
Interconnects: NVSwitch
Python version: Python 3.8.10
Container: nvcr.io/nvidia/pytorch:23.02-py3
```