NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Fix illegal memory access with multi_tensor_apply size above INT_MAX #1825

Closed: gdb closed this issue 3 months ago

gdb commented 3 months ago

Currently, multi_tensor_apply causes an illegal memory access when a tensor holds more than INT_MAX elements, due to an overflow in the sizes field of TensorListMetadata. This can be reproduced using the following standalone script:

import torch, amp_C
from apex.multi_tensor_apply import multi_tensor_applier
multi_tensor_adam = amp_C.multi_tensor_adam

# 2**32 + 1 elements per tensor: more than INT_MAX, so the element count no
# longer fits in the 32-bit sizes field of TensorListMetadata.
size = 2**32 + 1
g_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
p_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
m_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
v_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
_dummy_overflow_buf = torch.zeros(1, dtype=torch.int32, device='cuda')

# Trailing scalars are the Adam arguments: lr, beta1, beta2, eps, step, mode,
# bias_correction, weight_decay.
multi_tensor_applier(multi_tensor_adam, _dummy_overflow_buf, [g_32, p_32, m_32, v_32], 0.0, 0.9, 0.95, 1e-08, 1, 1, 1, 0.1)
print(g_32)
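
For context, the metadata that multi_tensor_apply packs into the kernel's argument buffer stores each tensor's element count as a 32-bit int, so a tensor with more than INT_MAX elements cannot be represented and the kernel's chunk indexing runs out of bounds. A rough sketch of that struct, paraphrased from csrc/multi_tensor_apply.cuh (the array bounds are the constants at the lines linked in the next comment and may have changed since):

// Sketch only, paraphrased from csrc/multi_tensor_apply.cuh; not the exact source.
constexpr int depth_to_max_tensors[5] = {110, 64, 48, 36, 30};
constexpr int depth_to_max_blocks[5]  = {320, 320, 320, 320, 320};

template <int n>
struct TensorListMetadata {
  void* addresses[n][depth_to_max_tensors[n - 1]];
  int sizes[depth_to_max_tensors[n - 1]];  // 32-bit element counts: a tensor with
                                           // more than INT_MAX elements overflows here
  unsigned char block_to_tensor[depth_to_max_blocks[n - 1]];
  int block_to_chunk[depth_to_max_blocks[n - 1]];
  int start_tensor_this_launch;
};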
awgu commented 3 months ago

cc @crcrpar: are the following out of date? https://github.com/NVIDIA/apex/blob/b3bd26a8004007e6d2d098934e063b952cab86f1/csrc/multi_tensor_apply.cuh#L15-L17 I see the same limits in PyTorch, where you already updated to use int64_t in https://github.com/pytorch/pytorch/pull/101760. Otherwise, I would expect that changing to int64_t increases the TensorListMetadata struct size and hence the kernel argument size.

(Though, it seems that CUDA 12.1 on Volta+ increased the kernel arg size limit from 4 KB to 32 KB.)
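
As a rough back-of-envelope check (editor's sketch, not part of the thread), assuming the struct layout and constants sketched above, widening sizes from int to int64_t grows the metadata by at most a few hundred bytes, so even the older 4 KB kernel parameter limit still has headroom:

#include <cstdint>
#include <cstdio>

// Assumed constants, copied from the sketch above (values from the linked
// csrc/multi_tensor_apply.cuh; they may differ in other revisions).
constexpr int depth_to_max_tensors[5] = {110, 64, 48, 36, 30};
constexpr int depth_to_max_blocks[5]  = {320, 320, 320, 320, 320};

// Same layout as the sketch, parameterized over the type of the sizes field.
template <typename SizeT, int n>
struct Meta {
  void* addresses[n][depth_to_max_tensors[n - 1]];
  SizeT sizes[depth_to_max_tensors[n - 1]];
  unsigned char block_to_tensor[depth_to_max_blocks[n - 1]];
  int block_to_chunk[depth_to_max_blocks[n - 1]];
  int start_tensor_this_launch;
};

int main() {
  std::printf("depth 4: int -> %zu bytes, int64_t -> %zu bytes\n",
              sizeof(Meta<int, 4>), sizeof(Meta<int64_t, 4>));  // roughly 2.9 KB vs 3.0 KB
  std::printf("depth 1: int -> %zu bytes, int64_t -> %zu bytes\n",
              sizeof(Meta<int, 1>), sizeof(Meta<int64_t, 1>));  // largest case, still < 4 KB
  // The remaining kernel arguments (chunk size, noop-flag pointer, Adam
  // hyperparameters) add only a few dozen bytes, so the total stays below
  // the pre-CUDA-12.1 limit of 4096 bytes of kernel parameters.
  static_assert(sizeof(Meta<int64_t, 1>) < 4096, "widened struct still fits in 4 KB");
  static_assert(sizeof(Meta<int64_t, 4>) < 4096, "widened struct still fits in 4 KB");
  static_assert(sizeof(Meta<int64_t, 5>) < 4096, "widened struct still fits in 4 KB");
  return 0;
}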

crcrpar commented 3 months ago

> I would expect that changing to use int64_t increases the TensorListMetadata struct size and hence the kernel arg size.

Yes, but apex does not have a multi-tensor-apply variant that takes a list of scalars, so we might be able to avoid tweaking depth_to_max_tensors and depth_to_max_blocks.
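
For illustration only, a minimal sketch of the direction described here (an assumption about the shape of the change, not the actual patch): widen just the per-tensor size field, leave the existing limits untouched, and make sure the host-side chunk count is also computed in 64 bits.

// Illustration only: sizes widened to int64_t, constants left untouched.
template <int n>
struct TensorListMetadata {
  void* addresses[n][depth_to_max_tensors[n - 1]];
  int64_t sizes[depth_to_max_tensors[n - 1]];  // was int; now holds counts > INT_MAX
  unsigned char block_to_tensor[depth_to_max_blocks[n - 1]];
  int block_to_chunk[depth_to_max_blocks[n - 1]];
  int start_tensor_this_launch;
};

// Host side, sketched: derive the chunk count from numel() (an int64_t) so the
// division itself cannot overflow either.
//   int64_t chunks = (tensor.numel() + chunk_size - 1) / chunk_size;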
