NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

64-bit indexing Adam #1786

Open cdm114514 opened 3 months ago

cdm114514 commented 3 months ago

Issues:

  1. Incomplete testing in the testLargeTensor method. Location: tests/L0/run_optimizers/test_adam.py. Description: The test is meant to check the correctness of FusedAdam by calling step() on two large tensors that share the same gradient, one updated by FusedAdam and the other by torch.optim.Adam. However, the test only invoked step() on the first optimizer.

  2. Type overflow in TensorListMetadata. Location: csrc/multi_tensor_apply.cuh. Description: The sizes[] and block_to_chunk[] arrays in TensorListMetadata were declared as int. This caused overflow when handling tensors whose length exceeds INT_MAX.
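
To illustrate the overflow in the second issue, here is a minimal sketch in plain Python of what happens when an element count larger than INT_MAX is stored in a 32-bit signed field. It does not reproduce the actual TensorListMetadata struct; the helper names `as_int32` and `as_int64` and the example length are assumptions made for illustration only.

```python
import ctypes

INT_MAX = 2**31 - 1  # largest value a 32-bit signed int can hold


def as_int32(n: int) -> int:
    """Value read back after storing n in a 32-bit signed field (wraps silently)."""
    return ctypes.c_int32(n).value


def as_int64(n: int) -> int:
    """Value read back after storing n in a 64-bit signed field."""
    return ctypes.c_int64(n).value


# Hypothetical tensor length, one element past INT_MAX.
numel = INT_MAX + 1

stored_32 = as_int32(numel)  # wraps around to a negative number
stored_64 = as_int64(numel)  # preserved exactly

print("true length:      ", numel)
print("as 32-bit signed: ", stored_32)
print("as 64-bit signed: ", stored_64)
```

A size that reads back as negative would make any bounds check or loop limit derived from it wrong, which is why a 64-bit type is the natural choice for tensors of this length.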

Solution: