ai-computing / aicomp


Support for optimizer state offloading #15

Open ememos opened 1 month ago

ememos commented 1 month ago

The Adam optimizer's state can consume a large amount of GPU memory and potentially cause OOM (out-of-memory) errors during training. To free up GPU memory during the forward/backward passes, we need a feature that offloads the optimizer's state to CPU memory when necessary.
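
As a rough illustration of why this matters (my own back-of-the-envelope estimate, not from the repository): Adam keeps two FP32 buffers (`exp_avg` and `exp_avg_sq`) per parameter, i.e. about 8 extra bytes per parameter on top of the weights and gradients.

```python
# Hedged estimate of Adam's extra optimizer-state memory.
# Adam stores two FP32 tensors per parameter -> ~8 bytes/parameter.
def adam_state_bytes(num_params: int) -> int:
    return 2 * 4 * num_params  # two FP32 buffers, 4 bytes each

# For a 1B-parameter model, the Adam state alone is roughly 7.45 GiB.
print(f"{adam_state_bytes(1_000_000_000) / 2**30:.2f} GiB")
```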

ememos commented 1 month ago

A feature has been added that, when GPU memory runs short around the backward pass while training large models, offloads the optimizer state to CPU memory. The offloaded state is moved back to the GPU before the optimizer step. Each GPU can decide independently, based on its own memory situation, whether to offload or to proceed as usual without offloading. A rough sketch of the idea follows below.
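
The sketch below is only my illustration of the described behavior, not the code in this repository. The helper names (`offload_optimizer_state`, `restore_optimizer_state`, `maybe_offload_and_step`) and the free-memory threshold are assumptions; it simply moves the optimizer's state tensors to the CPU before the backward pass and brings them back before `optimizer.step()`.

```python
import torch


def offload_optimizer_state(optimizer: torch.optim.Optimizer) -> None:
    """Move all optimizer state tensors (e.g. Adam's exp_avg/exp_avg_sq) to CPU."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value) and value.is_cuda:
                state[key] = value.to("cpu", non_blocking=True)


def restore_optimizer_state(optimizer: torch.optim.Optimizer, device) -> None:
    """Move offloaded state tensors back to the GPU before optimizer.step()."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value) and not value.is_cuda:
                state[key] = value.to(device, non_blocking=True)


def maybe_offload_and_step(optimizer, loss, device, threshold_bytes=2 << 30):
    """Hypothetical per-GPU decision: offload only if free memory is below a threshold."""
    free_bytes, _ = torch.cuda.mem_get_info(device)
    offloaded = free_bytes < threshold_bytes
    if offloaded:
        offload_optimizer_state(optimizer)          # free GPU memory for the backward pass
    loss.backward()
    if offloaded:
        restore_optimizer_state(optimizer, device)  # bring state back for the parameter update
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```

Because the decision is made from each GPU's own `torch.cuda.mem_get_info`, every rank can offload or skip offloading independently, matching the per-GPU behavior described above.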