Closed hkunzhe closed 3 months ago
By enabling gradient checkpointing, the (batch size, gradient accumulation steps) configuration can be changed from (1, 4) to (4, 1) while keeping the same number of gradient updates. This reduces VRAM usage from 20 GB to 14 GB and training time from 20 min to 15 min on an A100.
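The equivalence claimed above rests on the product of batch size and accumulation steps (the effective batch size) staying constant, so the optimizer takes the same number of steps per epoch either way. A minimal sketch of that accounting (the `updates_per_epoch` helper is illustrative, not part of this repo):

```python
def updates_per_epoch(num_samples: int, batch_size: int, grad_accum_steps: int) -> int:
    """Number of optimizer updates in one epoch, assuming num_samples
    divides evenly into micro-batches (drop_last semantics)."""
    micro_batches = num_samples // batch_size
    # Gradients from `grad_accum_steps` micro-batches are summed
    # before each optimizer.step().
    return micro_batches // grad_accum_steps

# (batch_size, grad_accum_steps) = (1, 4) vs (4, 1): identical update count,
# so the training dynamics are equivalent up to batch-norm-style effects.
assert updates_per_epoch(64, 1, 4) == updates_per_epoch(64, 4, 1) == 16
```

Gradient checkpointing is what makes the larger micro-batch fit: it discards intermediate activations during the forward pass and recomputes them during backward, trading extra compute for lower peak VRAM.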