Thunder should support activation checkpointing, activation offloading, and sequence parallelism to enable long context models.
Modifying the benchmark script to measure PyTorch performance on these models is a good first step and keeps reminding us of the need for this feature.
🚀 Feature
Add activation checkpointing to the benchmark_litgpt script.
Motivation
LitGPT models such as Mistral-7B-v0.2, vicuna-13b-v1.5-16k, longchat-13b-16k, CodeLlama-13b-hf, and CodeLlama-34b-hf have larger context lengths, which makes the memory needed to store activation values high. FSDP doesn't shard activations, so we can get OOM errors irrespective of the number of GPUs used.
Pitch
Changes include:

- A new parameter on Benchmark_litGPT (checkpoint_activations).
- A new setup_activation helper, which will be used depending on the value of checkpoint_activations.
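As a rough illustration of what such a helper could look like, here is a minimal sketch using PyTorch's activation-checkpointing wrapper; the function name, the import of LitGPT's Block, and the wrapping policy are assumptions, not the final implementation:

```python
# Hypothetical sketch of an activation-checkpointing helper for benchmark_litgpt.
# The helper name and the per-Block wrapping policy are assumptions.
import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from litgpt.model import Block  # LitGPT transformer block


def setup_activation_checkpointing(model: torch.nn.Module) -> None:
    """Wrap each transformer Block so its activations are recomputed in the
    backward pass instead of being kept in memory after the forward pass."""
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda submodule: isinstance(submodule, Block),
    )
```

The benchmark would then call this helper only when checkpoint_activations is set, trading extra recomputation in the backward pass for a much smaller activation memory footprint.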
Alternatives
Tensor parallelism or sequence parallelism could be alternatives.
cc @crcrpar