Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Is it possible to run Llama 2 70B with 80 GB? #1262

Open vabatista opened 6 months ago

vabatista commented 6 months ago

I'm trying to finetune Llama 2 70B on an NVIDIA A100 with 80 GB, but even with batch_size = 1 I'm getting an OOM error.

I'm using LoRA with quantization like this: `plugins = BitsandbytesPrecision('nf4-dq', torch.bfloat16)`

Am I missing something?
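For context, a minimal sketch of how a `BitsandbytesPrecision` plugin like the one above is typically wired into a Fabric run. This is illustrative, not litgpt's `finetune/lora.py`; the tiny `nn.Linear` is a stand-in for the LoRA-wrapped Llama 2 70B.

```python
import torch
import lightning as L
from lightning.fabric.plugins import BitsandbytesPrecision

# NF4 double-quantization with bfloat16 compute -- the same settings as in the report above.
plugins = BitsandbytesPrecision(mode="nf4-dq", dtype=torch.bfloat16)
fabric = L.Fabric(accelerator="cuda", devices=1, plugins=plugins)
fabric.launch()

with fabric.init_module():
    # Stand-in model; in the real run this would be the LoRA-wrapped Llama 2 70B.
    model = torch.nn.Linear(4096, 4096, bias=False)

# Only trainable (LoRA) parameters go to the optimizer.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=3e-4)
# The bnb plugin quantizes eligible Linear layers to NF4 as the model is set up.
model, optimizer = fabric.setup(model, optimizer)
```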

hjeon2k commented 6 months ago

I'm having a similar problem with LLaMA 33B on 2x NVIDIA A100 80 GB, even with a micro-batch size of 1. It's confusing because when I run the LoRA finetuning on a single GPU it fits in about 70 GB of memory, but when I run the same finetuning with the same configuration on two GPUs, it goes OOM.

It seems that the FSDP strategy does not actually shard the pre-trained weights, only the trainable parameters (mentioned in https://github.com/pytorch/pytorch/issues/95805).

So I updated torch to a nightly version, updated the lightning library, and changed the FSDP strategy's auto-wrap policy as follows (based on https://huggingface.co/docs/peft/accelerate/fsdp and https://github.com/meta-llama/llama-recipes/blob/main/src/llama_recipes/utils/fsdp_utils.py):

```python
from functools import partial

import torch
from torch.distributed.fsdp.wrap import _or_policy, lambda_auto_wrap_policy, transformer_auto_wrap_policy
from lightning.fabric.strategies import FSDPStrategy
from litgpt.model import Block  # lit_gpt.model.Block in older checkouts


def fsdp_auto_wrap_policy(block: torch.nn.Module):
    # Wrap each leaf module that holds a trainable weight (the LoRA parameters) in its
    # own FSDP unit, so frozen and trainable weights never share a flat parameter group.
    def lambda_policy_fn(module):
        return (
            len(list(module.named_children())) == 0
            and getattr(module, "weight", None) is not None
            and module.weight.requires_grad
        )

    lambda_policy = partial(lambda_auto_wrap_policy, lambda_fn=lambda_policy_fn)
    # Also wrap each transformer block, so the frozen pre-trained weights get sharded too.
    transformer_wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={block})
    return partial(_or_policy, policies=[lambda_policy, transformer_wrap_policy])


auto_wrap_policy = fsdp_auto_wrap_policy(Block)
strategy = FSDPStrategy(
    auto_wrap_policy=auto_wrap_policy,
    state_dict_type="full",
    limit_all_gathers=True,
    cpu_offload=False,
)
```

With that change the pre-trained model appears to be properly sharded across the 2 GPUs, but I still get OOM once training starts.
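As a sanity check (not part of the thread's code), printing per-rank parameter counts and allocated memory right after the model is wrapped shows whether the weights are really sharded; `fabric` is assumed to be the Fabric instance driving the finetuning.

```python
import torch

def report_sharding(fabric, model):
    # After fabric.setup(model) with FSDP, each rank should hold only its shard:
    # with 2 ranks, roughly half the parameters. If every rank reports the full
    # count, the pre-trained weights were not sharded.
    local_params = sum(p.numel() for p in model.parameters())
    allocated_gb = torch.cuda.memory_allocated() / 1e9
    print(f"rank {fabric.global_rank}: {local_params / 1e9:.2f}B local params, "
          f"{allocated_gb:.1f} GB allocated")
    fabric.barrier()
```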

Although this didn't solve my problem, I hope it helps you.

I'd also like to leave a question: (1) Can the code in lit-gpt/finetune/lora.py support LoRA finetuning with the 'pre-trained' model parameters sharded as well, not only the 'LoRA' parameters? (2) If not, is there a method or workaround to support this, e.g. naive pipeline parallelism? Many users seem to be hitting the same problem, yet no concrete solution has been suggested.

rasbt commented 6 months ago

Thanks for the very thorough comment and explanation, and thanks for sending the improved FSDP code along. I remember @awaelchli also looking into something FSDP-related, and that might be relevant here. I think one workaround would be setting `cpu_offload=True`, but that would obviously slow things down a lot.

In your case, reducing the micro-batch size or the context length isn't really an option either, since, as you said, it already works on a single GPU and you are already using the smallest micro-batch size.
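For reference, a minimal sketch of that `cpu_offload=True` workaround with Lightning's FSDPStrategy, reusing the auto-wrap policy defined earlier in the thread:

```python
from lightning.fabric.strategies import FSDPStrategy

strategy = FSDPStrategy(
    auto_wrap_policy=auto_wrap_policy,  # the policy from fsdp_auto_wrap_policy(Block) above
    state_dict_type="full",
    limit_all_gathers=True,
    cpu_offload=True,  # keep shards in CPU memory between uses: slower, but frees GPU memory
)
```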

vabatista commented 6 months ago

I also tried to finetune Mixtral 8x7B on an A100 GPU and got OOM even with batch_size=1 and micro_batch_size=1. I'm using LoRA.