AGI-Edgerunners / LLM-Adapters

Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
https://arxiv.org/abs/2304.01933
Apache License 2.0

Inference very slow since some of the params are going to CPU after fine-tuning Nemotron-70B #73

Open · pulkitmehtaworkmetacube opened 1 week ago

pulkitmehtaworkmetacube commented 1 week ago

We did the following:

  1. Took the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF base model and fine-tuned it on our custom dataset for a classification task. Training completed in about 6 hours and we obtained adapter weights.

  2. Tried to run inference on our test set by first loading the base model and then the adapter weights with PEFT (roughly as in the sketch below).
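For reference, a minimal sketch of the loading flow in step 2, using the standard transformers + peft APIs (the adapter path is a placeholder for our checkpoint directory):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
ADAPTER_PATH = "./nemotron-70b-classifier-adapter"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# device_map="auto" lets accelerate shard the 70B weights across both GPUs
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the LoRA adapter weights produced in step 1
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model.eval()
```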

We have two A100 80 GB GPUs. After step 1 we see around 67 GB of GPU memory in use on each GPU, but after loading the adapter one of the GPUs hits the 80 GB mark and we get the message: Some parameters are on the meta device because they were offloaded to the cpu.
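One workaround we are considering (a sketch, not verified): capping per-GPU memory so that accelerate produces a more balanced split and leaves headroom for the adapter and activations, instead of filling GPU 0 and offloading the remainder to CPU. The 75GiB figure is an assumption for 80 GB cards and would need tuning:

```python
import torch
from transformers import AutoModelForCausalLM

# max_memory caps what accelerate may place on each GPU; the 75GiB cap
# is a guess to leave headroom for the LoRA weights and activations.
base_model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB"},
)
```

After the adapter is attached, calling model.merge_and_unload() should also fold the LoRA matrices into the base weights, removing the separate adapter matmuls at inference time (assuming an unquantized model).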

We also tried loading the base model in 8-bit, but then we get the error: TypeError: Input tensors need to be on the same GPU, but found the following tensor and device combinations: [(torch.Size([170, 8192]), device(type='cuda', index=0)), (torch.Size([8192, 8192]), device(type='cuda', index=1)), (torch.Size([170, 8192]), device(type='cuda', index=0))]
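The 8-bit attempt looked roughly like the sketch below, using BitsAndBytesConfig (the current transformers API, replacing the older load_in_8bit flag on from_pretrained). Our assumption, not a confirmed diagnosis, is that the error means one matmul ended up with its inputs and weights on different GPUs:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 8-bit quantization via BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

base_model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    quantization_config=bnb_config,
    device_map="auto",  # should keep whole decoder layers on a single GPU
)
model = PeftModel.from_pretrained(base_model, "./nemotron-70b-classifier-adapter")
```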

Any suggestions or leads would be highly appreciated.

pulkitmehtaworkmetacube commented 1 week ago

Please review this and provide help.