AGI-Edgerunners / LLM-Adapters

Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
https://arxiv.org/abs/2304.01933
Apache License 2.0

Inference very slow since some of the params are going to CPU after fine-tuning Nemotron-70B #73

Open · pulkitmehtaworkmetacube opened 1 week ago

pulkitmehtaworkmetacube commented 1 week ago

We did the following:

  1. Took the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF base model and fine-tuned it on our custom dataset for a classification task. Training completed in about 6 hours and we obtained adapter weights.

  2. Tried to run inference on our test set by first loading the base model and then the adapter weights with PEFT (roughly as in the sketch below).
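For reference, a minimal sketch of the loading flow in step 2, using the standard transformers + peft APIs (the adapter path is a placeholder for our checkpoint directory):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
ADAPTER_PATH = "./nemotron-70b-classifier-adapter"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# device_map="auto" lets accelerate shard the 70B weights across both GPUs
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the LoRA adapter weights produced in step 1
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model.eval()
```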

We have two A100 80 GB GPUs. After step 1 we see around 67 GB of GPU memory in use on each GPU, but after loading the adapter one of the GPUs hits the 80 GB mark and we get the message: Some parameters are on the meta device because they were offloaded to the cpu.
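One workaround we are considering (a sketch, not verified): capping per-GPU memory so that accelerate produces a more balanced split and leaves headroom for the adapter and activations, instead of filling GPU 0 and offloading the remainder to CPU. The 75GiB figure is an assumption for 80 GB cards and would need tuning:

```python
import torch
from transformers import AutoModelForCausalLM

# max_memory caps what accelerate may place on each GPU; the 75GiB cap
# is a guess to leave headroom for the LoRA weights and activations.
base_model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB"},
)
```

After the adapter is attached, calling model.merge_and_unload() should also fold the LoRA matrices into the base weights, removing the separate adapter matmuls at inference time (assuming an unquantized model).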

We also tried loading the base model in 8-bit, but then we get the error: TypeError: Input tensors need to be on the same GPU, but found the following tensor and device combinations: [(torch.Size([170, 8192]), device(type='cuda', index=0)), (torch.Size([8192, 8192]), device(type='cuda', index=1)), (torch.Size([170, 8192]), device(type='cuda', index=0))]
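The 8-bit attempt looked roughly like the sketch below, using BitsAndBytesConfig (the current transformers API, replacing the older load_in_8bit flag on from_pretrained). Our assumption, not a confirmed diagnosis, is that the error means one matmul ended up with its inputs and weights on different GPUs:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 8-bit quantization via BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

base_model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    quantization_config=bnb_config,
    device_map="auto",  # should keep whole decoder layers on a single GPU
)
model = PeftModel.from_pretrained(base_model, "./nemotron-70b-classifier-adapter")
```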

Any suggestions or leads would be highly appreciated.

pulkitmehtaworkmetacube commented 1 week ago

Please review this and provide help.