eth-sri / lmql

A language for constraint-guided and efficient LLM programming.
https://lmql.ai
Apache License 2.0

Serve-model running out of memory with multiple GPUs #287

Open motaatmo opened 9 months ago

motaatmo commented 9 months ago

Hi,

I'm trying to process large context sizes with lmql serve-model. Unfortunately, I'm running into "CUDA out of memory" errors. Using 1, 2, or 3 A100s did not make a difference. Reading the documentation, I had the impression that lmql serve-model should combine the VRAM of multiple GPUs; am I wrong about that?

Greetings,

Moritz

lbeurerkellner commented 9 months ago

Hi there Moritz. With transformers, we use device_map='auto', which should automatically make use of all available GPUs (e.g. those specified via CUDA_VISIBLE_DEVICES). Could you check with nvidia-smi that all GPUs are indeed used by the inference process?
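
If watching nvidia-smi on the cluster is awkward, a small helper along these lines (just a plain-PyTorch sketch, not part of lmql) can log per-GPU memory from inside a test script:

import torch

def log_gpu_memory(tag: str = "") -> None:
    # print allocated and reserved VRAM for every visible CUDA device
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        print(f"{tag} GPU {i}: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")

Calling it once after loading the model and once after a generation should show whether both cards hold weights or only one of them does.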

motaatmo commented 9 months ago

Hi,

I did. I'm using Mistral 7B, and on startup both GPUs (one A100 each) fill their VRAM to about half (which already seems like a lot given the model size). In the runs that eventually fail, with sequence lengths of about 4500 tokens, there seems to be a pattern where the VRAM on both cards alternately fills to nearly 100% (card 1, then card 2, then card 1 again, and so on). So both cards are definitely in use. It looks as if the input data (plus intermediate data, I guess) gets swapped between the A100s. Unfortunately, I have no more than a very basic understanding of what is going on under the hood, so I can't tell for sure.

lbeurerkellner commented 9 months ago

Can you try running a long generation with just the transformers API, i.e. AutoModelForCausalLM.generate with device_map="auto"? From your description, it sounds like the model is cloned to each card rather than distributed across them.
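
One way to tell the two cases apart (a sketch on my side, assuming transformers with accelerate installed) is to inspect the device map that gets attached to the loaded model:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto"
)

# hf_device_map records which module ended up on which device; if all entries
# point to the same GPU, the weights are not actually sharded
print(model.hf_device_map)

With two A100s you would expect roughly half of the layers on cuda:0 and half on cuda:1.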

motaatmo commented 9 months ago

I will do so! Unfortunately, there is currently a queue on the SLURM node I have to use, so it might take some time. Just to be sure I do the right thing as soon as I can: would a simple script like the following be enough?

from transformers import AutoModelForCausalLM, AutoTokenizer  # type: ignore

model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
# tokenize as PyTorch tensors and move them to the device of the first model shard
inputs = tokenizer("The history of the internet starts with: ", return_tensors="pt").to(model.device)
# generate a reasonably long continuation so the multi-GPU memory behaviour shows up
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0], skip_special_tokens=True))

And while it is running, I monitor nvidia-smi to see how the VRAM usage develops. Correct?