alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Apache License 2.0

Qwen Chat CUDA OutOfMemory #63

Open xorange opened 1 month ago

xorange commented 1 month ago

RTX 4090 24G, Qwen-7B-Chat

loads OK:

model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)

But the following causes OutOfMemoryError

# rtp_sys.conf
#
# [
#     {"task_id": 1, "prompt": " <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>"}
# ]

import os
os.environ['MULTI_TASK_PROMPT'] = './rtp_sys.conf'
model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)

File "/data1/miniconda/xxx/rtp-llm/lib/python3.10/site-packages/maga_transformer/utils/model_weights_loader.py", line 304, in _load_layer_weight
    tensor = self._split_and_sanitize_tensor(tensor, weight).to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory.

I've tried with and without export ENABLE_FMHA=OFF. I'm following this link: SystemPrompt-Tutorial
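
For reference, a quick way to see how much headroom is left right after the plain load (the setup without MULTI_TASK_PROMPT) is to query the device with standard PyTorch calls; nothing here is rtp-llm specific:

import torch

# Run right after Pipeline(model, model.tokenizer) in the setup that loads OK.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"device free : {free_bytes / 1024**3:.2f} GiB")
print(f"device total: {total_bytes / 1024**3:.2f} GiB")
print(f"torch allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"torch reserved : {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")

Presumably whatever MULTI_TASK_PROMPT pre-computes for the system prompt has to fit into that remaining headroom.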

For the record, my requirements here are:

  1. I have 2 LoRAs, and during one round of chat I have to switch between them.
  2. I need to use the chat interface. Since Qwen does not come with a chat_template, I need a way to implement "make_context" myself (see the sketch below).

Because of requirement 1, python3 -m maga_transformer.start_server plus an HTTP POST with an OpenAI-style request does not work for me. (Or, if it is possible to switch adapters on a running server, please tell me.)
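
To make requirement 2 concrete, here is a minimal sketch of a ChatML-style prompt builder along the lines of Qwen's make_context. The tag layout follows Qwen's chat format; the adapter_name keyword in the final call is only an assumption based on the LoRA tutorial and may differ in this rtp-llm version:

def make_context(query, history=None, system="You are a helpful assistant."):
    # Build a ChatML-style prompt string for Qwen-7B-Chat.
    history = history or []
    prompt = "<|im_start|>system\n" + system + "<|im_end|>\n"
    for user_turn, assistant_turn in history:
        prompt += "<|im_start|>user\n" + user_turn + "<|im_end|>\n"
        prompt += "<|im_start|>assistant\n" + assistant_turn + "<|im_end|>\n"
    prompt += "<|im_start|>user\n" + query + "<|im_end|>\n<|im_start|>assistant\n"
    return prompt

# Hypothetical usage: build the prompt, then pick an adapter per call.
# The adapter_name keyword is assumed, not verified against this version.
prompt = make_context("Hello, who are you?")
for res in pipeline(prompt, adapter_name="lora_1"):
    print(res)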

netaddi commented 1 month ago

Hi there. CUDA OOM is usually expected behaviour, and it seems that it is possible in your setup. Maybe you can try int8 quantization, which saves a lot of CUDA memory.
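
A rough sketch of what that could look like, assuming ModelConfig exposes an int8 weight-type option (the weight_type argument name below is an assumption; please check the rtp-llm quantization docs for the exact knob):

# Sketch only: weight_type is an assumed argument name for enabling
# int8 weight-only quantization; verify against the quantization docs.
model_config = ModelConfig(
    lora_infos={
        "lora_1": conf['lora_1'],
        "lora_2": conf['lora_2'],
    },
    weight_type="int8",
)
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)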

xorange commented 1 month ago

> Hi there. CUDA OOM is usually expected behaviour, and it seems that it is possible in your setup. Maybe you can try int8 quantization, which saves a lot of CUDA memory.

I'm not sure why rtp-llm loads this model successfully but then fails once the system prompt config (MULTI_TASK_PROMPT) is provided.

I did not even start to chat.