alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Apache License 2.0

Speculative sampling with Medusa: loading the official Medusa model fails #105

Open wcsjtu opened 3 weeks ago

wcsjtu commented 3 weeks ago
CHECKPOINT_PATH=/root/tmpfs/lmsys--vicuna-33b-v1.3 \
MODEL_TYPE=llama \
CUDA_VISIBLE_DEVICES=6,7 \
TOKENIZER_PATH=/root/tmpfs/lmsys--vicuna-33b-v1.3/ \
TP_SIZE=2 \
WORLD_SIZE=2 \
START_PORT=8514 \
SP_CHECKPOINT_PATH=/root/tmpfs/medusa-vicuna-33b-v1.3/ \
GEN_NUM_PER_CIRCLE=4 \
python3 -m maga_transformer.start_server

Before starting, following the documentation, I added this to the llama config.json:

"medusa_config": {
      "medusa_num_heads": 2,
      "medusa_num_layers": 1
  }

Loading the model then fails with:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/start_server.py", line 35, in local_rank_start
    raise e
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/start_server.py", line 32, in local_rank_start
    app.start()
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/server/inference_app.py", line 38, in start
    self.inference_server.start()
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/server/inference_server.py", line 60, in start
    self._inference_worker = InferenceWorker()
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/server/inference_worker.py", line 55, in __init__
    self.model: AsyncModel = ModelFactory.create_from_env()
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/model_factory.py", line 180, in create_from_env
    model = ModelFactory.from_model_config(normal_model_config, sp_model_config)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/model_factory.py", line 72, in from_model_config
    model = ModelFactory._create_model(model_config)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/model_factory.py", line 50, in _create_model
    model = model_cls.from_config(config)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/models/gpt.py", line 107, in from_config
    return cls(config)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/models/gpt.py", line 83, in __init__
    self.load()
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/models/gpt.py", line 130, in load
    self._load_weights(self.config.ref_model, self.config.ref_dict, device)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/models/gpt.py", line 160, in _load_weights
    self.weight = model_weights_loader.load_weights_from_scratch(num_process=load_parallel_num)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/utils/model_weights_loader.py", line 130, in load_weights_from_scratch
    for name, tensor in self._load_medusa_weights(self._model_weights_info.medusa_weights):
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/utils/model_weights_loader.py", line 146, in _load_medusa_weights
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/utils/model_weights_loader.py", line 143, in _load_medusa_weights
    results.append((name, self.load_tensor(name)[0]))
IndexError: list index out of range

It looks like the SP_CHECKPOINT_PATH=/root/tmpfs/medusa-vicuna-33b-v1.3/ setting has no effect: pointing it at a nonexistent path gives the same error.

baowendin commented 3 weeks ago

When using Medusa you don't need to set SP_CHECKPOINT_PATH. Just put the Medusa checkpoint into /root/tmpfs/lmsys--vicuna-33b-v1.3 and modify config.json.
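
For illustration, the expected layout presumably looks something like this (the file names are assumptions based on the two Hugging Face repos, not verified):

/root/tmpfs/lmsys--vicuna-33b-v1.3/
├── config.json                  # with the medusa_config block added
├── tokenizer.model
├── pytorch_model-*.bin          # the base vicuna weights
└── medusa_lm_head.pt            # the Medusa head checkpoint copied in (assumed name)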

wcsjtu commented 3 weeks ago

When using Medusa you don't need to set SP_CHECKPOINT_PATH. Just put the Medusa checkpoint into /root/tmpfs/lmsys--vicuna-33b-v1.3 and modify config.json.

That was indeed the problem, and the model starts now. But there is another issue.

The documentation says:

You also need to make sure the Medusa weight names in the checkpoint are medusa_head.{head}.{layer}.linear.weight, medusa_head.{head}.{layer}.linear.bias, and medusa_head.{head}.{self._medusa_layer_num}.weight, consistent with the official Medusa repo.

But the weights in the official Medusa repo on Hugging Face are named {head}.{layer}.linear.weight, with no medusa_head. prefix. They only loaded correctly after I added the prefix by hand.

The official weights are here: https://huggingface.co/FasterDecoding/medusa-vicuna-33b-v1.3
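
For anyone else hitting this, a minimal renaming sketch; it assumes the official checkpoint is a single state-dict file, and the medusa_lm_head.pt file name is an assumption:

import torch

# Load the official Medusa head checkpoint (file name is an assumption).
state_dict = torch.load("medusa_lm_head.pt", map_location="cpu")

# Official keys look like "{head}.{layer}.linear.weight"; rtp-llm expects
# the same keys with a "medusa_head." prefix.
renamed = {f"medusa_head.{k}": v for k, v in state_dict.items()}

torch.save(renamed, "medusa_lm_head.pt")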

wcsjtu commented 3 weeks ago

There's a new problem. The server comes up, but as soon as I call the completions endpoint it crashes with an error and exits.

Same environment and model as above. The start command is:

CHECKPOINT_PATH=/root/tmpfs/lmsys--vicuna-33b-v1.3 \
MODEL_TYPE=llama \
CUDA_VISIBLE_DEVICES=6,7 \
TOKENIZER_PATH=/root/tmpfs/lmsys--vicuna-33b-v1.3/ \
TP_SIZE=2 \
WORLD_SIZE=2 \
START_PORT=8514 \
GEN_NUM_PER_CIRCLE=5 \
python3 -m maga_transformer.start_server

Once the server is up, I send a request with:

curl http://localhost:8514/v1/chat/completions -X POST \
  -H 'content-type:application/json' \
  -d '{"temperature": 0.0, "top_p": 0.1, "messages": [{"role": "user", "content": "你是谁"}]}'

Then the server dies. The main.log output is:

[root][2024-08-22 08:43:02][303769][Thread-9 (run_engine)][/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/decoder_engine.py:step():131][ERROR] process run error: max() arg is an empty sequence, Traceback: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/decoder_engine.py", line 119, in step
    self.executor_.process(batch_query)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/medusa/medusa_model_executor.py", line 146, in process
    medusa_query = self._create_batch_query(batch_query)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/medusa/medusa_model_executor.py", line 30, in _create_batch_query
    max_medusa_length = max([q.seq_length for q in batch_query.streams]) + validate_token_length
ValueError: max() arg is an empty sequence

[root][2024-08-22 08:43:03][303593][MainThread][/usr/local/lib/python3.10/dist-packages/maga_transformer/start_server.py:multi_rank_start():62][ERROR] some proc is not alive, exit!
[uvicorn.error][2024-08-22 08:43:03][303768][MainThread][/usr/local/lib/python3.10/dist-packages/uvicorn/server.py:shutdown():263][INFO] Shutting down
[uvicorn.error][2024-08-22 08:43:03][303768][MainThread][/usr/local/lib/python3.10/dist-packages/uvicorn/server.py:shutdown():281][INFO] Waiting for connections to close. (CTRL+C to force quit)
[root][2024-08-22 08:43:04][303593][MainThread][/usr/local/lib/python3.10/dist-packages/maga_transformer/start_server.py:multi_rank_start():62][ERROR] some proc is not alive, exit!
[root][2024-08-22 08:43:05][303593][MainThread][/usr/local/lib/python3.10/dist-packages/maga_transformer/start_server.py:multi_rank_start():62][ERROR] some proc is not alive, exit!
[root][2024-08-22 08:43:06][303593][MainThread][/usr/local/lib/python3.10/dist-packages/maga_transformer/start_server.py:multi_rank_start():62][ERROR] some proc is not alive, exit!
[root][2024-08-22 08:43:07][303593][MainThread][/usr/local/lib/python3.10/dist-packages/maga_transformer/start_server.py:multi_rank_start():62][ERROR] some proc is not alive, exit!
[root][2024-08-22 08:43:08][303593][MainThread][/usr/local/lib/python3.10/dist-packages/maga_transformer/start_server.py:multi_rank_start():62][ERROR] some proc is not alive, exit!
[root][2024-08-22 08:43:09][303768][Thread-4 (wrapper)][/usr/local/lib/python3.10/dist-packages/maga_transformer/distribute/gang_server.py:_health_check_impl():147][ERROR] Gang server 127.0.1.1 heartbeat loss, do abort
[root][2024-08-22 08:43:09][303593][MainThread][/usr/local/lib/python3.10/dist-packages/maga_transformer/start_server.py:multi_rank_start():62][ERROR] some proc is not alive, exit!
baowendin commented 2 weeks ago

We haven't supported TP for Medusa yet; that should be the cause of the error.
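
A minimal sketch of the failure mode, assuming that under TP a rank can end up with no scheduled streams (streams here is a hypothetical stand-in for batch_query.streams):

# Hypothetical: under TP, this rank received no queries.
streams = []

try:
    # The pattern from _create_batch_query:
    max_medusa_length = max([q.seq_length for q in streams])
except ValueError as e:
    print(e)  # max() arg is an empty sequence

# A defensive variant would supply a default:
max_medusa_length = max((q.seq_length for q in streams), default=0)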

wcsjtu commented 2 weeks ago

We haven't supported TP for Medusa yet; that should be the cause of the error.

Starting with TP=1 instead gives a different error:

[uvicorn.error][2024-08-23 06:31:33][322372][MainThread][/usr/local/lib/python3.10/dist-packages/uvicorn/server.py:_log_started_message():217][INFO] Uvicorn running on http://0.0.0.0:8514 (Press CTRL+C to quit)
[root][2024-08-23 06:32:20][322372][Thread-7 (run_engine)][/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/decoder_engine.py:step():131][ERROR] process run error: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
, Traceback: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/decoder_engine.py", line 119, in step
    self.executor_.process(batch_query)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/medusa/medusa_model_executor.py", line 148, in process
    finished_list, accept_tokens_list, medusa_states_list = self._tree_sample(medusa_query, all_hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/medusa/medusa_model_executor.py", line 102, in _tree_sample
    self._create_medusa_state(logits[bias + batch_query.context_lengths_list[i] - 1: bias + batch_query.context_lengths_list[i]].unsqueeze(1),
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/medusa/medusa_model_executor.py", line 55, in _create_medusa_state
    candidates, tree_candidates = generate_candidates(medusa_logits, logits, self.medusa_buffer.tree_indices, self.medusa_buffer.retrieve_indices, self.medusa_config.top_k)
  File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/medusa/utils.py", line 155, in generate_candidates
    cart_candidates = tree_candidates_ext[retrieve_indices]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[root][2024-08-23 06:32:20][322372][MainThread][/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/decoder_engine.py:_generator_loop_wrap():53][INFO] request_id = 1, exception type = <class 'Exception'>, exception str CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
baowendin commented 2 weeks ago

It looks like some operation inside FasterTransformer is failing. Adding the environment variable FT_DEBUG_LEVEL=DEBUG will pinpoint the exact failure. We will try to reproduce this on an A100.
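
For example, the TP=1 start command with the debug variables added (CUDA_LAUNCH_BLOCKING=1 is the standard PyTorch suggestion from the error above; the exact WORLD_SIZE value for the TP=1 run is an assumption):

FT_DEBUG_LEVEL=DEBUG \
CUDA_LAUNCH_BLOCKING=1 \
CHECKPOINT_PATH=/root/tmpfs/lmsys--vicuna-33b-v1.3 \
MODEL_TYPE=llama \
TOKENIZER_PATH=/root/tmpfs/lmsys--vicuna-33b-v1.3/ \
TP_SIZE=1 \
WORLD_SIZE=1 \
START_PORT=8514 \
GEN_NUM_PER_CIRCLE=5 \
python3 -m maga_transformer.start_server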