ModelTC / lightllm

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Apache License 2.0

Many problems encountered while reproducing from source #187

Open lxnlxnlxnlxnlxn opened 8 months ago

lxnlxnlxnlxnlxn commented 8 months ago

LightLLM setup process

Reproducing the kvoff branch

Step 1: Create the Docker container

Pull the image: docker pull ghcr.io/modeltc/lightllm:main

The llama-7b model is too large to clone directly inside the server's Docker container (the download kept being interrupted by network failures), so I downloaded the model to my local machine, transferred it to the server with Xftp, and then mapped the model folder into the models folder of the lightllm source tree when creating the container.

Model repository: [huggyllama/llama-7b · Hugging Face](https://huggingface.co/huggyllama/llama-7b)
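
As an alternative to a manual download plus Xftp transfer, the download can be scripted with the huggingface_hub package, which resumes interrupted transfers. This is just a sketch, assuming huggingface_hub is installed on the machine doing the download; the target path is only an example:

```python
# Sketch: fetch huggyllama/llama-7b into a local directory with resume support,
# so an interrupted download can be continued instead of restarted.
# Assumes `pip install huggingface_hub`; the local path is only an example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="huggyllama/llama-7b",
    local_dir="/hdd/lxn/llama-7b",  # the path later mounted into the container
)
```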

docker run -itd --ipc=host --net=host  --name lxn_lightllm --gpus all -p 8080:8080 -v /hdd/lxn/llama-7b:/lightllm/lightllm/models/llama-7b ghcr.io/modeltc/lightllm:main /bin/bash
Step 2: Run

Install from source:

python setup.py install

Launch the model:

python -m lightllm.server.api_server --model_dir models/llama-7b --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 120000
Error message: OOM
load model error: CUDA out of memory. Tried to allocate 938.00 MiB (GPU 0; 31.75 GiB total capacity; 30.87 GiB already allocated; 97.94 MiB free; 30.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF CUDA out of memory. Tried to allocate 938.00 MiB (GPU 0; 31.75 GiB total capacity; 30.87 GiB already allocated; 97.94 MiB free; 30.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF <class 'torch.cuda.OutOfMemoryError'>
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/manager.py", line 257, in start_router_process
    asyncio.run(router.wait_to_model_ready())
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/manager.py", line 62, in wait_to_model_ready
    await asyncio.gather(*init_model_ret)
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/model_infer/model_rpc.py", line 229, in init_model
    ans : rpyc.AsyncResult = self._init_model(rank_id, world_size, weight_dir, max_total_token_num, load_way, mode)
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/model_infer/model_rpc.py", line 97, in exposed_init_model
    raise e
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/model_infer/model_rpc.py", line 68, in exposed_init_model
    self.model = LlamaTpPartModel(rank_id, world_size, weight_dir, max_total_token_num, load_way, mode)
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/models/llama/model.py", line 35, in __init__
    super().__init__(tp_rank, world_size, weight_dir, max_total_token_num, load_way, mode, weight_dict, finetune_config)
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/common/basemodel/basemodel.py", line 40, in __init__
    self._init_mem_manager()
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/models/llama/model.py", line 56, in _init_mem_manager
    self.mem_manager = self.memory_manager_class(self.max_total_token_num,
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/common/mem_manager.py", line 10, in __init__
    self._init_buffers(size, dtype, head_num, head_dim, layer_num)
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/common/mem_manager.py", line 14, in _init_buffers
    self.key_buffer = [torch.empty((size, head_num, head_dim), dtype=dtype, device="cuda") for _ in range(layer_num)]
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/common/mem_manager.py", line 14, in <listcomp>
    self.key_buffer = [torch.empty((size, head_num, head_dim), dtype=dtype, device="cuda") for _ in range(layer_num)]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 938.00 MiB (GPU 0; 31.75 GiB total capacity; 30.87 GiB already allocated; 97.94 MiB free; 30.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/manager.py", line 260, in start_router_process
    err_str = '\n'.join(traceback.format_exception(e))
TypeError: format_exception() missing 2 required positional arguments: 'value' and 'tb'
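
For reference, the OOM with max_total_token_num=120000 is expected on a 32 GB card: in fp16 the llama-7b KV cache costs about 0.5 MiB per token, so 120000 tokens alone would need roughly 60 GB on top of the ~14 GB of weights. A rough back-of-the-envelope sketch (my own estimate based on the standard llama-7b shapes, not an official lightllm formula):

```python
# Rough estimate of how many KV-cache tokens fit on one GPU for llama-7b in fp16.
# Shapes are the standard llama-7b config; memory figures are approximate.
layers, heads, head_dim = 32, 32, 128
bytes_per_elem = 2                                                # fp16
kv_per_token = 2 * layers * heads * head_dim * bytes_per_elem     # K and V buffers
print(kv_per_token / 2**20, "MiB per token")                      # 0.5 MiB

gpu_mem_gib = 32        # V100 32GB
weights_gib = 14        # ~7B params * 2 bytes
overhead_gib = 2        # activations, CUDA context, etc. (a guess)
free_bytes = (gpu_mem_gib - weights_gib - overhead_gib) * 2**30
print("max_total_token_num roughly", free_bytes // kv_per_token)  # ~32768
```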

After changing max_total_token_num from 120000 to 6000, the OOM error disappeared, but the errors below appeared instead (each run randomly produces one of the following three). I searched Google for similar errors but did not find a solution.

Error 1:
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/manager.py", line 270, in start_router_process
    loop.run_until_complete(router.loop_for_netio_req())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/manager.py", line 221, in loop_for_netio_req
    recv_req = await self.recv_from_httpserver.recv_pyobj()
  File "/opt/conda/lib/python3.9/site-packages/zmq/_future.py", line 356, in _chain
    loaded = load(buf)
_pickle.UnpicklingError: could not find MARK
Error 2:
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/manager.py", line 270, in start_router_process
    loop.run_until_complete(router.loop_for_netio_req())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/manager.py", line 221, in loop_for_netio_req
    recv_req = await self.recv_from_httpserver.recv_pyobj()
  File "/opt/conda/lib/python3.9/site-packages/zmq/_future.py", line 356, in _chain
    loaded = load(buf)
_pickle.UnpicklingError: invalid load key, 'n'.
Error 3:
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/manager.py", line 270, in start_router_process
    loop.run_until_complete(router.loop_for_netio_req())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.9/site-packages/lightllm-1.0.0-py3.9.egg/lightllm/server/router/manager.py", line 221, in loop_for_netio_req
    recv_req = await self.recv_from_httpserver.recv_pyobj()
  File "/opt/conda/lib/python3.9/site-packages/zmq/_future.py", line 356, in _chain
    loaded = load(buf)
_pickle.UnpicklingError: invalid load key, '"'
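
One way these UnpicklingError messages can arise (an illustration of the mechanism, not a confirmed diagnosis of this setup): recv_pyobj simply unpickles whatever bytes arrive on the router's ZMQ socket, so if some other process sends non-pickled data to that port (for example because of a port collision while running with --net=host), pickle fails with exactly this kind of "invalid load key" error. A minimal sketch with pyzmq, using an arbitrary example port:

```python
# Sketch: stray, non-pickled bytes on a ZMQ socket surface as
# _pickle.UnpicklingError inside recv_pyobj(). Port 5678 is arbitrary.
import zmq

ctx = zmq.Context()
pull = ctx.socket(zmq.PULL)
pull.bind("tcp://127.0.0.1:5678")

push = ctx.socket(zmq.PUSH)
push.connect("tcp://127.0.0.1:5678")

push.send_pyobj({"prompt": "ok"})      # a real pickled object round-trips fine
print(pull.recv_pyobj())

push.send(b"not a pickle")             # raw bytes from some other sender
try:
    pull.recv_pyobj()                  # -> UnpicklingError: invalid load key, 'n'
except Exception as e:
    print(type(e).__name__, e)
```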

The api_server cannot run:

[Screenshot: api_server fails to start]

PannenetsF commented 8 months ago

The server part has not been tested on this branch; you could check whether running the tests works.

lxnlxnlxnlxnlxn commented 8 months ago

So, for now, is it only possible to run the Static inference performance part of the README (on the kvoff branch)?

PannenetsF commented 8 months ago

Yes. Because serving performance is limited, we did not implement it further.

lxnlxnlxnlxnlxn commented 8 months ago

I downloaded the Chinese-LLaMA-2-1.3B model from the Hugging Face website and then ran test/model/test_llama2.py, which produced the following error:
root@gpu0:/lightllm/test/model# python test_llama2.py
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
F
======================================================================
FAIL: test_llama2_infer (__main__.TestLlama2Infer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lightllm/test/model/test_llama2.py", line 11, in test_llama2_infer
    test_model_inference(world_size=1,
  File "/lightllm/test/model/model_infer.py", line 16, in test_model_inference
    assert not ans_queue.empty()
AssertionError

----------------------------------------------------------------------
Ran 1 test in 9.372s

FAILED (failures=1)

This problem looks similar to another existing issue, but that issue does not provide a detailed solution.
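
For anyone hitting the same Triton assertion: the "Unexpected mma -> mma layout conversion" check fires inside the Triton compiler, and whether it triggers can depend on the GPU architecture and the installed Triton version. A first debugging step (my suggestion, not something confirmed in this thread) is to record both:

```python
# Quick environment check when a Triton compiler assertion fires: the GPU's
# compute capability and the Triton/PyTorch versions are usually the first
# things maintainers ask for.
import torch
import triton

print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # e.g. (7, 0) on V100
print("torch:", torch.__version__)
print("triton:", triton.__version__)
```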