Closed: guankaisi closed this issue 6 months ago.
The vllm version I'm using is '0.2.1'; I set it up about two months ago.
root@018222d5ca2c:~/hdd/scaling_sentemb# CUDA_VISIBLE_DEVICES=0 python run_array_decoder_vllm.py --lora /root/hdd/llm/prompteol-opt-2.7b/
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/libbitsandbytes_cuda121.so
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/cuda/compat/lib'), PosixPath('/usr/local/nvidia/lib')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('http'), PosixPath('//192.168.4.151'), PosixPath('10809')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/libbitsandbytes_cuda121.so...
INFO 01-18 07:06:56 llm_engine.py:72] Initializing an LLM engine with config: model='./temp', tokenizer='/root/hdd/llm/opt-2.7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 01-18 07:07:01 llm_engine.py:207] # GPU blocks: 7603, # CPU blocks: 819
This_passage_:_"*sent_0*"_means_in_one_word:"
Running task: STS17
INFO:mteb.evaluation.MTEB:
## Evaluating 1 tasks:
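By the way, the template line printed above is the PromptEOL prompt. How it is presumably filled in before tokenization (a minimal sketch; the variable names are assumptions, not the repo's actual code: underscores stand for spaces, and *sent_0* is the slot for the input sentence):

template = 'This_passage_:_"*sent_0*"_means_in_one_word:"'
sentence = "A man is playing a guitar."

# Restore spaces, then substitute the placeholder (which becomes
# "*sent 0*" after the underscore replacement).
prompt = template.replace('_', ' ').replace('*sent 0*', sentence)
print(prompt)  # This passage : "A man is playing a guitar." means in one word:"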
You'd best install 0.2.1. This vllm codebase is like a maze; I really have no interest in working through it again.
Thanks a lot! It runs fine now, but I still have one question: does this code actually give a speedup compared with regular huggingface inference?
outputs = self.llm.llm_engine.workers[0].model(  # call the OPT model held by vllm's worker directly
    input_ids=input_tokens,
    positions=input_positions,
    kv_caches=[(None, None)] * num_layers,  # no cached keys/values: one full forward pass
    input_metadata=input_metadata,
)
In the code above, kv_caches is set to [(None, None)] * num_layers, so every layer's kv_cache is None. Doesn't that make it equivalent to plain huggingface inference? In my experiments this code doesn't seem to make much use of vllm's paged-attention speedup.
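For reference, what I mean by plain huggingface inference is roughly the following (a minimal sketch; the model name and prompt are just placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-2.7b", torch_dtype=torch.float16
).cuda()

prompt = 'This passage : "A man is playing a guitar." means in one word:"'
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    # One full forward pass; nothing is generated afterwards, so there is
    # nothing for a KV cache to reuse (use_cache=False).
    out = model(**inputs, output_hidden_states=True, use_cache=False)

# PromptEOL-style embedding: hidden state of the final token.
embedding = out.hidden_states[-1][0, -1]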
In my own tests it's about 2x faster. Besides, computing an embedding is a single next-word prediction, meaning every input is different each time, so the KV cache can't be reused anyway, right?
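If you want to check the speed claim yourself, something like this works (a rough sketch; embed_with_vllm and embed_with_hf are hypothetical wrappers around the two code paths above):

import time
import torch

def benchmark(embed_fn, sentences, warmup=3, runs=10):
    # Average wall-clock seconds per call of an embedding function.
    for _ in range(warmup):
        embed_fn(sentences)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        embed_fn(sentences)
    torch.cuda.synchronize()
    return (time.time() - start) / runs

# print(benchmark(embed_with_vllm, sents), benchmark(embed_with_hf, sents))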
Used the Worker directly on the newer version as well.
In theory it gives the same results; I still need to tidy it up into mteb code.
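Roughly what I did on the newer version (the attribute path is from my reading of the refactored source, so treat it as an assumption and verify against your installed release):

from vllm import LLM

llm = LLM(model="facebook/opt-2.7b")

# ~0.2.x exposed the torch module as llm.llm_engine.workers[0].model;
# on the refactored releases (~0.3/0.4) the driver worker hangs off the
# model executor instead:
model = llm.llm_engine.model_executor.driver_worker.model_runner.model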
@guankaisi Done~ see the ipynb; for everything else you can just call the api directly.
This line of code throws an error.
Looking at the vllm source code, I found that worker.py in the current version has no model attribute. May I ask the author which version of vllm you are using?
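For completeness, the installed version can be checked with:

import vllm
print(vllm.__version__)  # the thread above suggests pinning to 0.2.1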