Closed: guankaisi closed this issue 6 months ago.
The vllm version I'm using is '0.2.1'; I set it up about two months ago.
root@018222d5ca2c:~/hdd/scaling_sentemb# CUDA_VISIBLE_DEVICES=0 python run_array_decoder_vllm.py --lora /root/hdd/llm/prompteol-opt-2.7b/
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/libbitsandbytes_cuda121.so
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/cuda/compat/lib'), PosixPath('/usr/local/nvidia/lib')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('http'), PosixPath('//192.168.4.151'), PosixPath('10809')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/libbitsandbytes_cuda121.so...
INFO 01-18 07:06:56 llm_engine.py:72] Initializing an LLM engine with config: model='./temp', tokenizer='/root/hdd/llm/opt-2.7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 01-18 07:07:01 llm_engine.py:207] # GPU blocks: 7603, # CPU blocks: 819
This_passage_:_"*sent_0*"_means_in_one_word:"
Running task: STS17
INFO:mteb.evaluation.MTEB:
## Evaluating 1 tasks:
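By the way, the template line printed above is the PromptEOL prompt. How it is presumably filled in before tokenization (a minimal sketch; the variable names are assumptions, not the repo's actual code: underscores stand for spaces, and *sent_0* is the slot for the input sentence):

template = 'This_passage_:_"*sent_0*"_means_in_one_word:"'
sentence = "A man is playing a guitar."

# Restore spaces, then substitute the placeholder (which becomes
# "*sent 0*" after the underscore replacement).
prompt = template.replace('_', ' ').replace('*sent 0*', sentence)
print(prompt)  # This passage : "A man is playing a guitar." means in one word:"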
You'd best install 0.2.1. This vllm codebase is like a maze; I really have no interest in working through it again.
Thanks a lot! It runs fine now, but I still have one question: does this code actually give a speedup compared with regular huggingface inference?
outputs = self.llm.llm_engine.workers[0].model(  # call the OPT model held by vllm's worker directly
    input_ids=input_tokens,
    positions=input_positions,
    kv_caches=[(None, None)] * num_layers,  # no cached keys/values: one full forward pass
    input_metadata=input_metadata,
)
In the code above, kv_caches is set to [(None, None)] * num_layers, so every layer's kv_cache is None. Doesn't that make it equivalent to plain huggingface inference? In my experiments this code doesn't seem to make much use of vllm's paged-attention speedup.
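For reference, what I mean by plain huggingface inference is roughly the following (a minimal sketch; the model name and prompt are just placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-2.7b", torch_dtype=torch.float16
).cuda()

prompt = 'This passage : "A man is playing a guitar." means in one word:"'
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    # One full forward pass; nothing is generated afterwards, so there is
    # nothing for a KV cache to reuse (use_cache=False).
    out = model(**inputs, output_hidden_states=True, use_cache=False)

# PromptEOL-style embedding: hidden state of the final token.
embedding = out.hidden_states[-1][0, -1]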
In my own tests it's about 2x faster. Besides, computing an embedding is a single next-word prediction, meaning every input is different each time, so the KV cache can't be reused anyway, right?
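If you want to check the speed claim yourself, something like this works (a rough sketch; embed_with_vllm and embed_with_hf are hypothetical wrappers around the two code paths above):

import time
import torch

def benchmark(embed_fn, sentences, warmup=3, runs=10):
    # Average wall-clock seconds per call of an embedding function.
    for _ in range(warmup):
        embed_fn(sentences)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        embed_fn(sentences)
    torch.cuda.synchronize()
    return (time.time() - start) / runs

# print(benchmark(embed_with_vllm, sents), benchmark(embed_with_hf, sents))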
Used the Worker directly on the newer version as well.
In theory it gives the same results; I still need to tidy it up into mteb code.
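Roughly what I did on the newer version (the attribute path is from my reading of the refactored source, so treat it as an assumption and verify against your installed release):

from vllm import LLM

llm = LLM(model="facebook/opt-2.7b")

# ~0.2.x exposed the torch module as llm.llm_engine.workers[0].model;
# on the refactored releases (~0.3/0.4) the driver worker hangs off the
# model executor instead:
model = llm.llm_engine.model_executor.driver_worker.model_runner.model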
@guankaisi Done~ see the ipynb; for everything else you can just call the api directly.
This line of code throws an error.
Looking at the vllm source code, I found that worker.py in the current version has no model attribute. May I ask the author which version of vllm you are using?
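For completeness, the installed version can be checked with:

import vllm
print(vllm.__version__)  # the thread above suggests pinning to 0.2.1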