LLMServe / DistServe

Disaggregated serving system for Large Language Models (LLMs).
Apache License 2.0

Decode Wrong Token #16

Open sitabulaixizawaluduo opened 1 week ago

sitabulaixizawaluduo commented 1 week ago

model: Llama-2-7b-hf

Steps:

1. `python3 converter.py --input "Llama-2-7b-hf/*.bin" --output /datasets/distserve/llama-7b --dtype float16 --model llama`
2. `python3 api_server/distserve_api_server.py --port 6902 --model /datasets/distserve/llama-7b --context-tensor-parallel-size 1 --decoding-tensor-parallel-size 1`
3. `python3 evaluation/2-benchmark-serving/0-prepare-dataset.py --dataset-path Sharegpt`
4. `python3 evaluation/2-benchmark-serving/2-benchmark-serving.py --port 6902`

The error message:

```
SwiftTransformer/src/csrc/model/gpt/gpt.cc:278 'cudaMemcpy(ith_context_req_req_index.ptr, ith_context_req_req_index_cpu, sizeof(int64_t) * batch_size, cudaMemcpyHostToDevice)': (700) an illegal memory access was encountered
```

PKUFlyingPig commented 1 week ago

For Llama 2, you do not need to download the weights yourself. Just launch the api_server with `--model meta-llama/Llama-2-7b-hf` (the name matches the official name on Hugging Face), and DistServe will download and convert the weights for you.
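For reference, the suggested launch would look like the following (a sketch: the port and tensor-parallel sizes are carried over from the reproduction steps above, and the model name is resolved against Hugging Face):

```shell
# Launch the DistServe API server with the Hugging Face model name instead of
# a local weight path; the weights are then downloaded and converted automatically.
python3 api_server/distserve_api_server.py \
    --port 6902 \
    --model meta-llama/Llama-2-7b-hf \
    --context-tensor-parallel-size 1 \
    --decoding-tensor-parallel-size 1
```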

sitabulaixizawaluduo commented 1 week ago

> For llama2, you do not need to download the weights yourself. Just launch the api_server with --model meta-llama/Llama-2-7b-hf (the name matches the official name on huggingface), distserve will download and convert the weights for you.

Is there a difference between the two methods? The Llama model I used was also downloaded from Hugging Face.

PKUFlyingPig commented 1 week ago

You may refer to the downloader code to see whether you missed any details during conversion.