Hi, we have tried to run the speculative inference process on OPT-13B and Llama2-70B-chat, but ran into some issues. Specifically, for Llama2-70B-chat we obtained performance worse than vLLM, which seems abnormal. For OPT-13B, we hit a core dump error on several inference datasets.
Our execution process is as follows:
We first set up the environment by directly using the Docker image you provided (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and building from source following your instructions.
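For reference, we pull and launch the container along these lines (the mount path and extra flags below are illustrative, not our verbatim invocation):
# Illustrative only: pull the provided image and start it with all GPUs visible.
docker pull ghcr.io/flexflow/flexflow-cuda-11.8:latest
docker run --gpus all -it --rm \
    -v /path/to/our/workspace:/workspace \
    ghcr.io/flexflow/flexflow-cuda-11.8:latest /bin/bash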
We attempted to run FlexFlow inference with the following command, but encountered a core dump.
python -u run.py \
    --num_gpus 4 \
    --memory_per_gpu 78000 \
    --zero_copy_memory_per_node 200000 \
    --tensor_parallelism_degree 4 \
    --pipeline_parallelism_degree 1 \
    --max_requests_per_batch 8 \
    --max_seq_length 128 \
    --max_tokens_per_batch 1024 \
    --llm facebook/opt-13b \
    --ssm facebook/opt-125m \
    --prompts_file prompts/dialogue.json
Specifically, run.py is the script we wrote following the Quickstart guidance in the repo:
import flexflow.serve as ff
import argparse
import json
import os
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', default=2, type=int)
    parser.add_argument('--memory_per_gpu', default=38000, type=int)
    parser.add_argument('--zero_copy_memory_per_node', default=30000, type=int)
    parser.add_argument('--tensor_parallelism_degree', default=2, type=int)
    parser.add_argument('--pipeline_parallelism_degree', default=1, type=int)
    parser.add_argument('--llm', default='facebook/opt-125m', type=str)
    parser.add_argument('--ssm', default='facebook/opt-125m', type=str)
    parser.add_argument('--prompts_file', default='prompts/Alpaca.json', type=str)
    parser.add_argument('--max_requests_per_batch', default=16, type=int)
    parser.add_argument('--max_seq_length', default=128, type=int)
    parser.add_argument('--max_tokens_per_batch', default=128, type=int)
    args = parser.parse_args()

    os.environ['TRANSFORMERS_OFFLINE'] = '1'
    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree
            )
    # pdb.set_trace()

    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms = []
    if args.ssm != '':
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config,
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch
                )

    # Load the prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
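The script ends at the generate call above; a minimal sketch of how the tail could additionally shut down the server and print the generated text (assuming, as in the repo's Quickstart, that llm.stop_server() exists and that each GenerationResult exposes an output_text field):
    # Illustrative continuation of the script above (not part of our benchmark runs):
    # print each generated completion and shut the serving backend down cleanly.
    outputs = result if isinstance(result, list) else [result]
    for r in outputs:
        print(r.output_text)
    llm.stop_server()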
We ran the evaluation on 4 NVIDIA 80-GB A100 GPUs connected over NVLink, and recorded the total inference time to process all requests in the chatbot dataset with vLLM and SpecInfer respectively. We first tested the Llama2-70B-chat model with the llama-160M you provided as the SSM. The results are as follows:
Batch size   vLLM inference time (s)   SpecInfer inference time (s)
BS=1         1022.952185869            1550.611874
BS=2         529.516379833             800.023607
BS=4         275.700631380             408.75528
BS=8         144.448794603             236.409383
BS=16        76.175143718              133.675686
BS=32        42.816745996              95.503888
At every batch size, vLLM outperforms SpecInfer in this setting.
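For clarity, "total inference time" above is the end-to-end wall-clock time to process the whole prompt set. A minimal sketch of such a measurement around the generate call (the helper below is illustrative and not part of FlexFlow or our exact benchmark script):
import time

def time_generate(llm, prompts):
    # Illustrative helper: wall-clock time for one full pass over the prompt set.
    start = time.perf_counter()
    results = llm.generate(prompts=prompts)
    return results, time.perf_counter() - start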
Moreover, we have also run OPT-13B with OPT-125M as the SSM on several datasets, including the dialogue dataset, but hit the core dump error mentioned above.
All the datasets mentioned above are available at https://github.com/lethean287/dataset_0421. Any help with this issue is appreciated!