flexflow / FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

Performance Issue #1377

Open lethean287 opened 6 months ago

lethean287 commented 6 months ago

Hi, we have tried to run the speculative inference process on OPT-13B and Llama2-70B-chat, but ran into some issues. Specifically, for Llama2-70B-chat we obtained performance worse than vLLM, which seems abnormal. For OPT-13B, we hit a core dump error on several inference datasets. Our execution process is as follows: we first set up the environment by directly using the Docker image you provided (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and built from source following your instructions.

We then attempted to run FlexFlow inference with the following command, but encountered a core dump:

python -u run.py \
    --num_gpus 4 \
    --memory_per_gpu 78000 \
    --zero_copy_memory_per_node 200000 \
    --tensor_parallelism_degree 4 \
    --pipeline_parallelism_degree 1 \
    --max_requests_per_batch 8 \
    --max_seq_length 128 \
    --max_tokens_per_batch 1024 \
    --llm facebook/opt-13b \
    --ssm facebook/opt-125m \
    --prompts_file prompts/dialogue.json

Specifically, run.py is the script we wrote following the Quickstart guide in the repo:

import flexflow.serve as ff
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', default=2, type=int)
    parser.add_argument('--memory_per_gpu', default=38000, type=int)
    parser.add_argument('--zero_copy_memory_per_node', default=30000, type=int)
    parser.add_argument('--tensor_parallelism_degree', default=2, type=int)
    parser.add_argument('--pipeline_parallelism_degree', default=1, type=int)
    parser.add_argument('--llm', default='facebook/opt-125m', type=str)
    parser.add_argument('--ssm', default='facebook/opt-125m', type=str)
    parser.add_argument('--prompts_file', default='prompts/Alpaca.json', type=str)
    parser.add_argument('--max_requests_per_batch', default=16, type=int)
    parser.add_argument('--max_seq_length', default=128, type=int)
    parser.add_argument('--max_tokens_per_batch', default=128, type=int)
    args = parser.parse_args()

    os.environ['TRANSFORMERS_OFFLINE'] = '1'

    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree
        )

    #pdb.set_trace()

    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms=[]
    if args.ssm != '':
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config, 
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch
               )

    # load prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
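
For context, run.py assumes the prompts file passed via --prompts_file is a plain JSON list of prompt strings, which is what json.load() hands to llm.generate(prompts=...). A minimal illustrative example of such a file (made-up prompts, not one of the actual datasets) could be produced like this:

import json

# Illustrative only: two made-up prompts written in the format run.py expects,
# i.e. a plain JSON list of prompt strings.
example_prompts = [
    "Give me three tips for staying productive while working from home.",
    "Summarize the plot of Romeo and Juliet in two sentences.",
]

with open("prompts/example.json", "w") as f:
    json.dump(example_prompts, f, indent=2)
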
We run the evaluation on 4 NVIDIA 80-GB A100 GPUs connected over NVLink and record the total inference time to process all requests in the chatbot dataset using vLLM and SpecInfer respectively (a rough sketch of how we measure is included after the table). We first test the Llama2-70B-chat model with the llama-160M you provided as the SSM. The results are as follows:

Inference time    vLLM (s)          SpecInfer (s)
BS=1              1022.952185869    1550.611874
BS=2              529.516379833     800.023607
BS=4              275.700631380     408.75528
BS=8              144.448794603     236.409383
BS=16             76.175143718      133.675686
BS=32             42.816745996      95.503888
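
For reference, this is roughly how we record the total inference time: a minimal sketch that wraps wall-clock timing around the generate call only, after the server is started and the weights are loaded (the llm and prompts objects here are the ones constructed in run.py above; we record the same total wall-clock time for the vLLM runs).

import time

def timed_generate(llm, prompts):
    # Wall-clock timing around a single llm.generate() call over the whole
    # prompt set; server startup and weight loading are excluded.
    start = time.perf_counter()
    results = llm.generate(prompts=prompts)
    elapsed = time.perf_counter() - start
    print(f"Processed {len(prompts)} requests in {elapsed:.3f} s")
    return results, elapsed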

So it seems that the performance of vLLM is better than SpecInfer's. Moreover, we have also run OPT-13B with OPT-125M as the SSM on several datasets, including the dialogue dataset, but hit the core dump error shown in the attached core_dump screenshot.

All the datasets mentioned above are available here: https://github.com/lethean287/dataset_0421. Any help in resolving this issue would be appreciated!

QAZWSX0827 commented 3 months ago

Hello, have you successfully reproduced the results of SpecInfer?