Hi, we have tried to run the speculative inference process on OPT-13B and Llama2-70B-chat, but ran into some issues. Specifically, for Llama2-70B-chat we obtained performance worse than vLLM, which seems abnormal. For OPT-13B, we hit a core dump error on several inference datasets.
Our execution process is as follows:
We first set up the environment by directly using the Docker image you provided (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and building from source following your instructions.
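For reference, we pull and launch the container along these lines (the mount path and extra flags below are illustrative, not our verbatim invocation):
# Illustrative only: pull the provided image and start it with all GPUs visible.
docker pull ghcr.io/flexflow/flexflow-cuda-11.8:latest
docker run --gpus all -it --rm \
    -v /path/to/our/workspace:/workspace \
    ghcr.io/flexflow/flexflow-cuda-11.8:latest /bin/bash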
We attempted to run FlexFlow inference with the following command, but encountered a core dump.
python -u run.py \
    --num_gpus 4 \
    --memory_per_gpu 78000 \
    --zero_copy_memory_per_node 200000 \
    --tensor_parallelism_degree 4 \
    --pipeline_parallelism_degree 1 \
    --max_requests_per_batch 8 \
    --max_seq_length 128 \
    --max_tokens_per_batch 1024 \
    --llm facebook/opt-13b \
    --ssm facebook/opt-125m \
    --prompts_file prompts/dialogue.json
Specifically, run.py is the script we wrote following the Quickstart guidance in the repo:
import flexflow.serve as ff
import argparse
import json
import os
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', default=2, type=int)
    parser.add_argument('--memory_per_gpu', default=38000, type=int)
    parser.add_argument('--zero_copy_memory_per_node', default=30000, type=int)
    parser.add_argument('--tensor_parallelism_degree', default=2, type=int)
    parser.add_argument('--pipeline_parallelism_degree', default=1, type=int)
    parser.add_argument('--llm', default='facebook/opt-125m', type=str)
    parser.add_argument('--ssm', default='facebook/opt-125m', type=str)
    parser.add_argument('--prompts_file', default='prompts/Alpaca.json', type=str)
    parser.add_argument('--max_requests_per_batch', default=16, type=int)
    parser.add_argument('--max_seq_length', default=128, type=int)
    parser.add_argument('--max_tokens_per_batch', default=128, type=int)
    args = parser.parse_args()

    os.environ['TRANSFORMERS_OFFLINE'] = '1'
    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree
            )
    # pdb.set_trace()

    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms = []
    if args.ssm != '':
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config,
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch
                )

    # Load the prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
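The script ends at the generate call above; a minimal sketch of how the tail could additionally shut down the server and print the generated text (assuming, as in the repo's Quickstart, that llm.stop_server() exists and that each GenerationResult exposes an output_text field):
    # Illustrative continuation of the script above (not part of our benchmark runs):
    # print each generated completion and shut the serving backend down cleanly.
    outputs = result if isinstance(result, list) else [result]
    for r in outputs:
        print(r.output_text)
    llm.stop_server()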
We ran the evaluation on 4 NVIDIA 80-GB A100 GPUs connected over NVLink, and recorded the total inference time to process all requests in the chatbot dataset with vLLM and SpecInfer respectively. We first tested the Llama2-70B-chat model with the llama-160M you provided as the SSM. The results are as follows:
Batch size   vLLM inference time (s)   SpecInfer inference time (s)
BS=1         1022.952185869            1550.611874
BS=2         529.516379833             800.023607
BS=4         275.700631380             408.75528
BS=8         144.448794603             236.409383
BS=16        76.175143718              133.675686
BS=32        42.816745996              95.503888
At every batch size, vLLM outperforms SpecInfer in this setting.
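For clarity, "total inference time" above is the end-to-end wall-clock time to process the whole prompt set. A minimal sketch of such a measurement around the generate call (the helper below is illustrative and not part of FlexFlow or our exact benchmark script):
import time

def time_generate(llm, prompts):
    # Illustrative helper: wall-clock time for one full pass over the prompt set.
    start = time.perf_counter()
    results = llm.generate(prompts=prompts)
    return results, time.perf_counter() - start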
Moreover, we have also run OPT-13B with OPT-125M as the SSM on several datasets, including the dialogue dataset, but hit the core dump error mentioned above.
All the datasets mentioned above are available at https://github.com/lethean287/dataset_0421. Any help with this issue is appreciated!