flexflow / FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

FlexFlow performance test #1240

Open · 1193749292 opened this issue 12 months ago

1193749292 commented 12 months ago

I compared the performance of FlexFlow to vLLM.

Environment setup:

FlexFlow container:

docker run --gpus all -dit --name flexflow -v /path/:/path/ --rm -e NVIDIA_VISIBLE_DEVICES=1 --runtime=nvidia --ipc=host --shm-size=8g ghcr.io/flexflow/flexflow-cuda-12.0:latest

vLLM: start a container from the same ghcr.io/flexflow/flexflow-cuda-12.0:latest image, then run pip install vllm inside it.

FlexFlow demo:

import flexflow.serve as ff
from flexflow.core import FFConfig  # imports added for completeness; FFConfig import location assumed

# Initialize FlexFlow: 4 GPUs, tensor parallelism 4, no pipeline parallelism
ff.init(
    num_gpus=4,
    memory_per_gpu=30000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=4,
    pipeline_parallelism_degree=1,
)
ffconfig = FFConfig()
ts_start = ffconfig.get_current_time()

# Target LLM plus a small SSM for speculative decoding
llm = ff.LLM("/path/facebook/opt-13b")
ssms = []
ssm = ff.SSM("/path/facebook/opt-125m")
ssms.append(ssm)

generation_config = ff.GenerationConfig(
    temperature=0.8, topp=0.95
)
for ssm in ssms:
    ssm.compile(generation_config)
llm.compile(generation_config)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
result = llm.generate(prompts)
ts_end = ffconfig.get_current_time()
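
For reference, a minimal sketch of how the duration can be computed from the two timestamps (assuming get_current_time returns microseconds, as in other FlexFlow Python examples):

# Assumption: get_current_time returns microseconds, as in FlexFlow's example scripts
elapsed_s = 1e-6 * (ts_end - ts_start)
print(f"Total generation time: {elapsed_s:.2f} s for {len(prompts)} prompts")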

vLLM demo: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py
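
For context, the linked example is essentially the following; the model path and tensor_parallel_size=4 are assumptions added here to mirror the FlexFlow configuration above:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Same sampling settings as the FlexFlow demo
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Model path and tensor_parallel_size are assumptions to match the FlexFlow run
llm = LLM(model="/path/facebook/opt-13b", tensor_parallel_size=4)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)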

On the first run, FlexFlow downloads and converts some weights. For the second run, I modified FlexFlow/python/flexflow/serve/serve.py so that FlexFlow skips this step, and then measured the duration.

Here are my simple test results: [results screenshot attached]

Can you help me figure out what the problem is?

zym1599 commented 11 months ago

How did you modify the serve.py file so that it skips loading? Thank you very much.

1193749292 commented 11 months ago

After the first execution, I crudely deleted lines 201–226 from serve.py.

https://github.com/flexflow/FlexFlow/issues/1236

A better way is to follow goliaro's approach in the issue above and pass the Hugging Face model id instead of a local path, e.g. llm = ff.LLM("facebook/opt-6.7b").
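
For illustration, a minimal sketch of that approach (the caching behavior described in the comments is an assumption about FlexFlow serve's default configuration, not something verified in this thread):

# Pass the Hugging Face model id instead of a local path.
# Assumption: FlexFlow serve downloads/converts the weights once and then
# reuses its local cache on later runs, so serve.py does not need to be edited.
llm = ff.LLM("facebook/opt-13b")
ssm = ff.SSM("facebook/opt-125m")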

zym1599 commented 11 months ago

Thank you very much. I've downloaded the model locally, and it's a pain that it has to fetch from Hugging Face every time.

zym1599 commented 11 months ago

[screenshot attached] I've downloaded the weights locally, but this message still shows up every time.

1193749292 commented 11 months ago

Sorry to keep you waiting so long. I can't reproduce this situation right now (see https://github.com/flexflow/FlexFlow/issues/1236).

Judging from the output, this is the only place where that message is printed (see the attached screenshot), and after deleting lines 201–226 it should no longer appear. Are you modifying serve.py in the installed flexflow package, e.g. /path/conda/lib/python3.11/site-packages/flexflow/serve/serve.py?
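
A quick way to confirm which copy of serve.py Python actually imports (a small sketch added for illustration):

# Print the path of the serve module that gets imported; any edits need to be
# made to this file (or to the source tree it was installed from).
import flexflow.serve.serve as ff_serve
print(ff_serve.__file__)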

jiazhihao commented 10 months ago

Hi everyone, this issue should have been fixed by PR #1223. You can control the number of concurrent requests by passing max_requests_per_batch as an input argument when compiling the LLM.
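
For example, a minimal sketch of such a compile call (the extra keyword arguments and their values are assumptions based on the FlexFlow serve examples, not part of this comment):

# Assumption: max_seq_length and max_tokens_per_batch values are illustrative;
# max_requests_per_batch caps the number of concurrent requests per batch.
llm.compile(
    generation_config,
    max_requests_per_batch=16,
    max_seq_length=256,
    max_tokens_per_batch=128,
)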