Issue 1193749292, opened 12 months ago
How did you modify the serve.py file so that it skips loading? Thank you very much.
After the first execution, I crudely deleted lines 201–226 from serve.py.
https://github.com/flexflow/FlexFlow/issues/1236
A better way is to follow goliaro's approach, using llm = ff.LLM("facebook/opt-6.7b")
Thank you very much. I've downloaded the model locally, and it's a pain that it has to query Hugging Face every time.
I've downloaded the weights locally, but it still shows this every time.
Sorry for the long wait. I can't reproduce the situation in https://github.com/flexflow/FlexFlow/issues/1236 right now.
Judging from the output, this is the only place where that message is printed.
After deleting lines 201–226, it should not print. Are you modifying the serve.py
in the installed flexflow package?
e.g. /path/conda/lib/python3.11/site-packages/flexflow/serve/serve.py
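If you're not sure which copy of serve.py your environment actually imports, a one-liner like this (assuming flexflow is installed in the active environment) prints its path:

```shell
# Ask Python which serve.py it actually imports; edits must go to this file
python -c "import flexflow.serve.serve as m; print(m.__file__)"
```

If the printed path differs from the file you edited, that would explain why the message still appears.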
Hi everyone, this issue should be fixed by PR #1223. You can control the number of concurrent requests by passing max_requests_per_batch
as an input argument when compiling the LLM.
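For reference, a minimal sketch of what passing max_requests_per_batch at compile time might look like, loosely following the FlexFlow serve Python examples. The model name, init arguments, and the other compile parameters here are illustrative assumptions, not values confirmed in this thread:

```python
import flexflow.serve as ff

# Initialize the FlexFlow runtime (values are placeholders for a single-GPU setup)
ff.init(
    num_gpus=1,
    memory_per_gpu=14000,
    zero_copy_memory_per_node=10000,
    tensor_parallelism_degree=1,
    pipeline_parallelism_degree=1,
)

llm = ff.LLM("facebook/opt-6.7b")
generation_config = ff.GenerationConfig(do_sample=False)

# max_requests_per_batch caps how many requests are served concurrently,
# as described in PR #1223; the other limits are illustrative
llm.compile(
    generation_config,
    max_requests_per_batch=16,
    max_seq_length=256,
    max_tokens_per_batch=128,
)

llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
llm.stop_server()
```

This requires a GPU machine with FlexFlow installed, so treat it as a sketch rather than a drop-in script.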
I compared the performance of flexflow to vllm.
Environment setup:
FlexFlow: docker run --gpus all -dit --name flexflow -v /path/:/path/ --rm -e NVIDIA_VISIBLE_DEVICES=1 --runtime=nvidia --ipc=host --shm-size=8g ghcr.io/flexflow/flexflow-cuda-12.0:latest
vllm: started a container from the same ghcr.io/flexflow/flexflow-cuda-12.0:latest image, then ran pip install vllm inside it.
flexflow demo:
vllm demo: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py
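For comparison, the linked vLLM offline_inference.py example is roughly the following sketch (the model name here is an assumption chosen to match the FlexFlow run; the linked script uses its own defaults):

```python
from vllm import LLM, SamplingParams

# A few prompts for offline batch inference
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load the model once; vLLM batches and schedules the requests internally
llm = LLM(model="facebook/opt-6.7b")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```

Like the FlexFlow sketch above, this needs a GPU with vLLM installed to actually run.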
When FlexFlow runs for the first time, it downloads some weights. For the second run, I modified FlexFlow/python/flexflow/serve/serve.py so that FlexFlow skips this step, then measured the duration.
Here are my simple test results:
Can you help me figure out what the problem is?