flexflow / FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

Issue with FlexFlow LLM Compilation and Generation #1444

Open QAZWSX0827 opened 1 month ago

QAZWSX0827 commented 1 month ago

Hello,

I am encountering an issue while testing FlexFlow's LLM module. Below is the code I am using:

```python
import flexflow.serve as ff
import time

ff.init(
    num_gpus=1,
    memory_per_gpu=22000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=1,
    pipeline_parallelism_degree=1
)

# llm = ff.LLM("/data/lich/llama-7b-hf")
llm = ff.LLM("/home/wutong/meta-llama/Llama-2-7b-hf")
ssms = []

# Specify a list of SSMs
# test without ssms
# ssm = ff.SSM("/data/lich/llama-160m")
ssm = ff.SSM("/home/wutong/JackFram/llama-160m")
ssms.append(ssm)

generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

for ssm in ssms:
    ssm.compile(generation_config)

llm.compile(generation_config, ssms=ssms)

# test data comes from WebQA
prompts = [
    "what is the name of justin bieber brother?",
    "what character did natalie portman play in star wars?",
    "what state does selena gomez?",
    "what country is the grand bahama island in?",
    "what kind of money to take to bahamas?",
    "what character did john noble play in lord of the rings?",
    "who does joakim noah play for?",
    "where are the nfl redskins from?",
    "where did saki live?"
]

start_time = time.time()
result = llm.generate(prompts)
print("--- %s seconds ---" % (time.time() - start_time))
```

When I run this script, I encounter the following problem:

```
[0 - 7ff3727884c0] 0.372910 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7ff3727884c0] 0.372966 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7ff3727884c0] 0.372980 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7ff3727884c0] 0.372991 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7ff3727884c0] 0.373003 {3}{Mapper}: Enabled Control Replication Optimizations.
workSpaceSize (128 MB)
/home/wutong/anaconda3/envs/SpecInfer/lib/python3.8/site-packages/torch/__init__.py:749: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:431.)
  _C._set_default_tensor_type(t)
Creating directory /home/wutong/jackfram/llama-160m/half-precision (if it doesn't exist)...
Loading '/home/wutong/JackFram/llama-160m' model weights from the cache...
Loading tokenizer...
Loading '/home/wutong/JackFram/llama-160m' tokenizer from the cache...
python: /tmp/pip-install-z0y94xhd/flexflow_8e08f707683c4cf9af720b1434f7fc8a/src/runtime/request_manager.cc:61: void FlexFlow::RequestManager::set_max_requests_per_batch(int): Assertion `max_requests_per_batch == -1 || max_requests_per_batch == max_num_requests' failed.
Aborted (core dumped)
```
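For context, the assertion fires in `RequestManager::set_max_requests_per_batch`, which only accepts a value that either has never been set (`-1`) or matches the `max_num_requests` already configured, so it looks like the SSM and LLM compilation paths end up requesting different batch limits. Below is a minimal sketch of what I understand could make the limits consistent by passing them explicitly to both compile calls; the keyword arguments (`max_requests_per_batch`, `max_seq_length`, `max_tokens_per_batch`) and the values are my assumption based on the FlexFlow serve examples, not a confirmed fix.

```python
# Hedged sketch: pass the same explicit batch limits to every compile call so the
# shared RequestManager only ever sees one consistent max_requests_per_batch value.
# Keyword names and values are assumed from the FlexFlow serve examples.
for ssm in ssms:
    ssm.compile(
        generation_config,
        max_requests_per_batch=16,
        max_seq_length=256,
        max_tokens_per_batch=128,
    )

llm.compile(
    generation_config,
    max_requests_per_batch=16,
    max_seq_length=256,
    max_tokens_per_batch=128,
    ssms=ssms,
)
```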

Can you tell me what the problem might be? Any help or suggestions would be greatly appreciated.

QAZWSX0827 commented 1 month ago

It seems there was a formatting error in my previous comment, so the code did not render correctly. I have attached it as demo.txt.