flexflow / FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

Issue with FlexFlow LLM Compilation and Generation #1444

Open QAZWSX0827 opened 1 month ago

QAZWSX0827 commented 1 month ago

Hello,

I am encountering an issue while testing FlexFlow's LLM module. Below is the code I am using:

```python
import flexflow.serve as ff
import time

ff.init(
    num_gpus=1,
    memory_per_gpu=22000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=1,
    pipeline_parallelism_degree=1
)

# llm = ff.LLM("/data/lich/llama-7b-hf")
llm = ff.LLM("/home/wutong/meta-llama/Llama-2-7b-hf")
ssms = []

# Specify a list of SSMs
# test without ssms
# ssm = ff.SSM("/data/lich/llama-160m")
ssm = ff.SSM("/home/wutong/JackFram/llama-160m")
ssms.append(ssm)

generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

for ssm in ssms:
    ssm.compile(generation_config)

llm.compile(generation_config, ssms=ssms)

# test data comes from WebQA
prompts = [
    "what is the name of justin bieber brother?",
    "what character did natalie portman play in star wars?",
    "what state does selena gomez?",
    "what country is the grand bahama island in?",
    "what kind of money to take to bahamas?",
    "what character did john noble play in lord of the rings?",
    "who does joakim noah play for?",
    "where are the nfl redskins from?",
    "where did saki live?"
]

start_time = time.time()
result = llm.generate(prompts)
print("--- %s seconds ---" % (time.time() - start_time))
```

When I run this script, I encounter the following problem:

```
[0 - 7ff3727884c0] 0.372910 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7ff3727884c0] 0.372966 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7ff3727884c0] 0.372980 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7ff3727884c0] 0.372991 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7ff3727884c0] 0.373003 {3}{Mapper}: Enabled Control Replication Optimizations.
workSpaceSize (128 MB)
/home/wutong/anaconda3/envs/SpecInfer/lib/python3.8/site-packages/torch/__init__.py:749: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:431.)
  _C._set_default_tensor_type(t)
Creating directory /home/wutong/jackfram/llama-160m/half-precision (if it doesn't exist)...
Loading '/home/wutong/JackFram/llama-160m' model weights from the cache...
Loading tokenizer...
Loading '/home/wutong/JackFram/llama-160m' tokenizer from the cache...
python: /tmp/pip-install-z0y94xhd/flexflow_8e08f707683c4cf9af720b1434f7fc8a/src/runtime/request_manager.cc:61: void FlexFlow::RequestManager::set_max_requests_per_batch(int): Assertion `max_requests_per_batch == -1 || max_requests_per_batch == max_num_requests' failed.
Aborted (core dumped)
```
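For context, the assertion fires in `RequestManager::set_max_requests_per_batch`, which only accepts a value that either has never been set (`-1`) or matches the `max_num_requests` already configured, so it looks like the SSM and LLM compilation paths end up requesting different batch limits. Below is a minimal sketch of what I understand could make the limits consistent by passing them explicitly to both compile calls; the keyword arguments (`max_requests_per_batch`, `max_seq_length`, `max_tokens_per_batch`) and the values are my assumption based on the FlexFlow serve examples, not a confirmed fix.

```python
# Hedged sketch: pass the same explicit batch limits to every compile call so the
# shared RequestManager only ever sees one consistent max_requests_per_batch value.
# Keyword names and values are assumed from the FlexFlow serve examples.
for ssm in ssms:
    ssm.compile(
        generation_config,
        max_requests_per_batch=16,
        max_seq_length=256,
        max_tokens_per_batch=128,
    )

llm.compile(
    generation_config,
    max_requests_per_batch=16,
    max_seq_length=256,
    max_tokens_per_batch=128,
    ssms=ssms,
)
```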

Can you tell me what the problem might be? Any help or suggestions would be greatly appreciated.

QAZWSX0827 commented 1 month ago

It seems there was a formatting error in my previous comment, so the code did not render correctly. I have attached it as demo.txt.