Note: currently, this doc is for onboarding new developers. A separate README for general audiences will be added in the future.
It's recommended to create a new folder before cloning the repository. The final file structure will look as follows:
<parent-folder>/
|--- lmcache-tests/
|--- LMCache/
|--- lmcache-vllm/
# Create conda environment
conda create -n lmcache python=3.10
conda activate lmcache
# Clone the GitHub repository
git clone git@github.com:LMCache/lmcache-tests.git
cd lmcache-tests
# Run the installation script
bash prepare_environment.sh
The following command runs the test test_lmcache_local_cpu defined in tests/tests.py and writes the output results to the output folder (outputs/test_lmcache_local_cpu.csv).
python3 main.py tests/tests.py -f test_lmcache_local_cpu -o outputs/
To process the result, please run
cd outputs/
python3 process_result.py
Then, a PDF file test_lmcache_local_cpu.pdf will be created.
You can also monitor the following files to check the status of the bootstrapped vLLM process.
For stderr:
tail -f /tmp/8000-stderr.log
For stdout:
tail -f /tmp/8000-stdout.log
main.py is the entrypoint to execute the test functions:
usage: main.py [-h] [-f FILTER] [-l] [-o OUTPUT_DIR] [-m MODEL] filepath
Execute all functions in a given Python file.
positional arguments:
filepath The Python file to execute functions from (include subfolders if any).
options:
-h, --help show this help message and exit
-f FILTER, --filter FILTER
Pattern to filter which functions to execute.
-l, --list List all functions in the module without executing them.
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
The directory to put the output file.
-m MODEL, --model MODEL
                        The vLLM model to use for all test functions.
Here are some examples:
# Run all the test functions defined in 'tests/tests.py' and save the output to 'outputs/'
python3 main.py tests/tests.py -o outputs/
# List the tests in 'tests/tests.py'
python3 main.py tests/tests.py -l
# Run some specific tests that match the given pattern (e.g., containing 'cachegen')
python3 main.py tests/tests.py -f cachegen
# Run all the test functions defined in 'tests/tests.py' with a Llama model
python3 main.py tests/tests.py -m "meta-llama/Llama-3.1-8B-Instruct"
In general, each test function should output its results as a CSV file, where the file name is the same as the function name but with a .csv suffix.
The CSV contains the following columns:
expr_id: the id of the experiment
timestamp: the timestamp of the request in the workload
engine_id: the id of the serving engine instance to which the request is sent
request_id: the id of the request in the workload
TTFT: the time-to-first-token of this request
throughput: the number of output tokens per second of this request
context_len: the context length of this request
query_len: the query length of this request
gpu_memory: the GPU memory used before this experiment is finished
Some example code showing how to parse the output CSV can be found in outputs/process_result.py.
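As an illustration only (not the actual contents of process_result.py), a minimal pandas-based sketch for inspecting such a file could look like the following; the column names come from the list above, and the file path is just an example:

import pandas as pd

# Load one test's output CSV (example path).
df = pd.read_csv("outputs/test_lmcache_local_cpu.csv")

# Compare serving engines: average TTFT and throughput per engine instance.
print(df.groupby("engine_id")[["TTFT", "throughput"]].mean())

# GPU memory recorded at the end of each experiment.
print(df.groupby("expr_id")["gpu_memory"].last())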
Request: A request is a single prompt containing some context and a user query.
Engine: Short for "serving engine". An engine is an LLM serving engine process that can process requests through the OpenAI API interface.
Workload: A workload is a list of requests at different timestamps. Usually, a workload is associated with only one engine.
Experiment: An experiment runs multiple engines simultaneously and sends the workloads generated from the same configuration to the serving engines.
Test case: A test case contains a list of experiments. The goal is to compare the performance of different engines under different workloads.
Test function: A test function wraps the test cases and saves the output CSV. In most cases, a single test function contains only one test case. Different test functions aim to test different functionalities of the system.
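To make the relationships among these terms concrete, here is a conceptual sketch using hypothetical dataclasses; it is for illustration only and does not reflect the classes actually defined in this repository:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Request:
    context: str   # the context part of the prompt
    query: str     # the user query part of the prompt

# A workload is a list of (timestamp, request) pairs, usually tied to one engine.
Workload = List[Tuple[float, Request]]

@dataclass
class Experiment:
    # One workload per serving engine, all generated from the same configuration.
    workloads_per_engine: List[Workload]

@dataclass
class TestCase:
    # A test case compares engines/workloads across a list of experiments.
    experiments: List[Experiment]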
Test case configuration:
Test case configuration controls the experiments to run. The configuration-related code can be found in config.py.
Currently, we support the following configurations:
During an experiment, the workload configuration is used to generate the workloads, and the vLLM configuration + LMCache configuration are used to start the engine.
Workload generator:
The workload generator takes in a workload configuration and generates the workload (i.e., a list of requests at different timestamps) as the output. The code for the workload generator can be found in workload.py.
By design, there could be multiple different kinds of workload generators for different use cases, such as chatbot, QA, or RAG.
The class Usecase is used to specify which workload generator to create at runtime. Currently, we only support a DUMMY use case, where the requests in the generated workload contain only dummy texts and questions.
The workload generator, once initialized with a configuration, provides a single method: generate(self) -> Workload.
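To illustrate that contract, here is a self-contained sketch of a minimal generator exposing a Usecase selector and a generate(self) -> Workload method; the class and the workload layout below are assumptions made for this sketch, not the code in workload.py:

import enum
from dataclasses import dataclass
from typing import List, Tuple

class Usecase(enum.Enum):
    DUMMY = "dummy"   # the only use case currently supported

# Assumed layout for this sketch: a workload is a list of (timestamp, prompt) pairs.
Workload = List[Tuple[float, str]]

@dataclass
class DummyWorkloadGenerator:
    usecase: Usecase = Usecase.DUMMY
    num_requests: int = 5
    interval: float = 1.0   # seconds between consecutive requests

    def generate(self) -> Workload:
        # Emit dummy contexts and questions at evenly spaced timestamps.
        return [(i * self.interval, f"dummy context {i}\nQuestion: dummy question {i}")
                for i in range(self.num_requests)]

print(DummyWorkloadGenerator().generate())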
Engine bootstrapper:
The engine bootstrapper pulls up the serving engine based on the configurations (vLLM configuration + LMCache configuration). Currently, we only support starting vLLM (with or without LMCache) from the terminal. We will support Docker-based engines in the future. The code can be found in bootstrapper.py.
The engine bootstrapper supports the following methods:
start(self): Non-blocking function to start the serving engine.
wait_until_ready(self, timeout=60) -> bool: Blocks until the serving engine is ready, dies, or the timeout is reached. Returns True if it is ready, otherwise False.
is_healthy(self) -> bool: Non-blocking function to check whether the engine is alive.
close(self): Blocking function to shut down the serving engine.
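For illustration, the sketch below implements the same four-method interface around a plain subprocess. It is a stand-in to show the expected call sequence, not the code in bootstrapper.py; in particular, the readiness probe here is a placeholder, whereas the real bootstrapper would check the engine itself (e.g., via its HTTP endpoint).

import subprocess
import time

class SubprocessBootstrapper:
    """Stand-in with the interface described above (start / wait_until_ready /
    is_healthy / close); not the repository's actual bootstrapper."""

    def __init__(self, cmd):
        self.cmd = cmd    # command line used to launch the serving engine
        self.proc = None

    def start(self):
        # Non-blocking: launch the engine process in the background.
        self.proc = subprocess.Popen(
            self.cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    def wait_until_ready(self, timeout=60) -> bool:
        # Block until the engine is ready, dead, or the timeout is reached.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if not self.is_healthy():
                return False          # the process died during startup
            if self._looks_ready():
                return True
            time.sleep(1)
        return False                  # timed out

    def _looks_ready(self) -> bool:
        # Placeholder readiness probe for this sketch; a real implementation
        # would query the engine's OpenAI-compatible endpoint instead.
        return True

    def is_healthy(self) -> bool:
        # Non-blocking liveness check on the process.
        return self.proc is not None and self.proc.poll() is None

    def close(self):
        # Blocking shutdown of the engine process.
        if self.proc is not None:
            self.proc.terminate()
            self.proc.wait()

engine = SubprocessBootstrapper(["sleep", "5"])   # placeholder command, not a real vLLM launch
engine.start()
print("ready:", engine.wait_until_ready(timeout=10))
print("healthy:", engine.is_healthy())
engine.close()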
Experiment runner:
The experiment runner takes in one workload config and N engine configs as input. It does the following things:
The code can be found in driver.py.
(WIP)