/kind feature
An example would look like:
```
{
  metadata: {
    name: llama3-405b-2024-07-01,
    namespace: llm,
  },
  spec: {
    endpoint: llm-1.svc.local,
    port: 8000,
    performance: {
      traffic-shape: {
        req-rate: 10 qps,
        model-type: instruction-tuned-llm/diffusion,
        dataset: share-gpt,
        input-length: 1024,
        max-output-length: 1024,
        total-prompts: 1000,
        traffic-spike: {
          burst: 10m,
          req-rate: 20 qps,
        }
      }
    }
  },
  status: {
    status: success,
    results: gcs-bucket-1/llama3-405b-2024-07-01,
  }
}
```
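If this were implemented as a CRD, the spec above could map onto Go API types along these lines. This is only a sketch: the `Benchmark` kind, the package path, and every type and field name here are assumptions, not a settled API.

```go
// Hypothetical Go API types mirroring the example above; the Benchmark
// kind and all names are illustrative assumptions, not a settled API.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Benchmark describes one benchmarking run against an LLM serving endpoint.
type Benchmark struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   BenchmarkSpec   `json:"spec,omitempty"`
	Status BenchmarkStatus `json:"status,omitempty"`
}

type BenchmarkSpec struct {
	// Endpoint is the in-cluster address of the serving backend under test.
	Endpoint string `json:"endpoint"`
	Port     int32  `json:"port"`

	Performance PerformanceSpec `json:"performance"`
}

type PerformanceSpec struct {
	TrafficShape TrafficShape `json:"trafficShape"`
}

type TrafficShape struct {
	// ReqRate is the steady request rate in queries per second.
	ReqRate int32 `json:"reqRate"`
	// ModelType distinguishes e.g. instruction-tuned LLMs from diffusion models.
	ModelType string `json:"modelType"`
	// Dataset names the prompt dataset, e.g. share-gpt.
	Dataset         string `json:"dataset"`
	InputLength     int32  `json:"inputLength"`
	MaxOutputLength int32  `json:"maxOutputLength"`
	TotalPrompts    int32  `json:"totalPrompts"`

	// TrafficSpike optionally overlays a burst on top of the steady rate.
	TrafficSpike *TrafficSpike `json:"trafficSpike,omitempty"`
}

type TrafficSpike struct {
	// Burst is how long the spike lasts, e.g. "10m".
	Burst metav1.Duration `json:"burst"`
	// ReqRate is the elevated rate during the spike.
	ReqRate int32 `json:"reqRate"`
}

type BenchmarkStatus struct {
	// Status reports whether the run succeeded.
	Status string `json:"status,omitempty"`
	// Results points at the artifact location, e.g. a GCS bucket path.
	Results string `json:"results,omitempty"`
}
```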
Inspired by https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ/edit
See also https://github.com/ray-project/llmperf and https://github.com/run-ai/llmperf. We may need a new repo.
What would you like to be added:
It would be great to support benchmarking LLM throughput and latency across different serving backends.
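To make the idea concrete, here is a minimal sketch of what the runner's measurement loop might look like: it paces requests at a fixed qps against an OpenAI-style completions endpoint and reports throughput plus latency percentiles. The endpoint URL, the payload, and the ticker-based pacing are all illustrative assumptions, not a proposed implementation.

```go
// Minimal latency/throughput measurement sketch; the endpoint URL, payload,
// and ticker-based pacing are illustrative assumptions only.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sort"
	"time"
)

func main() {
	const (
		endpoint     = "http://llm-1.svc.local:8000/v1/completions" // placeholder
		reqRate      = 10  // steady rate in qps, as in the example spec
		totalPrompts = 100 // small sample for illustration
	)

	payload := []byte(`{"model":"llama3-405b","prompt":"Hello","max_tokens":1024}`)

	// Requests are issued sequentially here, so the achieved rate falls
	// below reqRate whenever a single request takes longer than the interval.
	ticker := time.NewTicker(time.Second / reqRate)
	defer ticker.Stop()

	latencies := make([]time.Duration, 0, totalPrompts)
	start := time.Now()

	for i := 0; i < totalPrompts; i++ {
		<-ticker.C // pace requests at reqRate qps
		t0 := time.Now()
		resp, err := http.Post(endpoint, "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		resp.Body.Close()
		latencies = append(latencies, time.Since(t0))
	}

	elapsed := time.Since(start)
	if len(latencies) == 0 {
		fmt.Println("no successful requests")
		return
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	p50 := latencies[len(latencies)/2]
	p99 := latencies[len(latencies)*99/100]
	fmt.Printf("throughput: %.2f req/s, p50: %v, p99: %v\n",
		float64(len(latencies))/elapsed.Seconds(), p50, p99)
}
```

A real runner would presumably issue requests concurrently and layer the traffic-spike window (e.g. 20 qps for 10m) on top of the steady rate, but that is left out of this sketch.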
Why is this needed:
Provide performance evidence for users.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.