/kind feature
An example would look like:
```
{
  metadata: {
    name: llama3-405b-2024-07-01,
    namespace: llm,
  },
  spec: {
    endpoint: llm-1.svc.local,
    port: 8000,
    performance: {
      traffic-shape: {
        req-rate: 10 qps,
        model-type: instruction-tuned-llm/diffusion,
        dataset: share-gpt,
        input-length: 1024,
        max-output-length: 1024,
        total-prompts: 1000,
        traffic-spike: {
          burst: 10m,
          req-rate: 20 qps,
        }
      }
    }
  },
  status: {
    status: success,
    results: gcs-bucket-1/llama3-405b-2024-07-01,
  }
}
```
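If this were implemented as a CRD, the spec above could map onto Go API types along these lines. This is only a sketch: the `Benchmark` kind, the package path, and every type and field name here are assumptions, not a settled API.

```go
// Hypothetical Go API types mirroring the example above; the Benchmark
// kind and all names are illustrative assumptions, not a settled API.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Benchmark describes one benchmarking run against an LLM serving endpoint.
type Benchmark struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   BenchmarkSpec   `json:"spec,omitempty"`
	Status BenchmarkStatus `json:"status,omitempty"`
}

type BenchmarkSpec struct {
	// Endpoint is the in-cluster address of the serving backend under test.
	Endpoint string `json:"endpoint"`
	Port     int32  `json:"port"`

	Performance PerformanceSpec `json:"performance"`
}

type PerformanceSpec struct {
	TrafficShape TrafficShape `json:"trafficShape"`
}

type TrafficShape struct {
	// ReqRate is the steady request rate in queries per second.
	ReqRate int32 `json:"reqRate"`
	// ModelType distinguishes e.g. instruction-tuned LLMs from diffusion models.
	ModelType string `json:"modelType"`
	// Dataset names the prompt dataset, e.g. share-gpt.
	Dataset         string `json:"dataset"`
	InputLength     int32  `json:"inputLength"`
	MaxOutputLength int32  `json:"maxOutputLength"`
	TotalPrompts    int32  `json:"totalPrompts"`

	// TrafficSpike optionally overlays a burst on top of the steady rate.
	TrafficSpike *TrafficSpike `json:"trafficSpike,omitempty"`
}

type TrafficSpike struct {
	// Burst is how long the spike lasts, e.g. "10m".
	Burst metav1.Duration `json:"burst"`
	// ReqRate is the elevated rate during the spike.
	ReqRate int32 `json:"reqRate"`
}

type BenchmarkStatus struct {
	// Status reports whether the run succeeded.
	Status string `json:"status,omitempty"`
	// Results points at the artifact location, e.g. a GCS bucket path.
	Results string `json:"results,omitempty"`
}
```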
Inspired by https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ/edit
See also https://github.com/ray-project/llmperf and https://github.com/run-ai/llmperf. We may need a new repo.
What would you like to be added:
It would be great to support benchmarking LLM throughput and latency across different serving backends.
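To make the idea concrete, here is a minimal sketch of what the runner's measurement loop might look like: it paces requests at a fixed qps against an OpenAI-style completions endpoint and reports throughput plus latency percentiles. The endpoint URL, the payload, and the ticker-based pacing are all illustrative assumptions, not a proposed implementation.

```go
// Minimal latency/throughput measurement sketch; the endpoint URL, payload,
// and ticker-based pacing are illustrative assumptions only.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sort"
	"time"
)

func main() {
	const (
		endpoint     = "http://llm-1.svc.local:8000/v1/completions" // placeholder
		reqRate      = 10  // steady rate in qps, as in the example spec
		totalPrompts = 100 // small sample for illustration
	)

	payload := []byte(`{"model":"llama3-405b","prompt":"Hello","max_tokens":1024}`)

	// Requests are issued sequentially here, so the achieved rate falls
	// below reqRate whenever a single request takes longer than the interval.
	ticker := time.NewTicker(time.Second / reqRate)
	defer ticker.Stop()

	latencies := make([]time.Duration, 0, totalPrompts)
	start := time.Now()

	for i := 0; i < totalPrompts; i++ {
		<-ticker.C // pace requests at reqRate qps
		t0 := time.Now()
		resp, err := http.Post(endpoint, "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		resp.Body.Close()
		latencies = append(latencies, time.Since(t0))
	}

	elapsed := time.Since(start)
	if len(latencies) == 0 {
		fmt.Println("no successful requests")
		return
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	p50 := latencies[len(latencies)/2]
	p99 := latencies[len(latencies)*99/100]
	fmt.Printf("throughput: %.2f req/s, p50: %v, p99: %v\n",
		float64(len(latencies))/elapsed.Seconds(), p50, p99)
}
```

A real runner would presumably issue requests concurrently and layer the traffic-spike window (e.g. 20 qps for 10m) on top of the steady rate, but that is left out of this sketch.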
Why is this needed:
Provide performance evidence for users.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.