InftyAI / llmaz

☸️ Easy, advanced inference platform for large language models on Kubernetes
Apache License 2.0

Benchmark toolkit support #66

Open kerthcet opened 1 month ago

kerthcet commented 1 month ago

What would you like to be added:

It would be great to support benchmarking LLM throughput and latency across different inference backends; a rough measurement sketch follows below.

Why is this needed:

Provide evidence for users when choosing among backends.

Completion requirements:

This enhancement requires design and implementation artifacts, which should be linked in subsequent comments.
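
To make the throughput/latency goal concrete, here is a minimal sketch of such a measurement loop in Go. The endpoint URL, model name, and request shape are placeholder assumptions (an OpenAI-compatible /v1/completions API), not part of llmaz; a real toolkit would also track time-to-first-token and output tokens per second.

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	// Placeholders: endpoint, model, and prompt are illustrative only.
	const (
		endpoint    = "http://llm-1.svc.local:8000/v1/completions"
		totalReqs   = 100
		concurrency = 10
	)
	body := []byte(`{"model":"llama3-405b","prompt":"Hello","max_tokens":128}`)

	var wg sync.WaitGroup
	latencies := make(chan time.Duration, totalReqs) // buffered: senders never block
	sem := make(chan struct{}, concurrency)          // caps in-flight requests

	start := time.Now()
	for i := 0; i < totalReqs; i++ {
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			t0 := time.Now()
			resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
			if err != nil {
				return // a real toolkit would record the failure
			}
			resp.Body.Close()
			latencies <- time.Since(t0)
		}()
	}
	wg.Wait()
	close(latencies)

	elapsed := time.Since(start)
	var sum time.Duration
	n := 0
	for l := range latencies {
		sum += l
		n++
	}
	if n > 0 {
		fmt.Printf("%d/%d requests in %s (%.1f req/s), mean latency %s\n",
			n, totalReqs, elapsed, float64(n)/elapsed.Seconds(), sum/time.Duration(n))
	}
}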

kerthcet commented 1 month ago

/kind feature

kerthcet commented 1 month ago

An example would look like:

metadata:
  name: llama3-405b-2024-07-01
  namespace: llm
spec:
  endpoint: llm-1.svc.local
  port: 8000
  performance:
    traffic-shape:
      req-rate: 10 qps
      model-type: instruction-tuned-llm/diffusion
      dataset: share-gpt
      input-length: 1024
      max-output-length: 1024
      total-prompts: 1000
      traffic-spike:
        burst: 10m
        req-rate: 20 qps
status:
  status: success
  results: gcs-bucket-1/llama3-405b-2024-07-01
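
The corresponding CRD could be sketched in Go roughly as follows, assuming kubebuilder conventions. The type and field names simply mirror the example (rendered in camelCase, per Kubernetes API style) and are not a committed API.

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TrafficSpike overlays a temporary burst on the steady request rate.
type TrafficSpike struct {
	// Burst is how long the spike lasts, e.g. "10m".
	Burst metav1.Duration `json:"burst"`
	// ReqRate is the request rate during the spike, in QPS.
	ReqRate int32 `json:"reqRate"`
}

// TrafficShape describes the load pattern for one benchmark run.
type TrafficShape struct {
	ReqRate         int32  `json:"reqRate"` // steady-state QPS
	ModelType       string `json:"modelType"`
	Dataset         string `json:"dataset"`
	InputLength     int32  `json:"inputLength"`
	MaxOutputLength int32  `json:"maxOutputLength"`
	TotalPrompts    int32  `json:"totalPrompts"`
	// TrafficSpike is optional; steady-rate runs omit it.
	TrafficSpike *TrafficSpike `json:"trafficSpike,omitempty"`
}

type Performance struct {
	TrafficShape TrafficShape `json:"trafficShape"`
}

// BenchmarkSpec points the toolkit at a serving endpoint and a load pattern.
type BenchmarkSpec struct {
	Endpoint    string      `json:"endpoint"`
	Port        int32       `json:"port"`
	Performance Performance `json:"performance"`
}

// BenchmarkStatus reports the outcome and where results were written.
type BenchmarkStatus struct {
	// Phase corresponds to the example's status field ("success").
	Phase string `json:"phase,omitempty"`
	// Results is the artifact location, e.g. a GCS bucket path.
	Results string `json:"results,omitempty"`
}

type Benchmark struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   BenchmarkSpec   `json:"spec,omitempty"`
	Status BenchmarkStatus `json:"status,omitempty"`
}

Making the spike a pointer with omitempty keeps steady-rate runs clean, and metav1.Duration parses the "10m" syntax for free.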

Inspired by https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ/edit

kerthcet commented 2 days ago

See also https://github.com/ray-project/llmperf and https://github.com/run-ai/llmperf. We may need a new repo for this.