AI-Hypercomputer / JetStream

JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future; PRs welcome).
Apache License 2.0

Prerequisite work for supporting disaggregation: #68

Closed: zhihaoshan-google closed this 5 months ago

zhihaoshan-google commented 5 months ago
  1. Add a transfer thread to transfer the KV cache.
  2. In interleaved mode, prioritize prefill and improve HBM utilization.
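
To illustrate item 1, here is a minimal sketch of a dedicated transfer thread sitting between a prefill stage and a generate stage. It is not JetStream's actual implementation: `run_pipeline`, the queue names, and the list-based stand-in for the KV cache are all hypothetical, and the "transfer" is a plain Python copy rather than a device-to-device move.

```python
import queue
import threading


def run_pipeline(prompts):
    """Hypothetical three-stage pipeline: prefill -> transfer -> generate."""
    # Bounded queues apply backpressure so prefill cannot run far ahead.
    transfer_backlog = queue.Queue(maxsize=2)
    generate_backlog = queue.Queue(maxsize=2)
    results = []

    def prefill_worker():
        for request_id, prompt in enumerate(prompts):
            # Stand-in for a real KV cache produced by prefill (assumption).
            kv_cache = [ord(ch) for ch in prompt]
            transfer_backlog.put((request_id, kv_cache))
        transfer_backlog.put(None)  # sentinel: no more requests

    def transfer_worker():
        while True:
            item = transfer_backlog.get()
            if item is None:
                generate_backlog.put(None)  # forward the sentinel
                break
            request_id, kv_cache = item
            # Stand-in for moving the cache to the generate device (assumption).
            generate_backlog.put((request_id, list(kv_cache)))

    def generate_worker():
        while True:
            item = generate_backlog.get()
            if item is None:
                break
            request_id, kv_cache = item
            # Record the cache length in place of running decode steps.
            results.append((request_id, len(kv_cache)))

    threads = [
        threading.Thread(target=prefill_worker),
        threading.Thread(target=transfer_worker),
        threading.Thread(target=generate_worker),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Giving the transfer its own thread means the prefill worker can start on the next request while the previous cache is still being moved, which is the kind of overlap a disaggregated setup would rely on.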