OrionStarAI / Orion

Orion-14B is a family of models built around a 14B-parameter multilingual foundation LLM, together with a series of derived models: a chat model, a long-context model, quantized models, a RAG fine-tuned model, and an Agent fine-tuned model.
Apache License 2.0

During Request Chat GPU Memory Usage Sharply Increased #40

Open Janus-Xu opened 9 months ago

Janus-Xu commented 9 months ago

Model

Orion-14B-Chat-Int4

Description

When a conversation starts, GPU memory usage rises from the original 9 GB to 13 GB. Testing with 4 concurrent sessions, memory grows to 22 GB with no sign of stopping; it is only released some time after the sessions have completely ended. This makes it easy to trigger a crash.

Question

  1. Is there a way to prevent this rapid, linear growth in GPU memory usage?
  2. Is this caused by the caching policy being enabled? Can system memory (RAM) be used instead of GPU memory?
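Not an official answer, but since each in-flight chat session typically holds its own KV cache on the GPU until it finishes, one generic mitigation is to cap the number of concurrent `generate` calls with a semaphore so peak memory stays bounded. The sketch below is illustrative only: `MAX_CONCURRENT` and `chat` are hypothetical names, not Orion-14B settings, and the right limit depends on your GPU and context lengths.

```python
import threading

# Hypothetical mitigation sketch: bound the number of concurrent generation
# calls so that per-session KV caches cannot grow GPU memory without limit.
# MAX_CONCURRENT is an assumed tuning knob, not an Orion-14B configuration.
MAX_CONCURRENT = 2
_gpu_slots = threading.Semaphore(MAX_CONCURRENT)

def chat(generate_fn, prompt):
    """Run one chat request while holding a GPU slot.

    generate_fn stands in for the model's generation call (e.g. a wrapper
    around model.generate); extra requests block here instead of allocating
    more KV-cache memory on the GPU.
    """
    with _gpu_slots:  # blocks while MAX_CONCURRENT requests are active
        return generate_fn(prompt)
```

Requests beyond the cap queue up instead of crashing the process; latency for queued requests increases, but the GPU footprint stays roughly proportional to `MAX_CONCURRENT` rather than to the total number of clients.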