OrionStarAI / Orion

Orion-14B is a family of models built around a 14B-parameter multilingual foundation LLM, together with a series of derived models: a chat model, a long-context model, quantized models, a RAG fine-tuned model, and an Agent fine-tuned model.
Apache License 2.0

During Request Chat GPU Memory Usage Sharply Increased #40

Open Janus-Xu opened 9 months ago

Janus-Xu commented 9 months ago

Model

Orion-14B-Chat-Int4

Description

When a conversation starts, GPU memory usage rises from the original 9 GB to 13 GB. Testing with 4 concurrent sessions, memory grows to 22 GB with no sign of stopping; it is only released some time after the sessions have completely ended. This makes it easy to trigger a crash.

Question

  1. Is there a way to prevent this rapid, linear growth in GPU memory usage?
  2. Is this caused by the caching policy being enabled? Can system memory (RAM) be used instead of GPU memory?
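Not an official answer, but since each in-flight chat session typically holds its own KV cache on the GPU until it finishes, one generic mitigation is to cap the number of concurrent `generate` calls with a semaphore so peak memory stays bounded. The sketch below is illustrative only: `MAX_CONCURRENT` and `chat` are hypothetical names, not Orion-14B settings, and the right limit depends on your GPU and context lengths.

```python
import threading

# Hypothetical mitigation sketch: bound the number of concurrent generation
# calls so that per-session KV caches cannot grow GPU memory without limit.
# MAX_CONCURRENT is an assumed tuning knob, not an Orion-14B configuration.
MAX_CONCURRENT = 2
_gpu_slots = threading.Semaphore(MAX_CONCURRENT)

def chat(generate_fn, prompt):
    """Run one chat request while holding a GPU slot.

    generate_fn stands in for the model's generation call (e.g. a wrapper
    around model.generate); extra requests block here instead of allocating
    more KV-cache memory on the GPU.
    """
    with _gpu_slots:  # blocks while MAX_CONCURRENT requests are active
        return generate_fn(prompt)
```

Requests beyond the cap queue up instead of crashing the process; latency for queued requests increases, but the GPU footprint stays roughly proportional to `MAX_CONCURRENT` rather than to the total number of clients.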