Orion-14B is a family of models that includes a 14B-parameter multilingual foundation LLM and a series of derivative models: a chat model, a long-context model, quantized models, a RAG fine-tuned model, and an agent fine-tuned model.
Apache License 2.0
During Request Chat GPU Memory Usage Sharply Increased #40
When a conversation starts, GPU memory usage rises from the original 9 GB to 13 GB.
Testing with 4 concurrent sessions, memory grows to 22 GB with no sign of stopping; it is only released some time after the sessions have fully ended.
This easily triggers a crash.
Question
Is there a way to prevent this rapid, roughly linear growth of GPU memory usage?
Is it caused by the caching policy? Could CPU memory be used instead of GPU memory?
Model
Orion-14B-Chat-Int4