How much GPU memory does the latest 200K model need to accept a 200K-token input?

InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

https://lmdeploy.readthedocs.io/en/latest/

Apache License 2.0


Closed (zlh1992 closed this issue 8 months ago)

zlh1992 commented 8 months ago

As in the title.

seanxuu commented 8 months ago

Same question here.

lvhan028 commented 8 months ago

It depends on how much memory the k/v blocks take up.

In lmdeploy's turbomind engine, the size of one k/v block (in bytes) is computed as:

cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)

cache_block_seq_len defaults to 128. Taking internlm2-7b as an example, the formula above gives a block size of


128 * 32 * 8 * 128 * 2 * 2 = 16MB

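In other words, a sequence length of 128 tokens takes 16 MB, so 1 GB of memory can hold a sequence of 1 GB / 16 MB * 128 = 8192 tokens.

For 200K tokens, the k/v cache therefore takes 200K / 8K = 25 GB. This covers the k/v cache only; the model weights need memory on top of that.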
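The arithmetic above can be reproduced with a short Python sketch. The function names are made up for illustration and are not part of lmdeploy. The default values are the internlm2-7b numbers from the example, read in the same order as the formula (32 layers, 8 k/v heads, 128 per head, 2 bytes per element), and units are binary (1 MB = 1024 * 1024 bytes).

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def kv_block_bytes(cache_block_seq_len=128, num_layer=32, kv_head_num=8,
                   size_per_head=128, kv_dtype_bytes=2):
    """Size of one k/v block in bytes, following the formula quoted above.

    The factor 2 accounts for storing both keys and values.
    """
    return (cache_block_seq_len * num_layer * kv_head_num
            * size_per_head * 2 * kv_dtype_bytes)


def kv_cache_bytes(seq_len, cache_block_seq_len=128, **model_kwargs):
    """k/v cache bytes for one sequence, rounded up to whole blocks."""
    num_blocks = -(-seq_len // cache_block_seq_len)  # ceiling division
    return num_blocks * kv_block_bytes(cache_block_seq_len, **model_kwargs)


if __name__ == "__main__":
    print(kv_block_bytes() / MiB)            # 16.0 (MB per 128-token block)
    print(kv_cache_bytes(200 * 1024) / GiB)  # 25.0 (GB for a 200K sequence)
```

Plugging in another model's layer count, k/v head count, and head size gives the corresponding estimate. The result is for the k/v cache of a single sequence only.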

lvhan028 commented 8 months ago

LMDeploy has a parameter, cache_max_entry_count, which sets the fraction of a GPU's total memory that the k/v cache occupies. The default is 0.5.

If you run into OOM, lower this value.
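To see how this setting trades off against sequence length, here is a rough Python sketch. It only models the description in this thread (k/v cache budget = total GPU memory * cache_max_entry_count) and reuses the internlm2-7b per-token cost from the earlier comment. The helper name max_cached_tokens is hypothetical and not an lmdeploy API, and the real allocator may behave differently.

```python
GiB = 1024 ** 3

# Per-token k/v cost for the internlm2-7b example in this thread:
# num_layer * kv_head_num * size_per_head * 2 (k and v) * 2 bytes = 128 KiB.
BYTES_PER_TOKEN = 32 * 8 * 128 * 2 * 2


def max_cached_tokens(total_gpu_bytes, cache_max_entry_count=0.5,
                      cache_block_seq_len=128):
    """Tokens that fit in the k/v cache budget, counted in whole blocks.

    Assumption: budget = total GPU memory * cache_max_entry_count,
    as described in the comment above.
    """
    budget = int(total_gpu_bytes * cache_max_entry_count)
    block_bytes = BYTES_PER_TOKEN * cache_block_seq_len
    return (budget // block_bytes) * cache_block_seq_len


if __name__ == "__main__":
    for fraction in (0.5, 0.25):
        print(fraction, max_cached_tokens(80 * GiB, fraction))
```

Under these assumptions, an 80 GB card at the default 0.5 has room for a 200K-token sequence, while a 24 GB card at 0.5 does not. Lowering the value frees memory for the weights and activations but shrinks the longest sequence the cache can hold.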