-
Is streaming output supported? And are there any benchmark results for time to first token (TTFT) and time per output token (TPOT)? Thanks.
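For reference, the two metrics can be measured against any streaming endpoint that yields tokens incrementally. A minimal sketch in plain Python (the `fake_stream` generator below is a stand-in for a real streaming API, not part of any actual client):

```python
import time

def measure_streaming_latency(token_stream):
    """Measure time to first token (TTFT) and mean time per output
    token (TPOT) from any iterable that yields tokens as they arrive."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # latency until the first token arrives
        tokens.append(token)
    total = time.perf_counter() - start
    # TPOT is usually reported over the tokens after the first one.
    tpot = (total - ttft) / max(len(tokens) - 1, 1) if ttft is not None else None
    return ttft, tpot, tokens

def fake_stream():
    # Stand-in for a real streaming API: ~50 ms "prefill", then ~10 ms/token.
    time.sleep(0.05)
    for t in ["Hello", ",", " world"]:
        yield t
        time.sleep(0.01)

ttft, tpot, tokens = measure_streaming_latency(fake_stream())
print(f"TTFT={ttft*1000:.1f} ms, TPOT={tpot*1000:.1f} ms, tokens={tokens}")
```

Swapping `fake_stream()` for a real streaming response gives comparable numbers across serving backends.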
-
-
Reproducing fine-tuning by following the official tutorial, I get grad_norm: nan during training.
The configuration is as follows:
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'
use_varlen_attn = False
# Data
data_path = 'data'
prompt_template = PROMPT_TEMPLAT…
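A `grad_norm` of NaN usually means the gradients overflowed (common with fp16) or the loss itself became NaN. As a plain-Python illustration of the check behind the logged value (a sketch, not xtuner's actual trainer code), note how a single inf/NaN gradient entry poisons the global norm:

```python
import math

def global_grad_norm(grads):
    """Global L2 norm over a list of flat gradient lists, mirroring the
    grad_norm value a trainer logs each step."""
    sq = 0.0
    for g in grads:
        for v in g:
            sq += v * v
    return math.sqrt(sq)

def check_step(grads):
    norm = global_grad_norm(grads)
    healthy = not (math.isnan(norm) or math.isinf(norm))
    # Typical mitigations when unhealthy: lower the learning rate,
    # enable/tighten gradient clipping, or switch fp16 -> bf16.
    return norm, healthy

# One overflowing value anywhere makes the whole logged norm non-finite:
print(check_step([[0.1, -0.2], [1e200, 1e200]]))
```

So the NaN you see is a global symptom; the first step is finding which loss term or layer produces the first non-finite value.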
-
### Describe the feature
An earnest plea for a 4B model. Reasons:
1. InternLM2-1.8B-Chat's capability still needs improvement; after quantization the quality is not good enough to handle Chinese-to-Japanese translation.
2. Qwen-4B-Chat-Int4 performs well at translating Chinese into Japanese for the Holo translation task. InternLM2-Chat-7B can also do it, but its GPU memory consumption is too high.
3. InternLM2 lacks a mid-weight model, whose memory footprint would be relatively…
-
### Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
### Describe the bug
lmdeploy serve api_server --s…
-
For officially released LLMs, it is suggested that AWQ and GPTQ quantized versions be attached in the future. Doing so costs almost nothing, yet would help many potential users who lack GPUs. It would also make the models more convenient to use, since officially released quantized versions are generally regarded as more authoritative.
-
Two cards: the first with 12G of 32G in use, the second with 1G of 32G in use. The model is internlm2-chat-7b.
- Loading on the second card alone uses about 30G of memory and starts normally;
- Since that card is nearly full, I wanted to also use the first card, so I set gpu-index to 0,1 on the registration page. On startup it fails with the error "Remote server unixsocket".
-
I'm working on an attention backend based on `xformers` to improve performance on V100; is there anything I need to be aware of when doing so or should it be straightforward?
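One thing worth checking regardless of backend is numerical agreement with a reference implementation. A naive single-head attention in plain Python (a ground-truth sketch for validating a custom backend's output, not xformers code) looks like:

```python
import math

def reference_attention(q, k, v):
    """Naive softmax(Q K^T / sqrt(d)) V for one head, with Q/K/V as
    nested lists. Useful as a ground truth when validating a custom
    attention backend against small inputs."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                  for kj in k]
        m = max(scores)                      # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(reference_attention(q, k, v))
```

Comparing a backend's output to this on small random tensors (within fp16-appropriate tolerances) catches scaling, masking, and layout bugs early.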
-
### Motivation
When doing w8a8 quantization in the PyTorch engine, I found that the InternLM2 modeling code names its submodules differently from LLaMA-style models: it uses self.attention, self.feed_forward, and so on:
```python
class InternLM2DecoderLayer(nn.Module)…
-
Does Llama 3 support inference and fine-tuning on multi-GPU machines? Could you please add some sample code for a single machine with multiple GPUs?