Open bao-xiaoyi opened 1 week ago
Hi @bao-xiaoyi, can you send me the command you ran for assert_packing_loss.py?
python assert_packing_loss.py /kas/kas_workspace/open_llm/DeepSeek-Coder-V2-Lite-Instruct
Additionally, when I use StarCoder2 for testing, an error is also reported: assert ( original_token_count == mk_token_count ), f"number of tokens for computing loss is different: original_token_count = {original_token_count}, mk_token_count={mk_token_count}"
When I use StarCoder2, original_token_count = 147277 and mk_token_count = 4014.
Hi @bao-xiaoyi, I think the reason for this error is that this model uses remote code (I mean, it is using modeling_deepseek.py). So you can do as follows:
- Directly copy all .py files from https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct/tree/main and save them to a folder, for example remote_deepseek (in the folder packing). Directly replace the function _get_unpad_data with the following monkey-patched code (this is equivalent to the monkey-patch modeling_deepseek._get_unpad_data = get_unpad_data). You can also download the remote_deepseek.zip I attached in this post.
import torch
import torch.nn.functional as F


def get_max_seqlen_in_batch(attention_mask):
    # attention_mask: B x N, where tokens of the i-th packed data point in a row are marked with i
    max_num = torch.max(attention_mask)
    counts = []
    for i in range(1, max_num + 1):
        counts.append(
            torch.sum(attention_mask == i, axis=-1)
        )  # shape: B, length of the data point masked with i
    result = torch.stack(counts, axis=1)
    result = result.flatten()
    # drop zero counts (rows that contain fewer than max_num packed data points)
    return result[result.nonzero()].squeeze(-1).to(dtype=torch.int32)


def _get_unpad_data(attention_mask):
    print("monkey-patched")
    seqlens_in_batch = get_max_seqlen_in_batch(
        attention_mask
    )  # instead of attention_mask.sum(dim=-1, dtype=torch.int32)
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    cu_seqlens = F.pad(
        torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)
    )
    return (
        indices,
        cu_seqlens,
        max_seqlen_in_batch,
    )
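To see what this does, here is a small toy check (not part of the repo, just an illustration, assuming the two functions above are defined in the current session). In a packed batch the attention mask marks the tokens of the i-th packed data point with the value i, and zeros are padding:

import torch

# one row packing two data points: lengths 2 (marked 1) and 3 (marked 2), plus one padding token
attention_mask = torch.tensor([[1, 1, 2, 2, 2, 0]])

print(get_max_seqlen_in_batch(attention_mask))
# tensor([2, 3], dtype=torch.int32)

indices, cu_seqlens, max_seqlen = _get_unpad_data(attention_mask)
print(indices)     # tensor([0, 1, 2, 3, 4])  -> positions of the non-padding tokens
print(cu_seqlens)  # tensor([0, 2, 5], dtype=torch.int32)  -> cumulative sequence lengths for flash-attn varlen
print(max_seqlen)  # 3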
About assert_packing_loss.py, you can change it as follows:
- in computing the loss of the original data, load the model using transformers.AutoModelForCausalLM
- in computing the loss of the packed data, load the model using DeepseekV2ForCausalLM (from remote_deepseek.modeling_deepseek import DeepseekV2ForCausalLM)
You will see that the loss results are almost the same; the difference is only 0.0021%.
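As a rough sketch (the argument names and dtype here are just placeholders, not the exact contents of assert_packing_loss.py), the two loading paths would look like this:

import torch
import transformers
# local copy of the remote code saved under the packing folder, as described above
from remote_deepseek.modeling_deepseek import DeepseekV2ForCausalLM

model_path = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"

# loss on the original (unpacked) data: the stock auto class with remote code
original_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

# loss on the packed data: load through the local, monkey-patched modeling_deepseek.py
packed_model = DeepseekV2ForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)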
You can also download the assert_packing_loss.py I provided in this post.
@bao-xiaoyi for starcoder, which base_model did you use? I tested the following command and it works:
python assert_packing_loss.py bigcode/starcoder2-7b
I chose the 15b model, and the average loss is a bit large
I don't quite understand why local code has to be used when packing, while remote code can be used when not packing. Why doesn't modeling_deepseek._get_unpad_data = get_unpad_data work?
Moreover, the time-consumption comparison does not seem as dramatic as shown in the README. I tested DeepSeek using the code you modified, and the time comparison is 18.712671 vs 7.400667, or 9.163215 vs 6.737796.
@bao-xiaoyi I think directly monkey-patching remote code (trust_remote_code=True) doesn't work; to find out the reason, we would have to dig deeper into how transformers implements this feature, which I haven't investigated.
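For what it's worth, my guess (untested, so take it as an assumption rather than an explanation) is that trust_remote_code=True imports the downloaded modeling_deepseek.py as a dynamic transformers_modules.* module, so patching a separately imported modeling_deepseek never touches the copy the model actually uses. If someone wants to experiment, a sketch that patches the module the loaded class really came from (assuming the _get_unpad_data defined earlier in this thread is in scope) could look like:

import sys
import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# the remote code was imported as a dynamic module; find and patch that exact module object
remote_module = sys.modules[type(model).__module__]
remote_module._get_unpad_data = _get_unpad_data  # the patched function from the snippet above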
By the way, I have just run:
python original_assert.py bigcode/starcoder2-15b
no errors were found. The result: the difference between the losses is only 0.0011%.
Can you provide the time comparison results of your testing on DeepSeek? Thank you very much.
Running python assert_packing_loss.py deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct with the assert_packing_loss.py I sent you above:
time for computing the loss without packing: 9.336643
time for computing the loss with packing: 2.348312
RuntimeError: CUDA error: an illegal memory access was encountered
Looking forward to the expert's answer