DeepSeekV2Tokenizer should use padding_side="right" in __init__()!

alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.

Apache License 2.0


DeepSeekV2Tokenizer should use padding_side="right" in __init__()! #368

Open pqhgit opened 3 weeks ago

pqhgit commented 3 weeks ago

_DeepSeekV2Tokenizer.__init__() does not currently set padding_side="right". As a result, the labels end up the same as input_ids, and label[:source_len] = self.IGNORE_INDEX has no effect.

The buggy code is below, in megatron_patch/tokenizer/__init__.py:

    class _DeepSeekV2Tokenizer(MegatronTokenizer):
        def __init__(self, tokenizer_path, extra_vocab_size):
            super().__init__(tokenizer_path)
            # padding_side is not passed here, so the tokenizer's default is used
            self.tokenizer = AutoTokenizer.from_pretrained(
                tokenizer_path,
                trust_remote_code=True
            )
            self.extra_vocab_size = extra_vocab_size

jerryli1981 commented 3 weeks ago

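Hello, I don't fully understand what you described. Would you mind joining the group and adding me on DingTalk so we can discuss it in detail?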

jerryli1981 commented 3 weeks ago

For SFT, we now process the raw data with the new template-based approach: https://github.com/alibaba/Pai-Megatron-Patch/blob/main/megatron_patch/data/llama_sft.py

pqhgit commented 3 weeks ago

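@jerryli1981 Hi, I ran into this in the old version: _DeepSeekV2Tokenizer is initialized without padding_side='right', so the default left padding is used. That breaks the later label-processing logic: label[:source_len] = self.IGNORE_INDEX does not take effect as intended. I will try again with the new version.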
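To illustrate the kind of failure described above, here is a minimal sketch in plain Python. It is not the repo's code: the pad() and build_labels() helpers, the pad id 0, the IGNORE_INDEX value of -100, and the token ids are all assumptions made up for this example. It only shows why masking labels[:source_len] assumes the prompt starts at index 0, which holds with right padding but not with left padding.

```python
IGNORE_INDEX = -100  # assumed value; a common convention for ignored label positions
PAD_ID = 0           # hypothetical pad token id


def pad(ids, max_len, padding_side):
    """Pad a list of token ids to max_len on the given side (hypothetical helper)."""
    padding = [PAD_ID] * (max_len - len(ids))
    return ids + padding if padding_side == "right" else padding + ids


def build_labels(input_ids, source_len):
    """Copy input_ids and mask the first source_len positions, as in the issue."""
    labels = list(input_ids)
    labels[:source_len] = [IGNORE_INDEX] * source_len
    return labels


source = [11, 12, 13]  # made-up prompt tokens
target = [21, 22]      # made-up response tokens
source_len = len(source)

right = pad(source + target, 8, "right")
left = pad(source + target, 8, "left")

labels_right = build_labels(right, source_len)
labels_left = build_labels(left, source_len)

print(labels_right)  # [-100, -100, -100, 21, 22, 0, 0, 0]
print(labels_left)   # [-100, -100, -100, 11, 12, 13, 21, 22]
```

With right padding the mask covers exactly the prompt tokens. With left padding the mask lands on the pad tokens instead, so the prompt tokens stay in the labels and would contribute to the loss.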

jerryli1981 commented 3 weeks ago

padding_side='right'

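Hello, I think what you found is indeed a bug. We re-checked all the tokenizers and found that only the DeepSeek one was missing padding_side='right'. We are very sorry about that. We have fixed it in a PR; please take a look: https://github.com/alibaba/Pai-Megatron-Patch/pull/370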
