alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
Apache License 2.0

DeepSeekV2Tokenizer should use padding_side="right" in __init__()! #368

Open pqhgit opened 3 weeks ago

pqhgit commented 3 weeks ago

DeepSeekV2Tokenizer's __init__() currently does not set padding_side="right". As a result, the labels end up identical to input_ids, and label[:source_len] = self.IGNORE_INDEX has no effect.

The buggy code is below:

megatron_patch/tokenizer/__init__.py

    class _DeepSeekV2Tokenizer(MegatronTokenizer):
        def __init__(self, tokenizer_path, extra_vocab_size):
            super().__init__(tokenizer_path)
            self.tokenizer = AutoTokenizer.from_pretrained(
                tokenizer_path,
                trust_remote_code=True
            )
            self.extra_vocab_size = extra_vocab_size

jerryli1981 commented 3 weeks ago

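Hello, I did not fully understand the problem you described. Would you mind joining the group and adding me on DingTalk so we can discuss it in detail?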

jerryli1981 commented 3 weeks ago

For preprocessing the raw data during SFT, we now use the new template-based approach everywhere: https://github.com/alibaba/Pai-Megatron-Patch/blob/main/megatron_patch/data/llama_sft.py

pqhgit commented 3 weeks ago

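@jerryli1981 Hello, I ran into this problem in an older version. _DeepSeekV2Tokenizer is initialized without padding_side='right', so it falls back to the default left padding, which breaks the label handling that follows: label[:source_len] = self.IGNORE_INDEX does not take effect as intended. I will try the new version and see.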
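A minimal sketch of how left padding can defeat this kind of prefix mask. It is not the repository's code: the helper names, the pad id, and the value -100 for IGNORE_INDEX are assumptions made for illustration, and only the line `label[:source_len] = IGNORE_INDEX` is taken from the thread.

```python
IGNORE_INDEX = -100  # assumed value; the thread only names self.IGNORE_INDEX
PAD_ID = 0           # hypothetical pad token id


def pad(ids, max_len, side):
    """Pad a list of token ids to max_len on the given side."""
    padding = [PAD_ID] * (max_len - len(ids))
    return ids + padding if side == "right" else padding + ids


def build_labels(source_ids, target_ids, max_len, side):
    """Copy input_ids into labels, then mask the first source_len positions."""
    input_ids = pad(source_ids + target_ids, max_len, side)
    labels = list(input_ids)
    source_len = len(source_ids)
    # Same idea as label[:source_len] = self.IGNORE_INDEX in the thread.
    labels[:source_len] = [IGNORE_INDEX] * source_len
    return input_ids, labels


source, target = [11, 12, 13], [21, 22]  # made-up prompt and answer token ids
_, right_labels = build_labels(source, target, max_len=8, side="right")
_, left_labels = build_labels(source, target, max_len=8, side="left")

print(right_labels)  # prompt positions are masked, answer tokens remain
print(left_labels)   # only pad positions are masked, prompt tokens remain
```

With right padding the prompt occupies the first source_len positions, so the slice masks exactly the prompt. With left padding those positions hold pad tokens, so in this toy case every real token, prompt included, stays in the labels, which is consistent with the reported symptom of labels matching input_ids.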

jerryli1981 commented 3 weeks ago

padding_side='right'

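Hello, I think what you found is indeed a bug. We re-checked all of the tokenizers and found that only the DeepSeek one was missing padding_side='right'. Our apologies. We have fixed it in a PR, please take a look: https://github.com/alibaba/Pai-Megatron-Patch/pull/370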
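The PR's diff is not reproduced in this thread, so the following is only a sketch of what such a fix might look like, assuming it passes padding_side="right" through to AutoTokenizer.from_pretrained. The function name and the `loader` parameter are hypothetical; `loader` exists only so the sketch can be exercised without downloading a model.

```python
def load_right_padded_tokenizer(tokenizer_path, loader=None):
    """Load a tokenizer with padding_side pinned to "right".

    `loader` is a hypothetical injection point; when omitted it defaults to
    transformers.AutoTokenizer.from_pretrained.
    """
    if loader is None:
        # Imported lazily so the function can be tested with a stub loader.
        from transformers import AutoTokenizer
        loader = AutoTokenizer.from_pretrained
    return loader(
        tokenizer_path,
        padding_side="right",    # the argument the issue reports as missing
        trust_remote_code=True,  # kept from the snippet quoted in the issue
    )
```

Setting the side explicitly at load time means the label-masking code does not depend on whatever default the checkpoint's tokenizer config happens to carry.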