alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
Apache License 2.0

tokenize with qwen2tokenizer in megatron_patch/tokenizer/__init__.py #252

Closed · lclkent closed this issue 3 months ago

lclkent commented 3 months ago

I'm a bit lost with the preprocess data tool. XD Why wrap the Qwen2 tokenizer in MegatronTokenizer instead of using the Hugging Face tokenizer directly? Also, I wasn't able to find a class named MegatronTokenizer in the Megatron project, so I suspect a code version mismatch. Can someone explain this?

jerryli1981 commented 3 months ago

https://github.com/alibaba/Pai-Megatron-Patch/blob/main/megatron_patch/tokenizer/__init__.py#L171

jerryli1981 commented 3 months ago

Hi, because of the idxmap dataset in the latest Megatron-LM, we have to wrap the Hugging Face tokenizer as `class _Qwen2Tokenizer(MegatronTokenizer):` so it exposes the interface the dataset code expects.
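
For context, here is a minimal sketch of what such a wrapper can look like, assuming a recent Megatron-LM where the base class lives at `megatron.core.datasets.megatron_tokenizer` (older releases do not ship this class, which would explain not finding it). The class name `_Qwen2TokenizerSketch` and the `extra_vocab_size` parameter are illustrative only, not the exact Pai-Megatron-Patch implementation:

```python
# Sketch: wrap a Hugging Face tokenizer so it satisfies the MegatronTokenizer
# interface used by Megatron-LM's idxmap (indexed) datasets.
# Assumption: recent Megatron-LM with megatron.core.datasets.megatron_tokenizer.
from transformers import AutoTokenizer
from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer


class _Qwen2TokenizerSketch(MegatronTokenizer):
    def __init__(self, tokenizer_path, extra_vocab_size=0):
        super().__init__(tokenizer_path)
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        self.extra_vocab_size = extra_vocab_size  # hypothetical padding of the vocab

    # The dataset/preprocessing code calls tokenize()/detokenize() and reads
    # vocab_size and eod, so the wrapper forwards these to the HF tokenizer.
    def tokenize(self, text):
        return self.tokenizer(text).input_ids

    def detokenize(self, token_ids):
        return self.tokenizer.decode(token_ids)

    @property
    def vocab_size(self):
        return self.tokenizer.vocab_size + self.extra_vocab_size

    @property
    def vocab(self):
        return self.tokenizer.get_vocab()

    @property
    def inv_vocab(self):
        return {v: k for k, v in self.tokenizer.get_vocab().items()}

    @property
    def eod(self):
        # Qwen2 has no dedicated end-of-document token; eos is commonly reused.
        return self.tokenizer.eos_token_id
```

In other words, the Hugging Face tokenizer still does the actual tokenization; the wrapper just adapts it to the `MegatronTokenizer` properties (vocab, vocab_size, eod, ...) that the idxmap dataset builder queries.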