NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Why does the tokenizer of mamba-2-hybrid have two ids for the token 'Yes'? id 24639 and id 7298 #889

Closed Mooler0410 closed 1 day ago

Mooler0410 commented 3 days ago

Hi! I found that ids 24639 and 7298 are both decoded to the same token 'Yes' by the mamba-2-hybrid tokenizer.

    >>> tokenizer.detokenize([24639]) == tokenizer.detokenize([7298])
    True

Also:

    >>> tokenizer.detokenize([24639])
    'Yes'
    >>> tokenizer.detokenize([7298])
    'Yes'

I have always thought that different ids correspond to different tokens. Is there anything wrong with my understanding?
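
For what it's worth, one way to check whether the two ids really are distinct vocabulary entries, assuming the checkpoint uses a SentencePiece tokenizer and its `tokenizer.model` file is at hand (a minimal sketch, not verified against this checkpoint), is to compare the raw pieces instead of the detokenized strings:

    import sentencepiece as spm

    # Hypothetical path; substitute the actual SentencePiece model
    # shipped with mamba-2-hybrid.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

    for token_id in (24639, 7298):
        # id_to_piece() returns the raw vocabulary entry, while decode()
        # maps the word-boundary marker '▁' to a space and strips it at
        # the start of the output, so the pieces 'Yes' and '▁Yes' can
        # both decode to the same plain string 'Yes'.
        print(token_id, repr(sp.id_to_piece(token_id)), repr(sp.decode([token_id])))

If the two pieces differ only in the leading '▁' marker, both ids would be legitimate, distinct tokens that happen to render identically after detokenization.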

Thanks!

Mooler0410 commented 3 days ago

https://github.com/NVIDIA/Megatron-LM/tree/main/examples/mamba

I used this script to try mamba-2, and I inserted three lines:

    from megatron.training import get_tokenizer
    tokenizer = get_tokenizer()
    import pdb; pdb.set_trace()  # drop into a debugger with the tokenizer in scope

before this line, to test the tokenizer: https://github.com/NVIDIA/Megatron-LM/blob/e33c8f78a35765d5aa37475a144da60e8a2349d1/tools/run_mamba_text_generation_server.py#L105
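
At the resulting pdb prompt, the raw pieces can also be inspected through Megatron's wrapper. Assuming the example uses `GPTSentencePieceTokenizer`, whose underlying SentencePiece processor is exposed as its `tokenizer` attribute (an assumption about Megatron internals worth double-checking), something like:

    (Pdb) tokenizer.tokenizer.id_to_piece(24639)
    (Pdb) tokenizer.tokenizer.id_to_piece(7298)

If one of the pieces carries the leading '▁' word-boundary marker, `detokenize` collapsing both to 'Yes' would be expected SentencePiece behavior rather than a bug.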