NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Why does the tokenizer of mamba-2-hybrid have two ids for the token 'Yes'? id 24639 and id 7298 #889

Closed Mooler0410 closed 1 day ago

Mooler0410 commented 3 days ago

Hi! I found that ids 24639 and 7298 are both decoded to the same token 'Yes' by the mamba-2-hybrid tokenizer.

    >>> tokenizer.detokenize([24639]) == tokenizer.detokenize([7298])
    True

Also:

    >>> tokenizer.detokenize([24639])
    'Yes'
    >>> tokenizer.detokenize([7298])
    'Yes'

I have always thought that different ids correspond to different tokens. Is there anything wrong with my understanding?
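
For what it's worth, one way to check whether the two ids really are distinct vocabulary entries, assuming the checkpoint uses a SentencePiece tokenizer and its `tokenizer.model` file is at hand (a minimal sketch, not verified against this checkpoint), is to compare the raw pieces instead of the detokenized strings:

    import sentencepiece as spm

    # Hypothetical path; substitute the actual SentencePiece model
    # shipped with mamba-2-hybrid.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

    for token_id in (24639, 7298):
        # id_to_piece() returns the raw vocabulary entry, while decode()
        # maps the word-boundary marker '▁' to a space and strips it at
        # the start of the output, so the pieces 'Yes' and '▁Yes' can
        # both decode to the same plain string 'Yes'.
        print(token_id, repr(sp.id_to_piece(token_id)), repr(sp.decode([token_id])))

If the two pieces differ only in the leading '▁' marker, both ids would be legitimate, distinct tokens that happen to render identically after detokenization.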

Thanks!

Mooler0410 commented 3 days ago

https://github.com/NVIDIA/Megatron-LM/tree/main/examples/mamba

I used this script to try mamba-2, and I inserted three lines:

    from megatron.training import get_tokenizer
    tokenizer = get_tokenizer()
    import pdb; pdb.set_trace()  # drop into a debugger with the tokenizer in scope

before this line, to test the tokenizer: https://github.com/NVIDIA/Megatron-LM/blob/e33c8f78a35765d5aa37475a144da60e8a2349d1/tools/run_mamba_text_generation_server.py#L105
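
At the resulting pdb prompt, the raw pieces can also be inspected through Megatron's wrapper. Assuming the example uses `GPTSentencePieceTokenizer`, whose underlying SentencePiece processor is exposed as its `tokenizer` attribute (an assumption about Megatron internals worth double-checking), something like:

    (Pdb) tokenizer.tokenizer.id_to_piece(24639)
    (Pdb) tokenizer.tokenizer.id_to_piece(7298)

If one of the pieces carries the leading '▁' word-boundary marker, `detokenize` collapsing both to 'Yes' would be expected SentencePiece behavior rather than a bug.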