Hey, thanks for the pull request and for pointing out the potential issue with the ordering of the special tokens! I'm a bit confused by your first fix. Right now, if `--no_new_tokens` is set, then `tokenizer._vocab` will never be expanded with the special tokens. We just check whether, for some reason, the token is already there and update the `special_token_mask`, but we never add new tokens (here).
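For reference, a minimal sketch of the guard being described (its shape is inferred from the snippet proposed later in this thread; `new_tokens` is the flag derived from `--no_new_tokens`):

```python
def _add_special_token(t):
    # With --no_new_tokens (new_tokens == False), any token that is
    # not already in the vocab is silently skipped, so the vocab is
    # never expanded -- not even for the requested extra ids.
    if t not in self.vocab and not new_tokens:
        return
    ...  # otherwise register the token in the vocab / special-token tables
```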
Ohhh, now I remember why I did that. The reason for my changes was that if we set `--no_new_tokens`, the `_add_special_token` function will just do nothing for all these tokens (including the `vocab_extra_ids_list`). But the expected outcome in the mentioned issue (originally posted by @andreaskoepf in https://github.com/epfLLM/Megatron-LLM/issues/19#issuecomment-1677015358) should be: when `--no_new_tokens` is set, we skip adding the built-in special tokens (`<CLS>`, `<SEP>`, `<EOD>`, `<MASK>`), but we still add the `vocab_extra_ids` to the vocab when they are specified.
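In other words, something like this should hold (a sketch of the intended semantics; `build_tokenizer` and the exact attribute names are assumptions for illustration):

```python
# args has --no_new_tokens set and
# --vocab_extra_ids_list "<|im_start|>,<|im_end|>"
tok = build_tokenizer(args)  # hypothetical entry point

assert "<CLS>" not in tok.vocab      # built-in specials are NOT added
assert "<|im_start|>" in tok.vocab   # explicitly requested extras ARE added
assert "<|im_end|>" in tok.vocab
```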
Oh, I see the issue now; that makes sense. What about modifying `_add_special_token` like this instead:
```python
def _add_special_token(t, force=False):
    if t not in self.vocab and not new_tokens and not force:
        return
    if t not in self._vocab:
        next_id = len(self._vocab)
        self._vocab[t] = next_id
        self._inv_vocab[next_id] = t
    self._special_tokens[t] = self._vocab[t]
    self._inv_special_tokens[self._vocab[t]] = t
```
And call it with `force=True` when dealing with the `vocab_extra_ids_list` and `vocab_extra_ids`. That way, `_cls_id` and `_sep_id` will be set to `None` anyway, and we shouldn't have to modify `megatron_to_hf.py` (only to fix the token ordering)?
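For concreteness, the call sites would then look roughly like this (a sketch; the parsing and loop details are assumptions, not the exact patch):

```python
# Built-in special tokens: skipped when --no_new_tokens is set.
_add_special_token("<CLS>")
_add_special_token("<SEP>")
_add_special_token("<EOD>")
_add_special_token("<MASK>")

# Explicitly requested tokens: always added, even with --no_new_tokens.
for tok in (vocab_extra_ids_list or "").split(","):
    if tok:
        _add_special_token(tok, force=True)
for i in range(vocab_extra_ids):
    _add_special_token(f"<extra_id_{i}>", force=True)
```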
Hey @AleHD, thanks for the suggestion. I've adjusted the code to use `force=True` so that we can implement this with minimal code change.
- Support using `--no_new_tokens` to skip adding the built-in special tokens (`<CLS>`, `<SEP>`, `<EOD>`, `<MASK>`), which might not be necessary (originally posted by @andreaskoepf in https://github.com/epfLLM/Megatron-LLM/issues/19#issuecomment-1677015358).
- Fix the new-token ordering issue in the `weights_conversion/megatron_to_hf.py` conversion script: when `--vocab_extra_ids_list "<|im_start|>,<|im_end|>"` is set, `hf_tokenizer.add_special_tokens(special_tokens_dict=special_tokens, replace_additional_special_tokens=True)` will add the extra ids to the `hf_tokenizer` in a different order, since the HF tokenizer implementation converts the `["<|im_start|>", "<|im_end|>"]` list to a set before adding the special tokens, which can mess up the order (Megatron-LLM seems to add these special tokens following the list order).
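A minimal sketch of an order-preserving workaround (the model path is a placeholder, and the one-token-at-a-time loop is an assumption about the fix, not the exact patch):

```python
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("path/to/converted_model")  # placeholder

# Adding the tokens one at a time preserves the user-supplied list
# order, whereas handing the whole list to add_special_tokens may
# reorder it if the implementation de-duplicates through a set.
for tok in ["<|im_start|>", "<|im_end|>"]:
    hf_tokenizer.add_tokens(tok, special_tokens=True)
```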