Hey, thanks for the pull request and for pointing out the potential issue with the ordering of the special tokens! I'm a bit confused by your first fix. Right now, if `--no_new_tokens` is set, then `tokenizer._vocab` will never be expanded with the special tokens. We just check whether, for some reason, the token is already there and update the `special_token_mask`, but we never add new tokens (here).
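For reference, a minimal sketch of the guard being described (its shape is inferred from the snippet proposed later in this thread; `new_tokens` is the flag derived from `--no_new_tokens`):

```python
def _add_special_token(t):
    # With --no_new_tokens (new_tokens == False), any token that is
    # not already in the vocab is silently skipped, so the vocab is
    # never expanded -- not even for the requested extra ids.
    if t not in self.vocab and not new_tokens:
        return
    ...  # otherwise register the token in the vocab / special-token tables
```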
Ohhh, now I remember why I did that. The reason for my changes was that if we set `--no_new_tokens`, the `_add_special_token` function will just do nothing for all these tokens (including the `vocab_extra_ids_list`). But the expected outcome in the mentioned issue (originally posted by @andreaskoepf in https://github.com/epfLLM/Megatron-LLM/issues/19#issuecomment-1677015358) should be: when `--no_new_tokens` is set, we skip adding the built-in special tokens (`<CLS>`, `<SEP>`, `<EOD>`, `<MASK>`), but we still add the `vocab_extra_ids` to the vocab when they are specified.
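In other words, something like this should hold (a sketch of the intended semantics; `build_tokenizer` and the exact attribute names are assumptions for illustration):

```python
# args has --no_new_tokens set and
# --vocab_extra_ids_list "<|im_start|>,<|im_end|>"
tok = build_tokenizer(args)  # hypothetical entry point

assert "<CLS>" not in tok.vocab      # built-in specials are NOT added
assert "<|im_start|>" in tok.vocab   # explicitly requested extras ARE added
assert "<|im_end|>" in tok.vocab
```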
Oh, I see the issue now; that makes sense. What about modifying `_add_special_token` like this instead:
```python
def _add_special_token(t, force=False):
    if t not in self.vocab and not new_tokens and not force:
        return
    if t not in self._vocab:
        next_id = len(self._vocab)
        self._vocab[t] = next_id
        self._inv_vocab[next_id] = t
    self._special_tokens[t] = self._vocab[t]
    self._inv_special_tokens[self._vocab[t]] = t
```
And call it with `force=True` when dealing with the `vocab_extra_ids_list` and `vocab_extra_ids`. That way, `_cls_id` and `_sep_id` will be set to `None` anyway, and we shouldn't have to modify `megatron_to_hf.py` (only to fix the token ordering)?
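For concreteness, the call sites would then look roughly like this (a sketch; the parsing and loop details are assumptions, not the exact patch):

```python
# Built-in special tokens: skipped when --no_new_tokens is set.
_add_special_token("<CLS>")
_add_special_token("<SEP>")
_add_special_token("<EOD>")
_add_special_token("<MASK>")

# Explicitly requested tokens: always added, even with --no_new_tokens.
for tok in (vocab_extra_ids_list or "").split(","):
    if tok:
        _add_special_token(tok, force=True)
for i in range(vocab_extra_ids):
    _add_special_token(f"<extra_id_{i}>", force=True)
```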
Hey @AleHD, thanks for the suggestion. I've adjusted the code to use `force=True` so that we can implement this with minimal code change.
- Support using `--no_new_tokens` to skip adding the built-in special tokens (`<CLS>`, `<SEP>`, `<EOD>`, `<MASK>`), which might not be necessary (originally posted by @andreaskoepf in https://github.com/epfLLM/Megatron-LLM/issues/19#issuecomment-1677015358).
- Fix the new-token ordering issue in the `weights_conversion/megatron_to_hf.py` conversion script: when `--vocab_extra_ids_list "<|im_start|>,<|im_end|>"` is set, `hf_tokenizer.add_special_tokens(special_tokens_dict=special_tokens, replace_additional_special_tokens=True)` will add the extra ids to the `hf_tokenizer` in a different order, since the HF tokenizer implementation converts the `["<|im_start|>", "<|im_end|>"]` list to a set before adding the special tokens, which can mess up the order (Megatron-LLM seems to add these special tokens following the list order).
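A minimal sketch of an order-preserving workaround (the model path is a placeholder, and the one-token-at-a-time loop is an assumption about the fix, not the exact patch):

```python
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("path/to/converted_model")  # placeholder

# Adding the tokens one at a time preserves the user-supplied list
# order, whereas handing the whole list to add_special_tokens may
# reorder it if the implementation de-duplicates through a set.
for tok in ["<|im_start|>", "<|im_end|>"]:
    hf_tokenizer.add_tokens(tok, special_tokens=True)
```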