ccdv-ai opened this issue 1 month ago (Open)
May I ask what the use case is? We currently resize to the tokenizer's length (or to the next multiple of 32 of it, if enabled).
It can be useful when there is a mismatch between a (custom) tokenizer and the model's vocab_size. For instance, all Qwen 2.5 models have such a mismatch because the embedding layer was padded for distributed training.
@ccdv-ai, just to clarify: would you want a config option that lets you specify the new vocab size, or just to resize?
Axolotl already does the latter under the hood when you add new tokens: https://github.com/axolotl-ai-cloud/axolotl/blob/8c3a727f9d60ffd3af385f90bcc3fa3a56398fe1/src/axolotl/utils/models.py#L1039-L1053
If you enable `resize_token_embeddings_to_32x: true`, it will resize to the next multiple of 32.
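For reference, here is a minimal sketch of what rounding the embedding matrix up to a multiple of 32 can look like with the `transformers` API. The model name and the use of `pad_to_multiple_of` below are illustrative assumptions, not a claim about Axolotl's exact implementation:

```python
# Hypothetical sketch: resize embeddings to len(tokenizer), rounded up to a multiple of 32.
# Assumes a transformers version that supports pad_to_multiple_of in resize_token_embeddings.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # small, ungated model chosen purely for illustration
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)
# 50257 tokenizer tokens -> embedding matrix padded to 50272 rows (next multiple of 32)
print(len(tokenizer), model.get_input_embeddings().num_embeddings)
```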
@NanoCode012 Just an option to resize the token embeddings to an arbitrary value.
For example, the Qwen 2.5 7B tokenizer has 151665 tokens, but the embedding layer has 152064 rows.
Something like `resize_token_embeddings_to: 151665` should be possible:
```python
if self.cfg.resize_token_embeddings_to < len(self.tokenizer):
    ...  # warn or stop: the embedding matrix would be smaller than the tokenizer
self.model.resize_token_embeddings(self.cfg.resize_token_embeddings_to, **resize_kwargs)
```
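As a quick way to see the mismatch mentioned above, one could compare the tokenizer length with the configured vocab size; the model id and the expected numbers below are taken from this thread and should be treated as assumptions:

```python
# Hypothetical check of the tokenizer / embedding-size mismatch for Qwen 2.5 7B.
from transformers import AutoConfig, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)  # config only, avoids downloading the weights

print(len(tokenizer))     # ~151665 tokens in the tokenizer (per this thread)
print(config.vocab_size)  # 152064 rows in the padded embedding layer (per this thread)
```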
@ccdv-ai thanks for clarifying. To add to your point, we already resize to the tokenizer length:
```python
# the code above, summarized
embeddings_len = len(self.tokenizer)
if self.model.get_input_embeddings().num_embeddings < embeddings_len:
    self.model.resize_token_embeddings(embeddings_len)
```
For resizing to another value (`!= len(self.tokenizer)`), I'm not sure I understand the use case, as the tokenizer would then mismatch the embedding length and cause an error during training.
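For context on that last point, a toy sketch (plain PyTorch, illustrative sizes) of the failure mode when the embedding matrix has fewer rows than the ids the tokenizer can emit:

```python
# Toy illustration: an embedding with fewer rows than the tokenizer's id range.
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=100, embedding_dim=16)  # "shrunk" embedding matrix

token_ids = torch.tensor([[5, 42, 150]])  # 150 is a valid tokenizer id here, but >= 100
try:
    embed(token_ids)
except IndexError as err:
    print("training would fail here:", err)  # index out of range in self
```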
🔖 Feature description
Add the option to resize the token embeddings: `PreTrainedModel` has this method.

✔️ Solution
❓ Alternatives
No response
📝 Additional Context
No response