axolotl-ai-cloud / axolotl

Add resize_token_embeddings feature #1965

Open ccdv-ai opened 1 month ago

ccdv-ai commented 1 month ago

⚠️ Please check that this feature request hasn't been suggested before.

🔖 Feature description

Add an option to resize the token embeddings; PreTrainedModel already exposes a resize_token_embeddings method for this.

✔️ Solution

from transformers import AutoModelForCausalLM

# model_id and num_tokens stand in for the checkpoint name and the target vocabulary size
model = AutoModelForCausalLM.from_pretrained(model_id)
model.resize_token_embeddings(num_tokens)

❓ Alternatives

No response

📝 Additional Context

No response

Acknowledgements

NanoCode012 commented 1 month ago

May I ask what the use case is? We currently resize to the tokenizer's length (or to the next multiple of 32 of it, if enabled).

ccdv-ai commented 3 weeks ago

This can be useful when a tokenizer (or custom tokenizer) and the model's vocab_size do not match. For instance, all Qwen 2.5 models have such a mismatch because the embedding layer was padded for distributed training.
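
As a minimal sketch of how to see the mismatch (the model id and the printed values are assumptions based on the numbers quoted later in this thread, not verified here):

from transformers import AutoConfig, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B"  # example checkpoint, assumed for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

# The tokenizer produces fewer token ids than the embedding matrix has rows.
print(len(tokenizer))     # ~151665 according to this thread
print(config.vocab_size)  # 152064, padded for distributed training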

NanoCode012 commented 3 weeks ago

@ccdv-ai, just to clarify: would you want a config that lets you specify the new tokenizer vocab size, or just the ability to resize?

Axolotl already does the latter under the hood when you add new tokens. https://github.com/axolotl-ai-cloud/axolotl/blob/8c3a727f9d60ffd3af385f90bcc3fa3a56398fe1/src/axolotl/utils/models.py#L1039-L1053

If you enable resize_token_embeddings_to_32x: true, it will resize to the next multiple of 32.
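
For reference, a minimal sketch of what rounding to the next multiple of 32 looks like (my own illustration, not Axolotl's implementation; the pad_to_multiple_of argument assumes a reasonably recent transformers release):

embeddings_len = len(tokenizer)

# Round up manually, e.g. 151665 -> 151680.
padded_len = ((embeddings_len + 31) // 32) * 32
model.resize_token_embeddings(padded_len)

# transformers can also do the padding itself:
model.resize_token_embeddings(embeddings_len, pad_to_multiple_of=32)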

ccdv-ai commented 3 weeks ago

@NanoCode012 Only an option to resize the token embeddings to an arbitrary value. For example, the Qwen 2.5 7B tokenizer has 151665 tokens but the embedding layer has 152064 rows. Setting resize_token_embeddings_to: 151665 should be possible.

if self.cfg.resize_token_embeddings_to < len(self.tokenizer):
    # warn or stop: the requested size is smaller than the tokenizer vocabulary
    raise ValueError("resize_token_embeddings_to is smaller than the tokenizer length")
self.model.resize_token_embeddings(self.cfg.resize_token_embeddings_to, **resize_kwargs)

NanoCode012 commented 3 weeks ago

@ccdv-ai thanks for clarifying. To add to your point, we already resize to the tokenizer's length.

# above code, summarized
embeddings_len = len(self.tokenizer)

if self.model.get_input_embeddings().num_embeddings < embeddings_len:
    self.model.resize_token_embeddings(embeddings_len)

For resizing to another value (!= len(self.tokenizer)), I'm not sure I understand the use case, as the tokenizer would then mismatch the embedding length and cause an error during training.
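
To make the failure mode concrete, here is a small illustration of my own (not code from this thread): if the embedding matrix ends up smaller than the largest token id the tokenizer can emit, the lookup fails at training time.

import torch

# 151665 rows means the largest valid token id is 151664.
emb = torch.nn.Embedding(num_embeddings=151665, embedding_dim=8)

emb(torch.tensor([151664]))  # fine
emb(torch.tensor([152000]))  # IndexError: index out of range in self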