IamAdiSri / hf-trim

Reduce the size of pretrained Hugging Face models via vocabulary trimming.
Mozilla Public License 2.0

Certain models require token id changes in their configs #4

Open IamAdiSri opened 1 year ago

IamAdiSri commented 1 year ago

The regular MBart model (as opposed to MBart-50), for example, has a config property decoder_start_token_id that needs to be updated after the model is trimmed, because the model pulls this id from the config during the decoding phase.

This change can be made via the following snippet:

# Look up the old start token's string in the original tokenizer,
# then resolve it to its new id in the trimmed tokenizer.
mt.trimmed_model.config.update({
    'decoder_start_token_id': tt.trimmed_tokenizer.convert_tokens_to_ids(
        tt.tokenizer.convert_ids_to_tokens(mt.model.config.decoder_start_token_id)
    )
})

It is highly likely that other models have this problem as well, and accounting for it will require some breaking changes. Leaving this up as an issue to fix in release 4.

avacaondata commented 1 year ago

Does this affect mt5 models also? @IamAdiSri

IamAdiSri commented 1 year ago

@avacaondata Hi! Looking at the code in the Hugging Face repository, this does affect mt5 models. Even if it didn't, you could run the fix above to be safe; it shouldn't cause any issues.

Here are the steps to follow to run the trimmed models as intended:

  1. Load the model and trim it.
  2. Update the decoder_start_token_id in the config, as shown above.
  3. Save the model and tokenizer. (optional)
  4. Reload a new instance of the model and tokenizer for use. (optional)

Saving the trimmed model and loading a new instance lets you discard the full model and free up memory, so I generally recommend doing that. A rough sketch of the whole sequence follows.
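A minimal end-to-end sketch, assuming the MBart setup from the README; the data list, checkpoint name, and output path ('trimmed-mbart') are placeholders, and exact class names and import paths may differ across hftrim versions:

from transformers import MBartConfig, MBartForConditionalGeneration, MBartTokenizer
from hftrim.TokenizerTrimmer import TokenizerTrimmer
from hftrim.ModelTrimmers import MBartTrimmer

data = ["..."]  # placeholder: the text whose vocabulary you want to keep

tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-cc25')
config = MBartConfig.from_pretrained('facebook/mbart-large-cc25')
model = MBartForConditionalGeneration.from_pretrained('facebook/mbart-large-cc25')

# 1. Load the model and trim it.
tt = TokenizerTrimmer(tokenizer)
tt.make_vocab(data)
tt.make_tokenizer()
mt = MBartTrimmer(model, config, tt.trimmed_tokenizer)
mt.make_weights(tt.trimmed_vocab_ids)
mt.make_model()

# 2. Remap decoder_start_token_id into the trimmed vocabulary.
mt.trimmed_model.config.update({
    'decoder_start_token_id': tt.trimmed_tokenizer.convert_tokens_to_ids(
        tt.tokenizer.convert_ids_to_tokens(mt.model.config.decoder_start_token_id)
    )
})

# 3. Save the trimmed model and tokenizer.
tt.trimmed_tokenizer.save_pretrained('trimmed-mbart')
mt.trimmed_model.save_pretrained('trimmed-mbart')

# 4. Reload fresh instances and let the full model go out of scope.
tokenizer = MBartTokenizer.from_pretrained('trimmed-mbart')
model = MBartForConditionalGeneration.from_pretrained('trimmed-mbart')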

avacaondata commented 1 year ago

Okay, great, I will try that out, thanks! @IamAdiSri

The thing is, shouldn't we preserve special tokens (such as the decoder start token id) when trimming the tokenizer? I mean, we want to keep only the tokens present in a certain vocabulary, plus the special tokens, which are typically at the beginning of the vocabulary (idx < 150); these are used in all cases, no matter which data you use for trimming.

IamAdiSri commented 1 year ago

@avacaondata That is exactly what the library does. We keep all the special tokens; however, after the model is trimmed, their indices in the embedding matrix may change. So we update the model config to tell it the new indices of those same special tokens, and then it can reuse them just as before.

The issue is that Hugging Face has multiple mechanisms for special tokens. A lot of the time the model has a default id for a special token, or it asks the tokenizer for the id; hftrim already preserves both of these cases. However, in some cases the token id is read from the config, which the library does not currently update, so you have to do it manually. I'll be fixing this in the next release so that this case is also taken care of automatically.
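As an illustration, here is a hypothetical helper (not part of hftrim) that generalizes the fix above to any config-level token ids; the list of keys is an assumption and may need adjusting per model:

def remap_config_token_ids(mt, tt, keys=('bos_token_id', 'eos_token_id',
                                         'pad_token_id', 'decoder_start_token_id')):
    # Hypothetical helper: for each config-level token id, look up the
    # token string in the original tokenizer, resolve its new id in the
    # trimmed tokenizer, and write the remapped ids back to the config.
    updates = {}
    for key in keys:
        old_id = getattr(mt.model.config, key, None)
        if old_id is None:
            continue
        token = tt.tokenizer.convert_ids_to_tokens(old_id)
        updates[key] = tt.trimmed_tokenizer.convert_tokens_to_ids(token)
    mt.trimmed_model.config.update(updates)

Usage would simply be remap_config_token_ids(mt, tt) after trimming.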

Also, as you noted, most of the special tokens are at the start of the vocabulary, but that does not seem to be the case for the decoder start token id, I think, which is why its index shifts.