AnswerDotAI / bert24

Apache License 2.0

Weight init with non-default tokenizer bug fix #96

Closed ohallstrom closed 3 months ago

ohallstrom commented 3 months ago

Changes

Currently, training with a tokenizer whose vocab size differs from google-bert/bert-base-uncased gives an initial loss that is 10x bigger than expected. This comes from the embedding resizing in the class PreTrainedModel, which the FlexBert classes inherit from, where _init_weights() is called with the embeddings module as the argument. Following #84, _init_weights() in the FlexBert classes overrides the corresponding function with a different signature than in BertPreTrainedModel, and consequently calls to _init_weights() on FlexBert classes from PreTrainedModel give incorrect behavior. This PR enables embedding resizing with correct init, and any other call to _init_weights() from the transformers library now raises an implementation error instead of failing silently.
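For illustration, a minimal sketch of the kind of guard described above. This is not the exact diff in the PR: the _init_weights signature, the ModuleType.emb member, and the type_of_module keyword are assumptions based on the repo's src.bert_layers.initialization helpers.

    import torch.nn as nn
    from transformers import PreTrainedModel

    from src.bert_layers.initialization import ModuleType, init_weights  # repo helpers

    class FlexBertPreTrainedModel(PreTrainedModel):
        def _init_weights(self, module=None):
            if module is None:
                # Called with the FlexBert signature: init the whole model ourselves.
                self._init_module_weights()
            elif isinstance(module, nn.Embedding):
                # Called by PreTrainedModel.resize_token_embeddings() with the parental
                # signature: init the (resized) embedding with the correct scheme.
                init_weights(self.config, module, type_of_module=ModuleType.emb)
            else:
                # Any other call from the transformers library fails loudly instead of
                # silently applying the wrong init.
                raise NotImplementedError(
                    f"_init_weights received an unexpected module: {type(module)}"
                )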

The other call to _init_weights I have identified in the transformers library that can reach the FlexBert classes is the lm_head resizing, which happens when training with a tokenizer whose vocab size differs from google-bert/bert-base-uncased and weights are not tied. Custom init of this module is not as straightforward, as there are several potential nn.Linear modules, and these do not have the same module type in the src.bert_layers.initialization.init_weights function. The FlexBertPreTrainedModel does not keep track of the true vocab size (only the vocab size before resizing), making it harder to identify an lm head. An ugly solution, assuming that any nn.Linear module with hidden_size as input size and an output size bigger than 16k is an lm head, could be added in FlexBertPreTrainedModel._init_module_weights at row 793:

elif isinstance(module, nn.Linear):
    # nn.Linear.weight has shape (out_features, in_features)
    output_dim, input_dim = module.weight.shape
    # UGLY way to make sure that the module is an lm head
    assert input_dim == self.config.hidden_size and output_dim > 16e3, "nn.Linear module must be an lm head"
    init_weights(self.config, module, self.config.hidden_size, ModuleType.final_out)

Let me know if you have any other suggestions. Otherwise I think we can leave the PR as is, since we at least no longer get a silent error for this corner case, which I don't see as very likely at the moment.

Tests

I have run all tests and verified that we get a reasonable loss curve with a different tokenizer than bert-base-uncased using the flex-bert-base.yaml config. I have also verified that training with the bert-base-uncased tokenizer and the flex-bert-base.yaml config gives identical training metrics on this branch and main.

warner-benjamin commented 3 months ago

Currently, training with a tokenizer whose vocab size differs from google-bert/bert-base-uncased gives an initial loss that is 10x bigger than expected. This comes from the embedding resizing in the class PreTrainedModel, which the FlexBert classes inherit from, where _init_weights() is called with the embeddings module as the argument

Manually setting the tokenizer vocab size results in correct embedding initialization and expected initial loss, which is why I haven't noticed this issue. Maybe it would be easier to resize the vocab in the config from the len(tokenizer.vocab) before creating the model?
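For illustration, a minimal sketch of that suggestion; the class names FlexBertConfig and FlexBertForMaskedLM, the import path, and the file paths are placeholders rather than the exact API used in the training script.

    from transformers import AutoTokenizer

    from src.bert_layers import FlexBertConfig, FlexBertForMaskedLM  # assumed import path

    # Size the config to the tokenizer before the model is created, so transformers
    # never needs to resize the embeddings (and never calls _init_weights with the
    # parental signature).
    tokenizer = AutoTokenizer.from_pretrained("path/to/custom-tokenizer")  # placeholder path
    config = FlexBertConfig()  # plus whatever settings the yaml config normally provides

    # len(tokenizer) also counts added tokens, unlike tokenizer.vocab_size
    config.vocab_size = len(tokenizer)

    model = FlexBertForMaskedLM(config)
    assert model.get_input_embeddings().num_embeddings == len(tokenizer)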

ohallstrom commented 3 months ago

Maybe it would be easier to resize the vocab in the config from the len(tokenizer.vocab) before creating the model?

This would work fine in most cases, but if someone would like to extend the tokenizer post-pretraining (add new tokens, support for new languages and whatever) this solution would prevent that. Ofc one could ask if this is really something that will happen in practice (and if in those cases they maybe would instead try to do some more advanced init of the new token embeddings), but we could leave that imagination to the community instead of us setting the limits. Dunno if anyone else has an opinion on this? @bclavie @NohTow

But we can add the len(tokenizer) value to the config and this way identify an lm head, to make sure that resizing also works with untied weights.

NohTow commented 3 months ago

If I understand correctly, @ohallstrom's solution is a bit more complex but can handle adding new tokens to a pre-trained model (loading the weights will break if the layer is already resized). Besides not allowing new tokens for downstream tasks, the simpler approach also prevents starting from a pre-trained model with a smaller vocab.

So I would go with Oskar's solution, with some comments to explain our choices.

warner-benjamin commented 3 months ago

@ohallstrom I trained a quick test model with this PR and I believe it broke the initialization. The top orange line is this PR with Megatron init, and the bottom purple line is main with the tokenizer vocab size set manually, also with Megatron init.

[image: loss curves for the two test runs]

As a sanity check, I trained another test model from main, which resulted in a near-identical loss curve.

ohallstrom commented 3 months ago

I would not expect embedding resizing vs manually setting the vocab size to give identical loss curves, as initializing with resizing means that you first init the embedding matrix with default_vocab_size. When the resizing occurs, you init a new embedding matrix with the full vocab size, but then overwrite the first default_vocab_size look-up values of this new matrix with the values from the original matrix. In other words, I believe it can be viewed as if they were initialized with different seeds. Will verify that this holds later.
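This copy-over behavior can be sanity-checked with a stock Hugging Face BERT (not FlexBert): resize_token_embeddings builds a freshly initialized matrix and then copies the old rows into it.

    import torch
    from transformers import AutoModelForMaskedLM

    model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
    old = model.get_input_embeddings().weight.detach().clone()

    model.resize_token_embeddings(old.shape[0] + 100)  # pretend the tokenizer grew by 100 tokens
    new = model.get_input_embeddings().weight

    print(torch.equal(new[: old.shape[0]], old))  # True: the original rows are preserved
    print(new.shape[0] - old.shape[0])            # 100 newly initialized rows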

I ran Megatron init without resizing using different seeds for a bit more than 100M tokens, and indeed the same init with a different seed can give quite different loss curves. I will verify later that the seed change only impacts weight init (judging from the plots, data ordering at least seems unaffected).

[image: loss curves for different seeds]

As these two runs have different seeds for all weights (not only the embedding), I can do some tests later where I only init the embedding matrix with different seeds to see what an acceptable diff is.

bclavie commented 3 months ago

Hey, chiming in:

This would work fine in most cases, but if someone would like to extend the tokenizer post-pretraining (add new tokens, support for new languages and whatever) this solution would prevent that. Ofc one could ask if this is really something that will happen in practice (and if in those cases they maybe would instead try to do some more advanced init of the new token embeddings), but we could leave that imagination to the community instead of us setting the limits. Dunno if anyone else has an opinion on this? @bclavie @NohTow

I think this is a good point, but I'm not sure it's really going to have an impact? Someone wanting to make such sizeable changes to the tokenizer will (I assume) know what they're doing enough to modify what needs modifying, and any other changes are largely covered by having ~120 unused tokens, which IIRC covers pretty much all the current use cases of token extension.

Overall I think this might be a case of over-optimising with potential side-effects while it's unlikely to be a problem in practice. @ohallstrom Would it be fine with you if we went with @warner-benjamin's solution of

Maybe it would be easier to resize the vocab in the config from the len(tokenizer.vocab) before creating the model?

? This seems like the best way IMO

ohallstrom commented 3 months ago

I have tested this in numerous settings again, and just as before opening the PR, I did not see a noticeable diff between this branch and main when not using resizing. I then tried using the same model config as @warner-benjamin, and it turns out the difference came from new attention features on main, which Warner used and which were not on my branch. When this branch is rebased I get no diff:

[image: loss curves, rebased branch vs main]

And here is the long run (obviously they do not have exactly the same values at every step, as seen at the highlighted step):

[image: long-run loss curves]

When it comes to resizing, that is another question. I misunderstood Warner's first comment, thinking the diff came from resizing. Resizing is not a priority for me, and I am fine with not using it. But I'd be inclined to at least have something that raises an error when _init_weights() is called with its parental signature (in other words, when it is called from PreTrainedModel with a module as argument), so we don't get any unpleasant surprises if we update to a new version of transformers.

ohallstrom commented 3 months ago

I tried resizing with Megatron init, and it does seem to work as expected, similar to my initial tests with the default init. Essentially, it is hard to verify resizing because, like I said previously, it won't give the same init as not resizing (but instead passing the vocab size with the config) with the same seed. This is because with resizing the embedding is first initialized with a smaller size before the resizing, and this affects the init of all other modules that are initialized afterwards (see the toy example below).
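A minimal plain-PyTorch toy sketch of this effect (arbitrary sizes, not the repo's init code): the run that starts with the smaller embedding consumes fewer random draws before the next module is built, so everything initialized afterwards comes out different, just as with a different seed.

    import torch
    import torch.nn as nn

    hidden, small_vocab, full_vocab = 8, 10, 16

    def build(vocab_size, resize_to=None):
        torch.manual_seed(0)
        emb = nn.Embedding(vocab_size, hidden)   # consumes vocab_size * hidden normal draws
        linear = nn.Linear(hidden, hidden)       # initialized from whatever RNG state remains
        if resize_to is not None:
            # Mimic resizing: a new, freshly initialized matrix, with the old rows copied over.
            new_emb = nn.Embedding(resize_to, hidden)
            new_emb.weight.data[:vocab_size] = emb.weight.data
            emb = new_emb
        return emb, linear

    emb_a, lin_a = build(full_vocab)                         # vocab size set correctly up front
    emb_b, lin_b = build(small_vocab, resize_to=full_vocab)  # init small, then resize

    # Both embeddings end up with the full vocab size, but the module built after them
    # already differs, because the two runs were at different RNG states at that point.
    print(torch.equal(lin_a.weight, lin_b.weight))  # expected: False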

So essentially, resizing is expected to match the performance of setting vocab_size manually about as closely as two runs that both set vocab_size manually, but with different seeds, match each other.

And indeed it seems like the loss is within a reasonable margin:

[images: loss curve comparisons]

NohTow commented 3 months ago

Then I think it's good to be merged.