bigscience-workshop / multilingual-modeling

BLOOM+1: Adapting BLOOM model to support a new unseen language
https://arxiv.org/abs/2212.09535
Apache License 2.0

Check extend-vocab functionality; clean up extend-vocab model training #20

Closed haileyschoelkopf closed 2 years ago

haileyschoelkopf commented 2 years ago

@yongzx This is the PR!

I use a function, zero_grad, to zero out the gradients for the original vocab inside the same hook. This means we don't need a separate gradient_mask tensor on the GPU (it would be a big tensor, at least [250880, 1024] in shape!). I also confirmed that this works; see the code on lines 673-675 and 711-720 for how I checked it. (Those comments can be removed once that is verified.)
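For readers who want to see the idea in isolation, here is a minimal sketch of zeroing the original-vocab gradients in a hook. It is not the code from this PR: the toy sizes, the name zero_original_rows, and the plain nn.Embedding stand in for the real model, and it assumes the new tokens are appended after the original rows. A full [250880, 1024] mask would hold about 257 million elements, which is why slicing inside the hook is attractive.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions for illustration; the PR mentions 250880 x 1024).
ORIG_VOCAB = 8
NEW_TOKENS = 4
HIDDEN = 16

torch.manual_seed(0)
embedding = nn.Embedding(ORIG_VOCAB + NEW_TOKENS, HIDDEN)


def zero_original_rows(grad):
    # Hypothetical helper: return a copy of the gradient with the rows of the
    # original vocabulary zeroed, so only the newly added rows get updated.
    # The copy is transient; no persistent mask tensor is kept around.
    grad = grad.clone()
    grad[:ORIG_VOCAB] = 0
    return grad


# Tensor.register_hook runs the function on the gradient before it is stored.
handle = embedding.weight.register_hook(zero_original_rows)

before = embedding.weight.detach().clone()
# Plain SGD without weight decay, so a zero gradient means no update.
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

token_ids = torch.arange(ORIG_VOCAB + NEW_TOKENS)
loss = embedding(token_ids).pow(2).sum()
loss.backward()
optimizer.step()

after = embedding.weight.detach()
orig_unchanged = torch.equal(before[:ORIG_VOCAB], after[:ORIG_VOCAB])
new_changed = not torch.equal(before[ORIG_VOCAB:], after[ORIG_VOCAB:])
print(orig_unchanged, new_changed)

handle.remove()  # detach the hook once it is no longer needed
```

One caveat with this kind of check: an optimizer that applies weight decay (AdamW, for example) still shrinks rows whose gradient is zero, so comparing the original rows before and after a step is only a clean test when weight decay is off for the embedding.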

I also added model.tie_weights(), since it wasn't in your code yet.