Check extend-vocab functionality; clean up extend-vocab model training

@yongzx This is the PR!

I use a function, zero_grad, to zero out the gradients for the original vocab inside the same hook. This means we don't have to allocate a new tensor for gradient_mask on the GPU (it would be a big tensor, at least [250880, 1024] in shape!). I also confirmed that this works: see the code on lines 673-675 and 711-720 for how I checked it. (Those comments can be removed once you've had a look.)
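For anyone reading along, here is a minimal sketch of the idea, not the actual code in this PR. It assumes a plain PyTorch nn.Embedding in place of the real model. The sizes are toy values and the names ORIGINAL_VOCAB_SIZE, NEW_TOKENS and HIDDEN are made up for illustration. The hook zeroes the gradient rows belonging to the original vocab, so only the newly added rows get updated and no persistent mask tensor is kept around.

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes; the real embedding matrix is at least [250880, 1024].
ORIGINAL_VOCAB_SIZE = 8
NEW_TOKENS = 2
HIDDEN = 4

embedding = nn.Embedding(ORIGINAL_VOCAB_SIZE + NEW_TOKENS, HIDDEN)

def zero_grad(grad):
    # Work on a copy, since hooks should not modify their argument in place.
    grad = grad.clone()
    # Zero the gradient rows for the original vocab; new-token rows are kept.
    grad[:ORIGINAL_VOCAB_SIZE] = 0
    return grad

# The hook runs every time a gradient is computed for the embedding weight.
embedding.weight.register_hook(zero_grad)

before = embedding.weight.detach().clone()
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

# One training step that touches every token once.
token_ids = torch.arange(ORIGINAL_VOCAB_SIZE + NEW_TOKENS)
loss = embedding(token_ids).sum()
loss.backward()
optimizer.step()

# Same kind of check as the confirmation mentioned above: old rows frozen, new rows trained.
original_rows_unchanged = torch.equal(
    embedding.weight[:ORIGINAL_VOCAB_SIZE], before[:ORIGINAL_VOCAB_SIZE]
)
new_rows_changed = not torch.equal(
    embedding.weight[ORIGINAL_VOCAB_SIZE:], before[ORIGINAL_VOCAB_SIZE:]
)
print(original_rows_unchanged, new_rows_changed)
```

The clone() in the sketch is the conservative choice and costs a temporary copy of the gradient. Note also that zeroed gradients only keep the original rows fixed if the optimizer applies no weight decay to them.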

I also added model.tie_weights(), since it wasn't in your code yet.
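As a rough illustration of what tying means here (a sketch in plain PyTorch, not what model.tie_weights() does internally): the output projection reuses the input embedding matrix as its weight, so both stay in sync after the vocab is extended. The module names and sizes below are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes and module names, for illustration only.
VOCAB, HIDDEN = 10, 4
input_embeddings = nn.Embedding(VOCAB, HIDDEN)
lm_head = nn.Linear(HIDDEN, VOCAB, bias=False)

# Tie: the output projection shares the embedding matrix as its weight.
lm_head.weight = input_embeddings.weight

weights_are_tied = lm_head.weight is input_embeddings.weight

# A change made through the embedding is visible through the head.
with torch.no_grad():
    input_embeddings.weight[0].fill_(1.0)
head_sees_update = bool(torch.all(lm_head.weight[0] == 1.0))
print(weights_are_tied, head_sees_update)
```

With the weights shared like this, a gradient hook registered on the one shared parameter covers both the input and output side.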

bigscience-workshop / multilingual-modeling #20