We already support the use of pretrained input embeddings. However, output embeddings and layers still have to be retrained. One way to use smaller checkpoints when training larger models (if comparing loss curves doesn't matter) would be to initialise the larger model from the weights of the smaller model by replicating them. As our models always have a fixed width for a given number of devices, loading the checkpoint of a shallower model would be as easy as converting `input_embedding-layer1-layer2-output_embedding` to `input_embedding-layer1-layer2-layer1-layer2-output_embedding`.
This issue tracks the progress of such a scheme, which should give faster convergence by effectively skipping the first thousand or so steps of training.
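As a rough sketch of the conversion described above, the snippet below replicates the layer weights of a shallow checkpoint to initialise a deeper model. The checkpoint layout (a flat dict with keys like `input_embedding`, `layer0/w`, `output_embedding`) is a hypothetical stand-in for whatever format the repo actually uses; only the replication pattern itself is the point.

```python
def replicate_layers(checkpoint: dict, factor: int = 2) -> dict:
    """Build a deeper model's initial weights from a shallow checkpoint
    by repeating its layer stack `factor` times.

    Assumes (hypothetically) keys of the form "layer<N>/<param>" for
    layer weights; everything else (embeddings) is carried over as-is.
    """
    layer_keys = [k for k in checkpoint if k.startswith("layer")]
    # Number of layers in the shallow model (indices assumed 0-based).
    n_layers = 1 + max(int(k.split("/")[0][len("layer"):]) for k in layer_keys)

    deep = {}
    for key, value in checkpoint.items():
        if key.startswith("layer"):
            prefix, _, rest = key.partition("/")
            idx = int(prefix[len("layer"):])
            # Copy this layer's weights into every replicated block, so
            # layer1-layer2 becomes layer1-layer2-layer1-layer2.
            # (With real arrays you would copy the buffers here.)
            for rep in range(factor):
                deep[f"layer{rep * n_layers + idx}/{rest}"] = value
        else:
            deep[key] = value
    return deep
```

For the example in the issue text, `factor=2` maps a two-layer checkpoint onto a four-layer model while leaving both embedding matrices untouched.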