zheyuye opened this issue 4 years ago
In summary, the feature request is that `save_parameters(..., deduplicate=True)` store all the names a shared parameter is known under, so that the resulting parameter file can be loaded for arbitrary variations of the original model in which a different set of parameters is shared.

It's not really a bug, because the same limitation is present in the MXNet 1.x `save_parameters(..., deduplicate=True)`. It's just that, due to an internal implementation change, 1.x stored the first name under which the parameter was known, whereas currently the last name under which the parameter is known is stored.
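A toy sketch of the current behavior (plain Python; assuming deduplication collapses aliases of the same underlying array by inverting the name-to-array dict, so the last alias wins):

```python
import numpy as np

w, b = np.ones((2, 2)), np.zeros(2)
# structural names collected for a model in which l2 shares l1's parameters
params = {'l1.weight': w, 'l1.bias': b, 'l2.weight': w, 'l2.bias': b}
# keep a single name per unique array; later aliases overwrite earlier ones
survivors = {id(v): k for k, v in params.items()}
deduped = {name: params[name] for name in survivors.values()}
print(list(deduped))  # ['l2.weight', 'l2.bias']
```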
Description
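A minimal reproducer sketch, assuming the MXNet 1.x-style `params=` sharing API; the class name `Foo`, the `use_mlm` flag, and the shared layers `l1`/`l2` follow the text below, while the layer sizes are made up:

```python
import mxnet as mx
from mxnet.gluon import nn

class Foo(nn.HybridBlock):
    def __init__(self, use_mlm=True, **kwargs):
        super(Foo, self).__init__(**kwargs)
        with self.name_scope():
            self.l1 = nn.Dense(16, in_units=16)
            if use_mlm:
                # l2 shares its weight and bias with l1
                self.l2 = nn.Dense(16, in_units=16, params=self.l1.params)

    def hybrid_forward(self, F, x):
        x = self.l1(x)
        if hasattr(self, 'l2'):
            x = self.l2(x)
        return x

foo = Foo()
foo.initialize()
foo.save_parameters('foo.params', deduplicate=True)
# inspect which names survived deduplication
print(mx.nd.load('foo.params').keys())
```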
Output:
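```
dict_keys(['l2.weight', 'l2.bias'])
```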
Here `l1` and `l2` are shared, and thanks to the `deduplicate` flag we can save the shared parameters only once, keeping the dictionary correspondence under the last parameter name as key, like `dict_keys(['l2.weight', 'l2.bias'])`. There's nothing wrong with that unless we want to load only part of the parameters, as with `foo2 = Foo(use_mlm=False)` (see the sketch below).
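Loading that file into the variant without `l2` then fails, because none of the stored names match what the new model expects; continuing the hypothetical reproducer from above:

```python
foo2 = Foo(use_mlm=False)           # only l1 exists: expects l1.weight / l1.bias
foo2.load_parameters('foo.params')  # fails: the file only holds l2.* names
```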
Of course, we could avoid this problem by calling `l1` repeatedly instead of creating a separate layer `l2` that shares weights with `l1`.
The following scenario, however, is fairly common in pretrained models with masked language modelling as the pretraining objective:
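A hypothetical sketch of this pattern, reusing the 1.x-style sharing from above; the names `word_embed`, `mlm_decoder`, and `vocab_index` come from the text, while the sizes are made up:

```python
import mxnet as mx
from mxnet.gluon import nn

class PretrainModel(nn.HybridBlock):
    def __init__(self, vocab_size=100, units=32, use_mlm=True, **kwargs):
        super(PretrainModel, self).__init__(**kwargs)
        with self.name_scope():
            self.word_embed = nn.Embedding(vocab_size, units)
            if use_mlm:
                # dense decoder tied to the embedding: maps hidden states
                # back to scores over the vocabulary
                self.mlm_decoder = nn.Dense(vocab_size, flatten=False,
                                            in_units=units,
                                            params=self.word_embed.params)

    def hybrid_forward(self, F, tokens):
        hidden = self.word_embed(tokens)
        if hasattr(self, 'mlm_decoder'):
            return self.mlm_decoder(hidden)
        return hidden

model = PretrainModel()
model.initialize()
model.save_parameters('pretrain.params', deduplicate=True)
print(mx.nd.load('pretrain.params').keys())
# e.g. dict_keys(['mlm_decoder.weight', 'mlm_decoder.bias']):
# word_embed.weight is gone, so a fine-tuning model without mlm_decoder
# cannot load this file
```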
Here `mlm_decoder` is only used in pretraining and would be discarded when fine-tuning on downstream tasks. In the `mlm_decoder`, we usually need to predict the masked tokens by mapping hidden states back to the `vocab_index` through a dense layer whose parameters are shared with `word_embed`. However, saving in this way results in a parameter file without `word_embed.weight`.
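Until `deduplicate` stores all aliases, one possible workaround is to rename the stored key back to the name the fine-tuning model expects before loading; a sketch under the assumptions above (all file and key names are hypothetical):

```python
loaded = mx.nd.load('pretrain.params')
loaded['word_embed.weight'] = loaded.pop('mlm_decoder.weight')
mx.nd.save('finetune.params', loaded)

finetune = PretrainModel(use_mlm=False)
# ignore_extra skips the leftover mlm_decoder.bias entry
finetune.load_parameters('finetune.params', ignore_extra=True)
```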