This PR makes it possible to exclude certain groups of parameters (e.g. embeddings, layer norms) from weight decay. The parameter groups are model-dependent and are defined in the respective model class. Which groups are excluded is set in the training config file.
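To illustrate the idea, here is a minimal sketch of how such an exclusion could work. It is not the code in this PR: the names `PARAM_GROUPS`, `build_param_groups`, and `excluded_groups`, as well as the parameter name patterns, are hypothetical and chosen only for illustration. The returned list of dicts with `params` and `weight_decay` keys follows the per-parameter options format accepted by PyTorch optimizers.

```python
from fnmatch import fnmatchcase

# Hypothetical model-specific definition: group name -> parameter name patterns.
# In the PR this mapping lives in the model class; these patterns are assumptions.
PARAM_GROUPS = {
    "embedding": ["*.wte.weight", "*.wpe.weight"],
    "layernorm": ["*.ln_*.weight", "*.ln_*.bias"],
}


def build_param_groups(named_params, excluded_patterns, weight_decay):
    """Split (name, param) pairs into a decayed and a non-decayed group."""
    decay, no_decay = [], []
    for name, param in named_params:
        # A parameter is excluded if its name matches any excluded pattern.
        if any(fnmatchcase(name, pattern) for pattern in excluded_patterns):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Hypothetical value read from the training config file.
excluded_groups = ["embedding", "layernorm"]
excluded_patterns = [p for g in excluded_groups for p in PARAM_GROUPS[g]]
```

With a real model, `named_params` would typically be `model.named_parameters()`, and the result could be passed directly to an optimizer such as `torch.optim.AdamW`.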