Some questions about the MMoE network

Current state: In PaddleRec's MMoE implementation, every expert is initialized with constants: weight_attr=nn.initializer.Constant(value=0.1), bias_attr=nn.initializer.Constant(value=0.1). See https://github.com/PaddlePaddle/PaddleRec/blob/master/models/multitask/mmoe/net.py#L37
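Problem: In MMoE every expert is fed the same samples (features). The experts can still learn different things because the data has multiple views, but only on one precondition: the experts must start from different weights. Random initialization or any other scheme works, as long as the initial values are not identical. The Paddle implementation makes exactly this critical mistake: it initializes every expert's weights to the same constant, 0.1. Experts that start identical and see identical inputs receive identical gradients at every step, so they end up with the same parameters. The experts then lose their diversity, and the ensemble loses its purpose. The gate initialization has the same kind of problem, though it is less severe than for the experts.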
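The symmetry argument can be reproduced without Paddle. The following is a minimal pure-Python sketch, not the PaddleRec code: `train_toy_mmoe`, the toy regression data, the single-unit experts without bias, and all sizes and learning rates are assumptions made for illustration. It trains the same tiny single-task MMoE twice, once from a constant 0.1 and once from small random values, and checks whether the experts end up identical.

```python
import math
import random


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def train_toy_mmoe(init, n_experts=3, dim=4, epochs=50, lr=0.05):
    """Train a toy single-task MMoE with plain SGD; return expert weights.

    init() supplies the starting value of each scalar parameter.
    Each expert is one linear unit without bias, to keep the sketch short.
    """
    data_rng = random.Random(0)
    data = []
    for _ in range(32):
        x = [data_rng.uniform(-1.0, 1.0) for _ in range(dim)]
        data.append((x, x[0] - 2.0 * x[1]))  # arbitrary regression target

    experts = [[init() for _ in range(dim)] for _ in range(n_experts)]
    gates = [[init() for _ in range(dim)] for _ in range(n_experts)]

    for _ in range(epochs):
        for x, target in data:
            h = [sum(w * xi for w, xi in zip(row, x)) for row in experts]
            z = [sum(g * xi for g, xi in zip(row, x)) for row in gates]
            p = softmax(z)
            y = sum(pe * he for pe, he in zip(p, h))
            err = y - target  # derivative of 0.5 * err**2 w.r.t. y
            for e in range(n_experts):
                grad_h = err * p[e]               # dL/dh_e
                grad_z = err * p[e] * (h[e] - y)  # dL/dz_e through the softmax
                for i in range(dim):
                    experts[e][i] -= lr * grad_h * x[i]
                    gates[e][i] -= lr * grad_z * x[i]
    return experts


# Constant initialization, as in the issue: every parameter starts at 0.1.
const_experts = train_toy_mmoe(lambda: 0.1)

# Random initialization: every parameter starts at a different small value.
init_rng = random.Random(1)
rand_experts = train_toy_mmoe(lambda: init_rng.uniform(-0.1, 0.1))

# True: the constant-initialized experts are still exact copies of each other.
print(all(row == const_experts[0] for row in const_experts))
# Expected to be False: the randomly initialized experts stay distinct.
print(all(row == rand_experts[0] for row in rand_experts))
```

In this sketch the gate gradient contains the factor (h_e - y), which vanishes when all experts produce the same output, so identical experts also leave the gate with nothing to learn.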
Verification: I trained with the code and data published by Paddle and then printed every expert's weights. The output confirms the problem described above. The printed net.state_dict() is condensed below; every tensor has dtype=float32, place=CPUPlace, stop_gradient=False.

expert_0.weight, shape=[499, 16]:
[[0.08814958, 0.08814958, 0.08814958, ..., 0.08814958, 0.08814958, 0.08814958],
 [0.09953024, 0.09953024, 0.09953024, ..., 0.09953024, 0.09953024, 0.09953024],
 [0.09432000, 0.09432000, 0.09432000, ..., 0.09432000, 0.09432000, 0.09432000],
 ...,
 [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000],
 [0.08475424, 0.08475424, 0.08475424, ..., 0.08475424, 0.08475424, 0.08475424],
 [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]]

expert_0.bias, shape=[16]: all 16 entries are 0.08475424

expert_1 through expert_7: the printed weight and bias of each are exactly the same as for expert_0.

gate_0.weight, shape=[499, 8]: every printed entry is 0.10000000
gate_0.bias, shape=[8]: every entry is 0.10000000

tower_0.weight, shape=[16, 8]: every entry is 0.10692330
tower_0.bias, shape=[8]: every entry is 0.10714810

tower_out_0.weight, shape=[8, 2]: all 8 rows are [0.10613869, 0.09386131]
tower_out_0.bias, shape=[2]: [0.10621535, 0.09378465]

gate_1.weight, shape=[499, 8]: every printed entry is 0.10000000
gate_1.bias, shape=[8]: every entry is 0.10000000

tower_1.weight, shape=[16, 8]: every entry is 0.08482549
tower_1.bias, shape=[8]: every entry is 0.08414954

tower_out_1.weight, shape=[8, 2]: all 8 rows are [0.10030183, 0.09969940]
tower_out_1.bias, shape=[2]: [0.08894448, 0.11105615]
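Comparing the printed experts by eye can be automated. The helper below is a sketch under assumptions: `identical_expert_groups` is a hypothetical name, and it expects a plain dict mapping parameter names to nested Python lists rather than Paddle tensors. With Paddle, such a dict could presumably be built with something like `{k: v.numpy().tolist() for k, v in net.state_dict().items()}`. The sample values are the expert biases from the output above.

```python
def identical_expert_groups(state, suffix=".weight"):
    """Group expert parameters whose values are exactly equal.

    state maps parameter names to nested lists of floats.
    Returns a list of groups; each group lists the names sharing one value.
    """
    groups = []
    for name, value in state.items():
        if not (name.startswith("expert_") and name.endswith(suffix)):
            continue
        for group in groups:
            if state[group[0]] == value:
                group.append(name)
                break
        else:
            # No existing group matched, so this parameter starts a new one.
            groups.append([name])
    return groups


# Bias values taken from the printed state_dict in this issue.
state = {
    "expert_0.bias": [0.08475424] * 16,
    "expert_1.bias": [0.08475424] * 16,
    "gate_0.bias": [0.1] * 8,  # ignored: not an expert parameter
}
print(identical_expert_groups(state, suffix=".bias"))
# [['expert_0.bias', 'expert_1.bias']]
```

A single group containing every expert means the experts have collapsed into copies of one another; one group per expert is what a healthy model would be expected to show.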

Please check whether this problem exists, and fix it if it does.

PaddlePaddle / PaddleRec

Some questions about the MMoE network #721