PaddlePaddle / PaddleRec

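Recommendation Algorithm: a large-scale recommendation algorithm library covering classic and recent recommender-system algorithms, including LR, Wide&Deep, DSSM, TDM, MIND, Word2Vec, Bert4Rec, DeepWalk, SSR, AITM, DSIN, SIGN, IPREC, GRU4Rec, Youtube_dnn, NCF, GNN, FM, FFM, DeepFM, DCN, DIN, DIEN, DLRM, MMOE, PLE, ESMM, ESCMM, MAML, xDeepFM, DeepFEFM, NFM, AFM, RALM, DMR, GateNet, NAML, DIFM, Deep Crossing, PNN, BST, AutoInt, FGCNN, FLEN, Fibinet, ListWise, DeepRec, ENSFM, TiSAS, and AutoFIS, together with classic recommender-system datasets such as criteo and movielens.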
https://paddlerec.readthedocs.io/
Apache License 2.0
4.17k stars 717 forks

Some questions about the MMoE network #721

Open tz28 opened 2 years ago

tz28 commented 2 years ago
  1. Current state: In Paddle's MMoE implementation, each expert's weights and bias are initialized with a constant: weight_attr=nn.initializer.Constant(value=0.1), bias_attr=nn.initializer.Constant(value=0.1). See https://github.com/PaddlePaddle/PaddleRec/blob/master/models/multitask/mmoe/net.py#L37

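  2. The problem: In MMoE every expert is fed the same samples (features). The experts can still learn different things because the data is multi-view, but only on one precondition: the experts must start from different weights. The initialization can be random or anything else, as long as it is not identical across experts. Paddle's implementation makes exactly this fatal mistake: it initializes every expert's weights to the same constant (0.1), so all experts converge toward the same parameters. The experts lose their diversity and the ensemble loses its purpose. The gate initialization has the same problem, though it is less severe than for the experts.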

  3. Verification: Using the code and data published by Paddle, I printed each expert's weights after training finished. The result confirms the problem described in 2. Details: ('net.state_dict(): ', OrderedDict([('expert_0.weight', Parameter containing: Tensor(shape=[499, 16], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.08814958, 0.08814958, 0.08814958, ..., 0.08814958, 0.08814958, 0.08814958], [0.09953024, 0.09953024, 0.09953024, ..., 0.09953024, 0.09953024, 0.09953024], [0.09432000, 0.09432000, 0.09432000, ..., 0.09432000, 0.09432000, 0.09432000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.08475424, 0.08475424, 0.08475424, ..., 0.08475424, 0.08475424, 0.08475424], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('expert_0.bias', Parameter containing: Tensor(shape=[16], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424])), ('expert_1.weight', Parameter containing: Tensor(shape=[499, 16], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.08814958, 0.08814958, 0.08814958, ..., 0.08814958, 0.08814958, 0.08814958], [0.09953024, 0.09953024, 0.09953024, ..., 0.09953024, 0.09953024, 0.09953024], [0.09432000, 0.09432000, 0.09432000, ..., 0.09432000, 0.09432000, 0.09432000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.08475424, 0.08475424, 0.08475424, ..., 0.08475424, 0.08475424, 0.08475424], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('expert_1.bias', Parameter containing: Tensor(shape=[16], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424])), ('expert_2.weight', Parameter containing: Tensor(shape=[499, 16], dtype=float32,
place=CPUPlace, stop_gradient=False, [[0.08814958, 0.08814958, 0.08814958, ..., 0.08814958, 0.08814958, 0.08814958], [0.09953024, 0.09953024, 0.09953024, ..., 0.09953024, 0.09953024, 0.09953024], [0.09432000, 0.09432000, 0.09432000, ..., 0.09432000, 0.09432000, 0.09432000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.08475424, 0.08475424, 0.08475424, ..., 0.08475424, 0.08475424, 0.08475424], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('expert_2.bias', Parameter containing: Tensor(shape=[16], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424])), ('expert_3.weight', Parameter containing: Tensor(shape=[499, 16], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.08814958, 0.08814958, 0.08814958, ..., 0.08814958, 0.08814958, 0.08814958], [0.09953024, 0.09953024, 0.09953024, ..., 0.09953024, 0.09953024, 0.09953024], [0.09432000, 0.09432000, 0.09432000, ..., 0.09432000, 0.09432000, 0.09432000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.08475424, 0.08475424, 0.08475424, ..., 0.08475424, 0.08475424, 0.08475424], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('expert_3.bias', Parameter containing: Tensor(shape=[16], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424])), ('expert_4.weight', Parameter containing: Tensor(shape=[499, 16], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.08814958, 0.08814958, 0.08814958, ..., 0.08814958, 0.08814958, 0.08814958], [0.09953024, 0.09953024, 0.09953024, ..., 0.09953024, 0.09953024, 
0.09953024], [0.09432000, 0.09432000, 0.09432000, ..., 0.09432000, 0.09432000, 0.09432000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.08475424, 0.08475424, 0.08475424, ..., 0.08475424, 0.08475424, 0.08475424], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('expert_4.bias', Parameter containing: Tensor(shape=[16], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424])), ('expert_5.weight', Parameter containing: Tensor(shape=[499, 16], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.08814958, 0.08814958, 0.08814958, ..., 0.08814958, 0.08814958, 0.08814958], [0.09953024, 0.09953024, 0.09953024, ..., 0.09953024, 0.09953024, 0.09953024], [0.09432000, 0.09432000, 0.09432000, ..., 0.09432000, 0.09432000, 0.09432000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.08475424, 0.08475424, 0.08475424, ..., 0.08475424, 0.08475424, 0.08475424], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('expert_5.bias', Parameter containing: Tensor(shape=[16], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424])), ('expert_6.weight', Parameter containing: Tensor(shape=[499, 16], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.08814958, 0.08814958, 0.08814958, ..., 0.08814958, 0.08814958, 0.08814958], [0.09953024, 0.09953024, 0.09953024, ..., 0.09953024, 0.09953024, 0.09953024], [0.09432000, 0.09432000, 0.09432000, ..., 0.09432000, 0.09432000, 0.09432000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], 
[0.08475424, 0.08475424, 0.08475424, ..., 0.08475424, 0.08475424, 0.08475424], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('expert_6.bias', Parameter containing: Tensor(shape=[16], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424])), ('expert_7.weight', Parameter containing: Tensor(shape=[499, 16], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.08814958, 0.08814958, 0.08814958, ..., 0.08814958, 0.08814958, 0.08814958], [0.09953024, 0.09953024, 0.09953024, ..., 0.09953024, 0.09953024, 0.09953024], [0.09432000, 0.09432000, 0.09432000, ..., 0.09432000, 0.09432000, 0.09432000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.08475424, 0.08475424, 0.08475424, ..., 0.08475424, 0.08475424, 0.08475424], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('expert_7.bias', Parameter containing: Tensor(shape=[16], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424, 0.08475424])), ('gate_0.weight', Parameter containing: Tensor(shape=[499, 8], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('gate_0.bias', Parameter 
containing: Tensor(shape=[8], dtype=float32, place=CPUPlace, stop_gradient=False, [0.10000000, 0.10000000, 0.10000000, 0.10000000, 0.10000000, 0.10000000, 0.10000000, 0.10000000])), ('tower_0.weight', Parameter containing: Tensor(shape=[16, 8], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330], [0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330, 0.10692330]])), ('tower_0.bias', Parameter containing: Tensor(shape=[8], dtype=float32, place=CPUPlace, stop_gradient=False, [0.10714810, 
0.10714810, 0.10714810, 0.10714810, 0.10714810, 0.10714810, 0.10714810, 0.10714810])), ('tower_out_0.weight', Parameter containing: Tensor(shape=[8, 2], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.10613869, 0.09386131], [0.10613869, 0.09386131], [0.10613869, 0.09386131], [0.10613869, 0.09386131], [0.10613869, 0.09386131], [0.10613869, 0.09386131], [0.10613869, 0.09386131], [0.10613869, 0.09386131]])), ('tower_out_0.bias', Parameter containing: Tensor(shape=[2], dtype=float32, place=CPUPlace, stop_gradient=False, [0.10621535, 0.09378465])), ('gate_1.weight', Parameter containing: Tensor(shape=[499, 8], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], ..., [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000], [0.10000000, 0.10000000, 0.10000000, ..., 0.10000000, 0.10000000, 0.10000000]])), ('gate_1.bias', Parameter containing: Tensor(shape=[8], dtype=float32, place=CPUPlace, stop_gradient=False, [0.10000000, 0.10000000, 0.10000000, 0.10000000, 0.10000000, 0.10000000, 0.10000000, 0.10000000])), ('tower_1.weight', Parameter containing: Tensor(shape=[16, 8], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 
0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549], [0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549, 0.08482549]])), ('tower_1.bias', Parameter containing: Tensor(shape=[8], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08414954, 0.08414954, 0.08414954, 0.08414954, 0.08414954, 0.08414954, 0.08414954, 0.08414954])), ('tower_out_1.weight', Parameter containing: Tensor(shape=[8, 2], dtype=float32, place=CPUPlace, stop_gradient=False, [[0.10030183, 0.09969940], [0.10030183, 0.09969940], [0.10030183, 0.09969940], [0.10030183, 0.09969940], [0.10030183, 0.09969940], [0.10030183, 0.09969940], [0.10030183, 0.09969940], [0.10030183, 0.09969940]])), ('tower_out_1.bias', Parameter containing: Tensor(shape=[2], dtype=float32, place=CPUPlace, stop_gradient=False, [0.08894448, 0.11105615]))]))

Please verify whether this problem exists, and fix it if it does.
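
The symmetry argument in point 2 can be reproduced on a toy model. What follows is a minimal pure-Python sketch, not PaddleRec's code: a tiny MMoE-style network with one softmax gate, ReLU experts without biases, and a linear tower, trained with hand-derived SGD on a squared loss. The names `make_params`, `sgd_step`, and `train`, the layer sizes, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def make_params(num_experts, in_dim, hid_dim, init):
    """Build toy parameters; init() returns one scalar per call."""
    experts = [[[init() for _ in range(hid_dim)] for _ in range(in_dim)]
               for _ in range(num_experts)]              # experts[k][i][j]
    gate = [[init() for _ in range(num_experts)]
            for _ in range(in_dim)]                      # gate[i][k]
    tower = [init() for _ in range(hid_dim)]             # tower[j]
    return experts, gate, tower


def sgd_step(params, x, target, lr):
    """One SGD step on 0.5 * (y - target)^2, updating params in place."""
    experts, gate, tower = params
    K, D, H = len(experts), len(x), len(tower)

    # Forward pass: ReLU experts, softmax gate, gated mixture, linear tower.
    pre = [[sum(x[i] * experts[k][i][j] for i in range(D)) for j in range(H)]
           for k in range(K)]
    hid = [[max(0.0, v) for v in row] for row in pre]
    logits = [sum(x[i] * gate[i][k] for i in range(D)) for k in range(K)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    g = [e / total for e in exps]
    mix = [sum(g[k] * hid[k][j] for k in range(K)) for j in range(H)]
    y = sum(tower[j] * mix[j] for j in range(H))

    # Backward pass, computed before any parameter is modified.
    dy = y - target
    d_mix = [dy * tower[j] for j in range(H)]
    s = [sum(d_mix[j] * hid[k][j] for j in range(H)) for k in range(K)]
    s_bar = sum(g[k] * s[k] for k in range(K))
    d_logit = [g[k] * (s[k] - s_bar) for k in range(K)]  # softmax gradient

    for j in range(H):
        tower[j] -= lr * dy * mix[j]
    for k in range(K):
        for j in range(H):
            d_pre = g[k] * d_mix[j] if pre[k][j] > 0 else 0.0
            for i in range(D):
                experts[k][i][j] -= lr * x[i] * d_pre
        for i in range(D):
            gate[i][k] -= lr * x[i] * d_logit[k]


def train(init, steps=200):
    """Train the toy model on synthetic data and return its parameters."""
    params = make_params(num_experts=3, in_dim=4, hid_dim=2, init=init)
    data_rng = random.Random(0)
    for _ in range(steps):
        x = [data_rng.random() for _ in range(4)]
        target = x[0] - 2.0 * x[1]                       # arbitrary toy target
        sgd_step(params, x, target, lr=0.05)
    return params


const_experts = train(lambda: 0.1)[0]                    # constant init
init_rng = random.Random(1)
rand_experts = train(lambda: init_rng.uniform(-0.1, 0.1))[0]

identical_with_constant = all(e == const_experts[0] for e in const_experts)
identical_with_random = all(e == rand_experts[0] for e in rand_experts)
print("constant init, experts identical:", identical_with_constant)
print("random init, experts identical:", identical_with_random)
```

In this sketch every expert sees the same input, the same gate weight, and the same upstream gradient, so identical starting weights receive identical updates at every step. The gate's gradient is proportional to the difference between each expert's contribution and the gated average, which vanishes when the experts are equal; that would be consistent with the gate weights in the dump above still sitting at 0.10000000 after training.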

frankwhzhang commented 2 years ago

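Thanks for raising this. The cause is that, at the time, we needed every training run to produce the same result, so we switched from random initialization to a fixed value. We will fix this in a follow-up.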
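
Run-to-run reproducibility does not require constant weights: drawing random weights from a seeded generator gives the same values on every run while still breaking the symmetry between experts. The sketch below shows the idea in plain Python. The helper `init_experts`, the sizes, the range, and the seed value are hypothetical; in Paddle the analogous step would presumably be fixing the global seed (for example with `paddle.seed`) before the layers are constructed, which is an assumption here rather than the fix the maintainers adopted.

```python
import random


def init_experts(num_experts, in_dim, hid_dim, seed):
    """Draw every expert weight from one privately seeded generator."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [[[rng.uniform(-0.1, 0.1) for _ in range(hid_dim)]
             for _ in range(in_dim)]
            for _ in range(num_experts)]


# Two independent "runs" with the same seed.
run_a = init_experts(num_experts=8, in_dim=4, hid_dim=3, seed=2021)
run_b = init_experts(num_experts=8, in_dim=4, hid_dim=3, seed=2021)

reproducible = run_a == run_b
all_distinct = all(run_a[i] != run_a[j]
                   for i in range(8) for j in range(i + 1, 8))
print("same weights across runs:", reproducible)
print("experts differ from each other:", all_distinct)
```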