Closed dhayanesh closed 1 year ago
Hi, I'm training the DLRM model in two different settings, and I'm seeing a difference in the embedding values after a certain epoch. Can you provide some insights into this?
Sample embeddings, from `dlrm.emb_l.state_dict()`.

In single-node multi-GPU mode:

```
OrderedDict([('0.weight', tensor([[ 0.19507, -0.21561],
                                  [-0.27545,  0.04792],
                                  [ 0.21673, -0.08091],
                                  [ 0.47925,  0.18253]])),
             ('1.weight', tensor([[-0.02116, -0.12329],
                                  [-0.17937,  0.26646],
                                  [-0.06997, -0.50720]])),
             ('2.weight', tensor([[-0.14307,  0.33310],
                                  [-0.44806, -0.46139]]))])
```

In 2-CPU mode:

```
OrderedDict([('0.weight', tensor([[ 0.19971, -0.21732],
                                  [-0.26437,  0.04637],
                                  [ 0.22765, -0.08109],
                                  [ 0.48709,  0.18019]])),
             ('1.weight', tensor([[-0.03650, -0.12445],
                                  [-0.20467,  0.26369],
                                  [-0.08280, -0.50890]]))])
```
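For reference, the divergence between the two runs is on the order of 1e-2. A quick plain-Python sketch to quantify it, using the `'0.weight'` values pasted above (in a real session you would compare the full tensors, e.g. with `torch.allclose`):

```python
# '0.weight' values copied from the two state_dicts above.
multi_gpu = [[0.19507, -0.21561], [-0.27545, 0.04792],
             [0.21673, -0.08091], [0.47925, 0.18253]]
two_cpu = [[0.19971, -0.21732], [-0.26437, 0.04637],
           [0.22765, -0.08109], [0.48709, 0.18019]]

# Largest element-wise absolute difference between the two tables.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(multi_gpu, two_cpu)
               for a, b in zip(row_a, row_b))
print(f"max abs difference: {max_diff:.5f}")  # → max abs difference: 0.01108
```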
I would start by checking whether the randomly initialized embeddings are the same across these two settings.
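One way to check this is to fix all random seeds in both settings before model construction and then diff `dlrm.emb_l.state_dict()` right after construction, before any training step. A minimal pure-Python sketch of the idea (the `init_embedding` helper is a hypothetical stand-in for the embedding-table init; in the actual runs you would seed torch/numpy identically in both settings, e.g. via DLRM's seed flag if your version exposes one):

```python
import random

def init_embedding(num_rows, dim, seed):
    # Hypothetical stand-in for an embedding-table initializer:
    # given the same seed, the generated values are identical across runs.
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
            for _ in range(num_rows)]

run_a = init_embedding(4, 2, seed=123)  # e.g. the multi-GPU run
run_b = init_embedding(4, 2, seed=123)  # e.g. the 2-CPU run
assert run_a == run_b  # identical seeds -> identical initial weights
```

If the initial embeddings already differ, the divergence is explained by initialization rather than by the training dynamics of the two parallelism modes.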