facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License
3.71k stars 825 forks

Embedding values in different training environment #351

Closed dhayanesh closed 1 year ago

dhayanesh commented 1 year ago

Hi, I'm training the DLRM model in two different settings:

  1. Single node with multiple GPUs
  2. Multiple nodes with 2 CPUs

I'm seeing a difference in the embedding values after a certain number of epochs between the two settings. Can you provide some insight into this?

Sample embeddings: in single-node multi-GPU mode, `dlrm.emb_l.state_dict()` gives:

```python
OrderedDict([('0.weight', tensor([[ 0.19507, -0.21561], [-0.27545,  0.04792], [ 0.21673, -0.08091], [ 0.47925,  0.18253]])),
             ('1.weight', tensor([[-0.02116, -0.12329], [-0.17937,  0.26646], [-0.06997, -0.50720]])),
             ('2.weight', tensor([[-0.14307,  0.33310], [-0.44806, -0.46139]]))])
```

In 2-CPU mode:

```python
OrderedDict([('0.weight', tensor([[ 0.19971, -0.21732], [-0.26437,  0.04637], [ 0.22765, -0.08109], [ 0.48709,  0.18019]])),
             ('1.weight', tensor([[-0.03650, -0.12445], [-0.20467,  0.26369], [-0.08280, -0.50890]]))])
```
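To quantify the drift, one option is to compare the two saved state dicts key by key. This is a minimal sketch, not DLRM API: `max_abs_diff` is a hypothetical helper, and it assumes the state dicts have been loaded as plain arrays (e.g. via `tensor.numpy()`):

```python
import numpy as np

def max_abs_diff(sd_a, sd_b):
    """Per-table max absolute difference between two embedding state dicts.

    sd_a, sd_b: dicts mapping '<table>.weight' -> array-like embedding weights
    (hypothetical helper; assumes tensors were already converted to arrays).
    """
    common = sd_a.keys() & sd_b.keys()  # only compare tables present in both dumps
    return {k: float(np.max(np.abs(np.asarray(sd_a[k]) - np.asarray(sd_b[k]))))
            for k in sorted(common)}

# First rows of table '0.weight' from the two dumps above
a = {'0.weight': [[0.19507, -0.21561], [-0.27545, 0.04792]]}
b = {'0.weight': [[0.19971, -0.21732], [-0.26437, 0.04637]]}
print(max_abs_diff(a, b))  # per-table drift magnitude
```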

mnaumovfb commented 1 year ago

I would start by checking whether the randomly initialized embeddings are the same across these two settings.
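One way to run that check, as a sketch: fix the random seed before construction and verify that two initializations produce identical tables. The helper below is hypothetical and only mimics a DLRM-style uniform init, `U(-sqrt(1/n), sqrt(1/n))` per table; it is not the repo's `create_emb`:

```python
import numpy as np

def init_emb_tables(seed, table_sizes, dim):
    """Hypothetical helper: build embedding tables with a DLRM-style uniform init."""
    rng = np.random.default_rng(seed)
    return [rng.uniform(-np.sqrt(1.0 / n), np.sqrt(1.0 / n), size=(n, dim)).astype(np.float32)
            for n in table_sizes]

# Same seed and table sizes as the dumps above (4, 3, 2 rows; dim 2)
a = init_emb_tables(123, [4, 3, 2], 2)
b = init_emb_tables(123, [4, 3, 2], 2)
# Identical seeds should yield bitwise-identical initial tables
print(all(np.array_equal(x, y) for x, y in zip(a, b)))
```

If the initial tables already differ between the two environments, the divergence is a seeding/initialization issue; if they match, the divergence accumulates during training (e.g. from different reduction orders or numerics across backends).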

hrwleo commented 1 year ago

[Auto-reply, translated from Chinese] Email received, arigatou gozaimasu! If I don't reply in time, please call 15868848097. QQ: 812737452