facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License
3.71k stars 825 forks

Embedding values in different training environment #351

Closed dhayanesh closed 1 year ago

dhayanesh commented 1 year ago

Hi, I'm training the DLRM model in two different settings:

  1. Single node with multiple GPUs
  2. Multiple nodes with 2 CPUs

I'm seeing a difference in the embedding values after a certain number of epochs between the two settings. Can you provide some insight into this?

Sample embeddings: in single-node multi-GPU mode, `dlrm.emb_l.state_dict()` gives:

```python
OrderedDict([('0.weight', tensor([[ 0.19507, -0.21561], [-0.27545,  0.04792], [ 0.21673, -0.08091], [ 0.47925,  0.18253]])),
             ('1.weight', tensor([[-0.02116, -0.12329], [-0.17937,  0.26646], [-0.06997, -0.50720]])),
             ('2.weight', tensor([[-0.14307,  0.33310], [-0.44806, -0.46139]]))])
```

In 2-CPU mode:

```python
OrderedDict([('0.weight', tensor([[ 0.19971, -0.21732], [-0.26437,  0.04637], [ 0.22765, -0.08109], [ 0.48709,  0.18019]])),
             ('1.weight', tensor([[-0.03650, -0.12445], [-0.20467,  0.26369], [-0.08280, -0.50890]]))])
```
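To quantify the drift, one option is to compare the two saved state dicts key by key. This is a minimal sketch, not DLRM API: `max_abs_diff` is a hypothetical helper, and it assumes the state dicts have been loaded as plain arrays (e.g. via `tensor.numpy()`):

```python
import numpy as np

def max_abs_diff(sd_a, sd_b):
    """Per-table max absolute difference between two embedding state dicts.

    sd_a, sd_b: dicts mapping '<table>.weight' -> array-like embedding weights
    (hypothetical helper; assumes tensors were already converted to arrays).
    """
    common = sd_a.keys() & sd_b.keys()  # only compare tables present in both dumps
    return {k: float(np.max(np.abs(np.asarray(sd_a[k]) - np.asarray(sd_b[k]))))
            for k in sorted(common)}

# First rows of table '0.weight' from the two dumps above
a = {'0.weight': [[0.19507, -0.21561], [-0.27545, 0.04792]]}
b = {'0.weight': [[0.19971, -0.21732], [-0.26437, 0.04637]]}
print(max_abs_diff(a, b))  # per-table drift magnitude
```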

mnaumovfb commented 1 year ago

I would start by checking whether the randomly initialized embeddings are the same across these two settings.
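One way to run that check, as a sketch: fix the random seed before construction and verify that two initializations produce identical tables. The helper below is hypothetical and only mimics a DLRM-style uniform init, `U(-sqrt(1/n), sqrt(1/n))` per table; it is not the repo's `create_emb`:

```python
import numpy as np

def init_emb_tables(seed, table_sizes, dim):
    """Hypothetical helper: build embedding tables with a DLRM-style uniform init."""
    rng = np.random.default_rng(seed)
    return [rng.uniform(-np.sqrt(1.0 / n), np.sqrt(1.0 / n), size=(n, dim)).astype(np.float32)
            for n in table_sizes]

# Same seed and table sizes as the dumps above (4, 3, 2 rows; dim 2)
a = init_emb_tables(123, [4, 3, 2], 2)
b = init_emb_tables(123, [4, 3, 2], 2)
# Identical seeds should yield bitwise-identical initial tables
print(all(np.array_equal(x, y) for x, y in zip(a, b)))
```

If the initial tables already differ between the two environments, the divergence is a seeding/initialization issue; if they match, the divergence accumulates during training (e.g. from different reduction orders or numerics across backends).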

hrwleo commented 1 year ago

[Auto-reply, translated from Chinese] Email received, arigatou gozaimasu! If I don't reply in time, please call 15868848097. QQ: 812737452