How to do asynchronous distributed training with DLRM?

facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)

MIT License


How to do asynchronous distributed training with DLRM? #297

Closed PavithranRick closed 1 year ago

PavithranRick commented 1 year ago

I'm trying to convert DLRM's synchronous distributed training to asynchronous distributed training. I understand that asynchronous training might not reach the same model accuracy as synchronous training.

Could you please suggest some starting points for this sync-to-async conversion?

P.S.: We removed the blocking wait points in the code and tried running it, but this results in NaN values for the gradients.

mnaumovfb commented 1 year ago

I would start by (i) replacing each synchronous call with an asynchronous call followed by an immediate wait, and then (ii) moving the wait calls farther from the original point to the desired location.

If (i) is producing a NaN, you need to understand whether the async+wait combination is doing what you expect. Otherwise, if the NaN only appears in (ii), you need to investigate what happens between the async and wait calls.
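
The two steps above can be illustrated with a minimal, library-agnostic sketch. It uses only Python's standard library, and `async_sum_reduce` is a hypothetical stand-in for an asynchronous collective (for example, a `torch.distributed` collective called with `async_op=True`, which returns a handle with a `wait()` method). It is not DLRM code. Here `handle.result()` plays the role of `wait()`, and the buffer is pre-filled with NaN so that a read before the wait is easy to spot.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=1)


def async_sum_reduce(buf, contributions):
    """Hypothetical stand-in for an async collective.

    Fills `buf` in place in a background thread and returns a handle;
    calling handle.result() blocks until the buffer is ready (like wait()).
    """
    def work():
        time.sleep(0.05)  # simulated communication latency
        for i in range(len(buf)):
            buf[i] = sum(c[i] for c in contributions)

    return _pool.submit(work)


contributions = [[1.0, 2.0], [3.0, 4.0]]  # one list per simulated rank

# Step (i): asynchronous call followed by an immediate wait.
# This should behave exactly like the original synchronous call.
buf_i = [math.nan, math.nan]
handle = async_sum_reduce(buf_i, contributions)
handle.result()
step_i = list(buf_i)

# Step (ii): move the wait later and overlap independent work.
# Nothing between the call and the wait may read or write buf_ii.
buf_ii = [math.nan, math.nan]
handle = async_sum_reduce(buf_ii, contributions)
overlapped = sum(x * x for x in range(1000))  # does not touch buf_ii
handle.result()
step_ii = list(buf_ii)

# The failure mode: the wait is removed or placed after the first use,
# so the buffer may still hold its NaN placeholder when it is read.
buf_bad = [math.nan, math.nan]
handle = async_sum_reduce(buf_bad, contributions)
premature = list(buf_bad)  # read before the wait: contents are not reliable
handle.result()

_pool.shutdown()
print(step_i, step_ii, premature)
```

If step (i) already gives a different result from the synchronous version, the problem is in the async call itself. If only step (ii) differs, something between the call and the wait is touching the buffer too early, which is consistent with the NaN gradients reported after the wait points were removed.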

mnaumovfb commented 1 year ago

I'm assuming this is resolved. Closing.