Closed PavithranRick closed 1 year ago
I would start by (i) replacing each synchronous call with an asynchronous call followed by an immediate wait, and then (ii) moving the wait calls away from the original point to the desired location.
If (i) already produces NaN, you need to check whether the async+wait combination is doing what you expect. Otherwise, if NaN only appears after (ii), you need to investigate what happens between the async and wait calls.
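To make the two stages concrete, here is a minimal sketch using only the Python standard library. It is not DLRM code: `all_reduce_sum`, `step_sync`, `step_async_immediate_wait`, `step_async_deferred_wait` and `has_nan` are hypothetical names, and a `ThreadPoolExecutor` future stands in for the handle you get from a collective such as `torch.distributed.all_reduce(..., async_op=True)`, whose `wait()` plays the role of `result()` here.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def all_reduce_sum(per_worker_grads):
    """Hypothetical stand-in for a collective: element-wise sum across workers."""
    return [sum(vals) for vals in zip(*per_worker_grads)]


def has_nan(values):
    """Return True if any entry is NaN."""
    return any(math.isnan(v) for v in values)


def step_sync(grads):
    # Original behaviour: a blocking call.
    return all_reduce_sum(grads)


def step_async_immediate_wait(pool, grads):
    # Stage (i): launch asynchronously, then wait right away.
    # This should match step_sync; if it does not, the async+wait pair is wrong.
    handle = pool.submit(all_reduce_sum, grads)
    return handle.result()


def step_async_deferred_wait(pool, grads, other_work):
    # Stage (ii): the wait has moved further down. Whatever runs in between
    # must not read the reduced result or modify `grads`.
    handle = pool.submit(all_reduce_sum, grads)
    other = other_work()
    reduced = handle.result()  # wait before the result is first used
    return reduced, other


grads = [[0.1, 0.2], [0.3, 0.4]]  # made-up gradients from two workers
with ThreadPoolExecutor(max_workers=1) as pool:
    baseline = step_sync(grads)
    stage_i = step_async_immediate_wait(pool, grads)
    stage_ii, _ = step_async_deferred_wait(pool, grads, lambda: sum(range(1000)))

print(stage_i == baseline, stage_ii == baseline, has_nan(stage_ii))
```

Comparing each stage against the synchronous baseline and running a NaN check after every wait tells you which stage introduced the problem. Deleting the wait entirely, rather than moving it, means the result may be read before the communication has finished.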
I'm assuming this is resolved. Closing.
I'm trying to convert the synchronous distributed training of DLRM to asynchronous distributed training. I understand that asynchronous distributed training might not produce a model with the same accuracy as synchronous distributed training.
Could you please suggest some starting points for this sync-to-async distributed training conversion?
P.S. We removed the blocking wait points in the code and tried running it, but this results in NaN values for the gradients.