joe-redstone closed this issue 6 years ago
Yes, Hogwild training is a lock-free approach to parallelizing. GPUs are great for matrix computation, but they require locks when sharing parameters. This style of training exploits an advantage CPUs have over GPUs: you can share parameters without locks. This does allow bad updates, but they happen fairly infrequently and may actually make the model more robust, since they add some extra noise. Overall, any negative is far outweighed by the increased rate of updates, since acquiring and releasing locks takes time. I know this can be hard to trust (I myself wasted a lot of time comparing the with-locks and without-locks approaches), but trust in Hogwild training. You are correct that per-update efficiency is poor due to overwrites, but the number of updates per second increases so drastically that overall learning is much faster.
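As a rough illustration of the lock-free idea (this is a sketch using plain Python `multiprocessing` as a stand-in for `torch.multiprocessing`, with a shared array of floats standing in for model parameters; `worker` and `run_demo` are made-up names), several workers can hammer on the same shared memory with no lock at all:

```python
import multiprocessing as mp

def worker(shared, n_updates):
    # Hogwild-style: read-modify-write the shared parameters with NO lock,
    # so concurrent updates can race and occasionally overwrite each other.
    for _ in range(n_updates):
        for i in range(len(shared)):
            shared[i] += 0.1  # stand-in for applying one gradient step

def run_demo(n_workers=4, n_updates=100):
    # lock=False gives raw shared memory with no synchronization at all
    params = mp.Array('d', [0.0] * 4, lock=False)
    procs = [mp.Process(target=worker, args=(params, n_updates))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return list(params)

if __name__ == "__main__":
    # Because updates race, some of the 4 * 100 increments per slot may be
    # lost: each value ends up at most ~40.0, possibly a bit less.
    print(run_demo())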
The ensure_shared_grads function just makes sure the shared gradient state is initialized correctly at the beginning of training; after that it really shouldn't be a factor.
Sorry, kinda in a rush to get somewhere, so I wrote this quickly. Will address more a little later.
Thanks for your reply. I think I understand the logic of the ensure_shared_grads function better now. shared_param itself is actually shared by the workers, but shared_param.grad is NOT shared; it is an alias/reference to the local copy of param.grad. When param.grad gets updated, shared_param.grad gets updated automatically. As you said, once it is initially set, it doesn't need to be set again.
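That aliasing behavior can be mimicked with plain numpy arrays (numpy is a stand-in for torch tensors here; `local_grad` and `shared_grad` are made-up names for `param.grad` and `shared_param.grad`):

```python
import numpy as np

# Stand-in for param.grad, the worker-local gradient buffer.
local_grad = np.zeros(3)

# Like `shared_param._grad = param.grad`: no copy is made, this is just
# another reference to the same underlying buffer.
shared_grad = local_grad

# The worker updates its local gradient in place...
local_grad += np.array([1.0, 2.0, 3.0])

# ...and the "shared" side sees the update automatically.
print(shared_grad)                # [1. 2. 3.]
print(shared_grad is local_grad)  # True
```

This is why the aliasing only needs to be set up once: every later in-place gradient update is visible through the shared reference for free.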
Browsing the code, I can't help noticing that there is no synchronization among workers, i.e., no Lock mechanism to coordinate the updating of shared_model by different workers. Is this how the "hogwild" algorithm works? I have browsed several PyTorch implementations of A3C, and all seem to share the same model-updating mechanism. Here are my specific questions; I wonder if you can enlighten me or confirm my understanding:
```python
def ensure_shared_grads(model, shared_model):
    for param, shared_param in zip(model.parameters(),
                                   shared_model.parameters()):
        if shared_param.grad is not None:
            # Already aliased on a previous call, nothing left to do.
            return
        shared_param._grad = param.grad  # point the shared grad at the local grad
```
Unless my understanding of the above code is wrong, shared_model keeps getting its new grad from worker A. However, it is possible that before optim.step() is executed in worker A, worker B will have updated the grad for shared_model too (partially or completely). So by the time optim.step() finishes, the grad used to update the parameters could come from worker A, from worker B, or from a mix of both. Is that true?
If my statement above is true, then this way of updating model parameters seems very inefficient. The original A3C paper seems to mention periodic synchronization rather than syncing at all times. That might help improve stability and convergence speed. Just a thought.
Thank you.