Pedrexus opened this issue 2 years ago
Hi @Pedrexus,
First of all, thanks a lot for this! I think this is a super important feature that is indeed really missing from VISSL.
Now, reading the code, and based on the description of the unreliable multi-node training, I think the issue might come from the usage of the `is_primary` function and the `PersistentDict`.
What might happen is a race condition that leads the workers not to quit and to get stuck: because of the way multi-node training is done, it proceeds in lockstep, where each worker has synchronisation points with the others.
I propose we try something that relies on standard PyTorch and might make it work better: encode the stop decision in a tensor and call `torch.distributed.all_reduce` on this tensor (with `sum` as the reducer), so that every worker ends up with the same decision. Could you please try something like this and tell me what happens?
Thank you again, Quentin
Hello @QuentinDuval,
thanks for the reply. This is a good idea, and I will try to implement it as soon as possible.
Thanks, Pedro
🚀 Feature
I wish to integrate Early Stopping into VISSL
Motivation & Examples
Early Stopping is a useful mechanism, already integrated in several libraries and frameworks, which can help when training several models for many epochs.
https://en.wikipedia.org/wiki/Early_stopping
Note
This is actually a request for assistance: I already have a working Early Stopping hook, but it has not been very reliable in multi-GPU scenarios, in which the training just gets stuck when it stops. Could you help me solve this problem?
Example of my `early_stopping_hook.py` file:
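A minimal sketch of what such a hook looks like, assuming VISSL's `ClassyHook` interface; the loss-based metric, the patience logic, and the `stop_task` body are illustrative, and plain attributes stand in for the `PersistentDict` mentioned below:

```python
from classy_vision.generic.distributed_util import is_primary  # import path assumed
from classy_vision.hooks import ClassyHook


class EarlyStoppingHook(ClassyHook):
    # Unused hook points are explicit no-ops, as ClassyHook requires.
    on_start = ClassyHook.nothing
    on_phase_start = ClassyHook.nothing
    on_step = ClassyHook.nothing
    on_end = ClassyHook.nothing

    def __init__(self, patience: int = 5):
        super().__init__()
        self.patience = patience
        # In the real hook this state lives in a PersistentDict (a shelve-like
        # store); plain attributes keep the sketch self-contained.
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def on_phase_end(self, task):
        # Only the primary worker sees the decision: this is the suspected
        # source of the hang, since the other ranks never learn about it and
        # keep waiting at their next synchronisation point.
        if not is_primary():
            return
        last_loss = task.losses[-1] if task.losses else float("inf")  # illustrative metric
        if last_loss < self.best_loss:
            self.best_loss = last_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs >= self.patience:
            self.stop_task(task)

    def stop_task(self, task):
        # Hypothetical: signal the trainer to end training here. The exact
        # mechanism depends on VISSL internals and is what this issue asks about.
        ...
```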
The `PersistentDict` works just like Python's builtin `shelve`. I believe it might just require tweaking the `.stop_task(task)` method, but I have not been able to do it so far.