Closed · mmaaz60 closed this 3 years ago
self.means was not assigned before calling self.clstr_loss_l2_cdist because of an inconsistency in the clustering start iteration. Please refer to this. Do you still feel that this fix is required?
Thank you,
It seems like the reason for this error in issue 41 was different and does not require this fix. However, I got this error at different stages; the culprit was the same (i.e. an inconsistency in the clustering start iteration) but the reasons were different. Details are below:
1. In Task 1, if you want to start the clustering from the 0th iteration, you would set START_ITER: -1 in Base-RCNN-C4-OWOD.yaml. This will create the same error because in the 0th iteration we will attempt to access the self.means entries, which are all None at that moment. Hence, the current code has the constraint START_ITER >= 0, which means at least 1 iteration before you start clustering.
2. If you resume the training of a task (e.g. t4_train.yaml), the code will load the last saved checkpoint and set storage.iter > self.clustering_start_iter. Again, as it is the first iteration after resuming and self.means has not been set from feature_store yet, it will trigger the same error. In fact, for me, this error occurred when I resumed my Task 4 training after it crashed because of a CUDA out-of-memory error (the CUDA memory error was not caused by this codebase).
Furthermore, the solution I implemented is one of several possible solutions and it might not be the best. Please guide me if I am missing anything.
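For reference, a minimal sketch of the kind of guard being discussed (illustrative only; this is not the code from the repository or the exact change in this PR, and only the names self.means, clstr_loss_l2_cdist, clustering_start_iter, and feature_store follow the identifiers mentioned in this thread; the loss itself is deliberately simplified):

```python
import torch

class ROIHeadsSketch:
    """Illustrative sketch only; not the actual OWOD ROI head."""

    def __init__(self, num_classes, clustering_start_iter):
        self.num_classes = num_classes
        self.clustering_start_iter = clustering_start_iter
        # One prototype slot per class (plus one extra slot); every slot
        # starts as None and is only filled once feature_store is clustered.
        self.means = [None for _ in range(num_classes + 1)]

    def clstr_loss_l2_cdist(self, input_features, gt_classes):
        # Guard for the two cases described above: iteration 0 with
        # START_ITER: -1, and the first iteration after resuming from a
        # checkpoint. In both, self.means has not been populated from
        # feature_store yet, so we return a zero loss instead of calling
        # torch.cdist on None prototypes.
        if any(mean is None for mean in self.means):
            return input_features.new_zeros(())

        all_means = torch.stack(self.means)                  # (C + 1, D)
        distances = torch.cdist(input_features, all_means)   # (N, C + 1)
        # Simplified attraction term: pull each feature towards its own
        # class prototype. The loss in the repository is more involved.
        own_dist = distances.gather(1, gt_classes.unsqueeze(1))
        return own_dist.mean()

# Toy usage: before the means are initialised, the guard returns a zero loss.
head = ROIHeadsSketch(num_classes=20, clustering_start_iter=1000)
feats = torch.randn(8, 128)
labels = torch.randint(0, 21, (8,))
print(head.clstr_loss_l2_cdist(feats, labels))  # tensor(0.)
```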
Thanks
"in Task 1, if you want to start the clustering from the 0th iteration"
There is a burn-in period before we do clustering. Please refer to the last paragraph of Sec 4.1: "We start computing the loss only after a certain number of burn-in iterations (I_b) are completed". Hence this use-case doesn't come up.
"if you resume the training"
Even here, while restarting, we can give the iteration from which we are restarting as the clustering START_ITER. But in such a situation (premature termination), there is a bigger concern: the feature_store will not have been stored. The best option would be to redo the training so that the prototypes for the newer classes would be intact.
Btw, there is a related point that I think would be nice to share: the clustering START_ITER need not be the iteration at which the last task terminated (i). It can be anything between i and the next UPDATE_MU_ITER.
Hence, a quicker way to restart from a premature termination would be to set START_ITER just before the next immediate UPDATE_MU_ITER.
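To make the arithmetic concrete, here is a small sketch (the iteration numbers are made up for illustration; only the relationship between the resume iteration, START_ITER, and UPDATE_MU_ITER follows the suggestion above):

```python
def suggested_start_iter(resume_iter, update_mu_period):
    """Pick a clustering START_ITER just before the next UPDATE_MU_ITER boundary.

    resume_iter: iteration from which training is resumed after the crash.
    update_mu_period: interval at which the class means are refreshed (UPDATE_MU_ITER).
    """
    next_update = ((resume_iter // update_mu_period) + 1) * update_mu_period
    # Per the point above, any START_ITER between resume_iter and next_update
    # works; the iteration just before that boundary is the quick-restart choice.
    return next_update - 1

# Example with made-up numbers: resuming at iteration 12,340 with means
# refreshed every 3,000 iterations puts the next refresh at 15,000.
print(suggested_start_iter(12_340, 3_000))  # -> 14999
```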
If this discussion confuses you, I would be happy to clarify more.
Thank you, I understood your point.
While resuming training, self.means was not assigned before calling self.clstr_loss_l2_cdist, which was causing this error. Closes issue 41.