JosephKJ / OWOD

(CVPR 2021 Oral) Open World Object Detection
https://josephkj.in
Apache License 2.0

Fixes the 'local variable 'length' referenced before assignment' error while resuming training. #42

Closed mmaaz60 closed 3 years ago

mmaaz60 commented 3 years ago

While resuming training, self.means had not been assigned before self.clstr_loss_l2_cdist was called, which was causing this error. Closes issue 41.
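For context, below is a minimal sketch of the kind of guard this fix introduces. It assumes self.means starts out as a list of None entries and that the class means can be rebuilt from the feature_store; the method name get_clustering_loss and the rebuild helper are illustrative only, not the exact OWOD code.

```python
# Illustrative sketch only, not the exact OWOD implementation.
# self.means, self.clstr_loss_l2_cdist and the feature_store come from the
# discussion above; build_means_from_feature_store is a hypothetical helper.

def get_clustering_loss(self, input_features, storage_iter):
    if storage_iter <= self.clustering_start_iter:
        # Still inside the burn-in period: no clustering loss yet.
        return 0.0

    if self.means is None or all(m is None for m in self.means):
        # First iteration after resuming from a checkpoint: self.means was
        # never populated in this process, so rebuild it before computing
        # the distance-based loss instead of dereferencing None entries.
        self.means = self.build_means_from_feature_store()  # hypothetical

    return self.clstr_loss_l2_cdist(input_features)
```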

JosephKJ commented 3 years ago

self.means had not been assigned before calling self.clstr_loss_l2_cdist because of an inconsistency in the clustering start iteration. Please refer to this. Do you still feel that this fix is required?

mmaaz60 commented 3 years ago

self.means had not been assigned before calling self.clstr_loss_l2_cdist because of an inconsistency in the clustering start iteration. Please refer to this. Do you still feel that this fix is required?

Thank you,

It seems that the reason for this error in issue 41 was different and does not require this fix. However, I ran into the same error at different stages; the culprit was the same (i.e., an inconsistency in the clustering start iteration) but the triggers were different. Details are below:

  1. For example, in Task 1, if you want to start clustering from the 0th iteration, you need to set START_ITER: -1 in Base-RCNN-C4-OWOD.yaml. This triggers the same error, because at the 0th iteration we attempt to access the self.means entries, which are all None at that point. Hence, the current code effectively has the constraint START_ITER >= 0, i.e., at least one iteration must complete before clustering starts.
  2. Now consider that your training crashes in the middle of Task 4 training (t4_train.yaml). If you resume the training, the code loads the last saved checkpoint with storage.iter > self.clustering_start_iter. Since this is the first iteration after resuming and self.means has not yet been populated from the feature_store, the same error is triggered. (A small sketch of both cases follows this list.)
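The sketch below just restates the two triggers as a predicate; apart from START_ITER, storage.iter, and self.means, the names are illustrative and not code from this repository.

```python
# Illustrative reproduction of the two failure cases described above.

def clustering_loss_would_crash(storage_iter, clustering_start_iter, means):
    """True when clstr_loss_l2_cdist would be reached with unset means."""
    clustering_active = storage_iter > clustering_start_iter
    means_unset = means is None or all(m is None for m in means)
    return clustering_active and means_unset

# Case 1: START_ITER: -1 puts iteration 0 past the start, but means are still None.
print(clustering_loss_would_crash(0, -1, [None] * 20))           # True
# Case 2: resuming Task 4 from a checkpoint whose iteration is already past
# START_ITER, before self.means has been restored from the feature_store.
print(clustering_loss_would_crash(90_500, 90_000, [None] * 80))  # True
```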

In fact, for me this error occurred when I resumed Task 4 training after it crashed with a CUDA out-of-memory error (the CUDA memory error was not caused by this codebase). Furthermore, the solution I implemented is only one of several possible solutions and may not be the best one. Please guide me if I am missing anything.

Thanks

JosephKJ commented 3 years ago

in Task 1, if you want to start the clustering from the 0th iteration

There is a burn-in period before we do clustering. Please refer to the last paragraph of Sec. 4.1: "We start computing the loss only after a certain number of burn-in iterations (I_b) are completed". Hence this use case doesn't come up.

if you resume the training

Even here, while restarting, we can set the iteration from which we are restarting as the clustering START_ITER. But in such a situation (premature termination) there is a bigger concern: the feature_store will not have been stored. The best option would be to redo the task so that the prototypes for the newer classes remain intact.

Btw, there is a related point that I think is worth sharing: the clustering START_ITER need not be the iteration at which the last task terminates (i). It can be anything between i and the next UPDATE_MU_ITER.

Hence, a quicker way to restart from a premature termination would be to set START_ITER to just before the next UPDATE_MU_ITER.
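As a rough illustration of that heuristic, assuming the crash iteration and UPDATE_MU_ITER are known (the function name here is made up for this example):

```python
# Illustrative sketch: pick a clustering START_ITER just before the next
# mean-update boundary after the iteration where training crashed.

def restart_start_iter(crash_iter: int, update_mu_iter: int) -> int:
    next_update = ((crash_iter // update_mu_iter) + 1) * update_mu_iter
    return next_update - 1

# e.g. a run that crashed at iteration 90,500 with UPDATE_MU_ITER = 3,000
# would restart clustering at 92,999, one step before the next mean update
# at iteration 93,000.
print(restart_start_iter(90_500, 3_000))  # 92999
```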

If this discussion confuses you, I would be happy to clarify more.

mmaaz60 commented 3 years ago

Thank you, I understood your point.