Hiusam opened this issue 1 year ago (status: Open)
A few questions and comments:
Did you run the config with the same batch size, learning rate, and schedule that we suggest? Deviating from our recipe will certainly change the behavior of the losses during training (as is expected).
Yes, occasionally we do encounter high losses during training. This happens because an image might be out of distribution or have extreme annotations -- something 3D suffers from more than 2D. For this reason, we provide checks and skip gradient updates in these cases. Given you use the recipe we provide, though, the model should have trained successfully.
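A minimal sketch of this kind of loss-based skipping, assuming a standard PyTorch training loop in which the model returns a dict of losses; the function name, window size, and tolerance below are illustrative, not the repository's actual implementation:

```python
import math
from collections import deque

import torch


def train_with_loss_skipping(model, optimizer, data_loader,
                             window=100, tolerance=4.0):
    """Skip optimizer steps whose loss is non-finite or abnormally large.

    `window` and `tolerance` are illustrative: the current loss is compared
    against the median of the last `window` finite losses, and the update is
    skipped if it exceeds `tolerance` times that median.
    """
    recent_losses = deque(maxlen=window)
    model.train()

    for iteration, batch in enumerate(data_loader):
        loss_dict = model(batch)          # model returns a dict of losses
        loss = sum(loss_dict.values())

        # Skip non-finite losses outright (NaN/Inf would corrupt the weights).
        if not math.isfinite(loss.item()):
            optimizer.zero_grad(set_to_none=True)
            print(f"iter {iteration}: non-finite loss, skipping update")
            continue

        # Skip losses far above the recent median (out-of-distribution images
        # or extreme annotations).
        if len(recent_losses) == window:
            median = sorted(recent_losses)[window // 2]
            if loss.item() > tolerance * median:
                optimizer.zero_grad(set_to_none=True)
                print(f"iter {iteration}: loss {loss.item():.2f} >> median "
                      f"{median:.2f}, skipping update")
                continue

        recent_losses.append(loss.item())

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
```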
`!! Restarting training at 51028 iters. Exploding loss 2% of iters !!`

Maybe I should keep training and hope that, after some restarting, the training will complete? :(
- Large losses during training
And to confirm: you ran with the same batch size? You should certainly keep training the model. We skip updates when losses are large to make training robust. The training should complete.
- Gradient clip
Gradient clipping is another way to protect your model from large losses (and thus large gradients). We chose to skip the updates; gradient clipping instead clips the gradients themselves. Skipping updates when losses are large is certainly less aggressive than gradient clipping, which is why we prefer it (see the sketch after this list).
- Clear the dataset
@Hiusam this is not a dataset issue; there is nothing in the dataset to clear. 3D detection is simply much, much harder than 2D detection. For instance, there are scenes with very distant objects (e.g. objects as far as 200m away), in which case a wrong depth prediction in metric space will produce a large loss and thus large gradients. The solution is not to "clear" the dataset in any way, but to robustify training, which we do.
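For comparison, a minimal sketch of the gradient-clipping alternative discussed above, using PyTorch's built-in utility; the function name and `max_norm` value are illustrative, not something the repository ships:

```python
import torch


def step_with_grad_clipping(model, optimizer, loss, max_norm=1.0):
    """One optimizer step with gradient-norm clipping.

    Unlike skipping the update, this always applies a step, but rescales the
    gradients whenever their global L2 norm exceeds `max_norm`.
    """
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
```

The trade-off is that clipping alters every step with large gradients, whereas skipping leaves the model untouched on the rare bad batches.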
Hi @gkioxari, nice work! I also encountered the same issue. Do you have an estimate of how many retries it usually needs? My experiments seem to have been restarted many times; e.g., the estimated training time is 21 hours, yet after two days training is still restarting. I also use the same Base_Omni3D_out config without any changes. Your suggestions would be very helpful!
I encountered the same problem. Without modifying the code, the training loss explodes during training in both indoor and outdoor scenes. I have tried resuming the experiments from the saved checkpoint, but it does not help: the loss soon explodes again.
Hi, I ran your code with the Base_Omni3D_out config and encountered exploding losses after iteration 43k.
I also found that scaling the batch size up to 160 made the model even more likely to hit `Skipping gradient update due to higher than normal loss`. Is this a normal phenomenon? I ran the code with 8 A100 GPUs. My environment is:
Thank you.