Trying to train UnSniffer on COCO, and getting NaN values in the sampling code of 2nd stage GOC training.

rohit901 commented 1 year ago

Hi, I'm trying to train this model on the COCO dataset for 80k iters, keeping all other parameters and config the same. The second-stage VOS training starts from 12k iters as before.

However, I'm getting this error:

[08/18 01:00:16] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/home/rohit.bharadwaj/.conda/envs/un_sniffer/lib/python3.10/site-packages/detectron2/engine/train_loop.py", line 155, in train
    self.run_step()
  File "/home/rohit.bharadwaj/Documents/Projects/UnSniffer/detection/default_trainer.py", line 259, in run_step
    self._trainer.run_step()
  File "/home/rohit.bharadwaj/Documents/Projects/UnSniffer/detection/utils.py", line 68, in run_step
    loss_dict = self.model(data, self.iter)
  File "/home/rohit.bharadwaj/.conda/envs/un_sniffer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rohit.bharadwaj/Documents/Projects/UnSniffer/detection/modeling/plain_generalized_rcnn_logistic_gmm.py", line 164, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, iteration, gt_instances)
  File "/home/rohit.bharadwaj/.conda/envs/un_sniffer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rohit.bharadwaj/Documents/Projects/UnSniffer/detection/modeling/roihead_gmm.py", line 1057, in forward
    losses = self._forward_box(features, proposals, iteration, goc_samples)
  File "/home/rohit.bharadwaj/Documents/Projects/UnSniffer/detection/modeling/roihead_gmm.py", line 1383, in _forward_box
    prob_density = new_dis.log_prob(negative_samples)
  File "/home/rohit.bharadwaj/.conda/envs/un_sniffer/lib/python3.10/site-packages/torch/distributions/multivariate_normal.py", line 214, in log_prob
    self._validate_sample(value)
  File "/home/rohit.bharadwaj/.conda/envs/un_sniffer/lib/python3.10/site-packages/torch/distributions/distribution.py", line 300, in _validate_sample
    raise ValueError(
ValueError: Expected value argument (Tensor of shape (10000, 1024)) to be within the support (IndependentConstraint(Real(), 1)) of the distribution MultivariateNormal(loc: torch.Size([1024]), covariance_matrix: torch.Size([1024, 1024])), but found invalid values:
tensor([[ 4.3391, 14.1614,  0.2073,  ...,     nan,     nan,     nan],
        [ 0.4429, 49.7287, -0.1949,  ...,     nan,     nan,     nan],
        [ 1.9357, 42.0719,  1.1280,  ...,     nan,     nan,     nan],
        ...,
        [ 4.2084,  9.5100,  0.6741,  ...,     nan,     nan,     nan],
        [ 2.1189, 28.9784, -0.3310,  ...,     nan,     nan,     nan],
        [ 2.6682, 24.7710, -0.3558,  ...,     nan,     nan,     nan]],
       device='cuda:0')

The issue seems to be in the highlighted code block in roihead_gmm.py:

[Screenshot: Screen Shot 2023-08-18 at 13 30 32 PM]

Not sure why I'm getting NaN values. Is it because of data pre-processing? I'm using the coco_2017_train dataset for training instead of the earlier voc_custom_train. All other config params are the same.
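To narrow down where the NaNs come from, it can help to check the tensors that feed the sampling step. The sketch below is only an illustration: `report_non_finite` is a hypothetical helper, and `features` and `mean` are toy stand-ins for the per-class RoI features and their mean, not the actual variables in roihead_gmm.py.

```python
import torch

def report_non_finite(name, tensor):
    """Return the feature columns of `tensor` that contain NaN or Inf values."""
    bad = ~torch.isfinite(tensor)
    n_bad = int(bad.sum())
    if n_bad == 0:
        return []
    # Collapse every dimension except the last to find the affected columns.
    bad_cols = bad.reshape(-1, tensor.shape[-1]).any(dim=0).nonzero().flatten().tolist()
    print(f"{name}: {n_bad} non-finite entries in {len(bad_cols)} feature columns")
    return bad_cols

# Toy stand-in for stored features (assumption: shape [num_samples, feature_dim]).
features = torch.randn(200, 8)
features[5, 6:] = float("nan")  # simulate a few corrupted activations

# A single NaN row is enough to make the mean NaN in those columns.
mean = features.mean(dim=0)

bad_feature_cols = report_non_finite("features", features)
bad_mean_cols = report_non_finite("mean", mean)
```

If the stored features are already non-finite, the problem lies upstream of the sampling code (for example in the first-stage training); if they are finite but the fitted statistics are not, the estimation step is the place to look.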

rohit901 commented 1 year ago

@Went-Liang, could you please point me in the right direction on how to resolve this issue? Training works fine for the first stage (the first 12k iters); the error appears the moment iter = 12k.

rohit901 commented 1 year ago

Starting the VOS training stage at a later iteration (more than 12k) seemed to work. I guess the model's predictions were not stable enough yet, which is why it failed on COCO when the start iter was 12k.

YH-2023 commented 1 year ago

@rohit901 Did you solve it, please?

rohit901 commented 1 year ago

@YH-2023 Yes, just increase the start_iters of VOS training. I think the default is 12k iters; increase it, increase the total iters for training as well, and it should work.
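The fix described above is a config change (a later start iteration). As a complementary safeguard, one could also skip the density evaluation whenever its inputs are not finite, instead of letting `log_prob` raise. This is a minimal sketch under that assumption: `safe_log_prob` is a hypothetical helper, not part of UnSniffer, and the tensors are toy data.

```python
import torch
from torch.distributions import MultivariateNormal

def safe_log_prob(mean, cov, samples):
    """Return log-densities of `samples`, or None if any input is NaN/Inf."""
    inputs_ok = (
        torch.isfinite(mean).all()
        and torch.isfinite(cov).all()
        and torch.isfinite(samples).all()
    )
    if not inputs_ok:
        return None  # the caller can skip this iteration's sampling loss
    dist = MultivariateNormal(loc=mean, covariance_matrix=cov)
    return dist.log_prob(samples)

# Toy example with a 4-dimensional standard normal.
mean = torch.zeros(4)
cov = torch.eye(4)
good_samples = torch.randn(10, 4)
bad_samples = good_samples.clone()
bad_samples[0, 3] = float("nan")

good_result = safe_log_prob(mean, cov, good_samples)  # tensor of shape (10,)
bad_result = safe_log_prob(mean, cov, bad_samples)    # None
```

Skipping only hides the symptom, so it is best paired with the later start iteration rather than used in place of it.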

YH-2023 commented 1 year ago

@rohit901 What start_iters did you set? And what metrics did you get on the COCO dataset with VOS?

Went-Liang / UnSniffer

Trying to train UnSniffer on COCO, and getting NaN values in the sampling code of 2nd stage GOC training. #12