YutingXiao / Amodal-Segmentation-Based-on-Visible-Region-Segmentation-and-Shape-Prior

Implementation of the AAAI-2021 paper "Amodal Segmentation Based on Visible Region Segmentation and Shape Prior"

RuntimeError when training and evaluating on D2SA #10

Closed Noiredd closed 1 year ago

Noiredd commented 3 years ago

Thanks for the interesting paper and the easy-to-run implementation!

I tried following your training guide but got surprisingly stuck on the first step (D2SA training):

python tools/train_net.py --config-file configs/D2SA-AmodalSegmentation/mask_rcnn_R_50_FPN_1x_parallel_CtRef_VAR_SPRef_SPRet_FM.yaml

I modified the config for a quick test, changing only the SOLVER entry in Base-RCNN-FPN-D2SA.yaml as follows:

SOLVER:
  IMS_PER_BATCH: 1
  BASE_LR: 0.005
  MAX_ITER: 7000
  OPT_TYPE: "SGD"

(A single image per batch, fewer iterations, no LR stepping, and no checkpoints.) I figured this kind of change would let me run the model quickly, check that everything works, and then run the whole thing overnight.

After successfully training for 7000 iterations the process crashed at evaluation with the following traceback:

[08/18 15:54:49 d2.evaluation.evaluator]: Start embedding inference on training 1562 images
[08/18 15:54:51 d2.engine.hooks]: Overall training speed: 6997 iterations in 0:38:35 (0.3309 s / it)
[08/18 15:54:51 d2.engine.hooks]: Total training time: 0:43:57 (0:05:22 on hooks)
Traceback (most recent call last):
  File "../../repo/tools/train_net.py", line 173, in <module>
    args=(args,),
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/engine/launch.py", line 51, in launch
    main_func(*args)
  File "../../repo/tools/train_net.py", line 161, in main
    return trainer.train()
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/engine/defaults.py", line 416, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/engine/train_loop.py", line 133, in train
    self.after_step()
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/engine/train_loop.py", line 151, in after_step
    h.after_step()
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/engine/hooks.py", line 325, in after_step
    results = self._func()
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/engine/defaults.py", line 366, in test_and_save_results
    self._last_eval_results = self.test(self.cfg, self.model)
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/engine/defaults.py", line 528, in test
    model = embedding_inference_on_train_dataset(model, embedding_dataloader)
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/evaluation/evaluator.py", line 259, in embedding_inference_on_train_dataset
    model(inputs)
  File "/home/pdolata/Programy/detectron_amodal/davenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/modeling/meta_arch/rcnn.py", line 109, in forward
    return self.inference(batched_inputs, do_postprocess=do_postprocess)
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/modeling/meta_arch/rcnn.py", line 166, in inference
    results, _ = self.roi_heads(images, features, proposals, gt_instances)
  File "/home/pdolata/Programy/detectron_amodal/davenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/modeling/roi_heads/roi_heads.py", line 862, in forward
    pred_instances = self.forward_with_given_boxes(features, pred_instances)
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/modeling/roi_heads/roi_heads.py", line 891, in forward_with_given_boxes
    instances = self._forward_mask(features, instances)
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/modeling/roi_heads/roi_heads.py", line 1008, in _forward_mask
    mask_logits, _ = self.mask_head(mask_features, instances)
  File "/home/pdolata/Programy/detectron_amodal/davenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/modeling/roi_heads/mask_head.py", line 817, in forward
    k=self.SPk).detach()
  File "/home/pdolata/Programy/detectron_amodal/repo/detectron2/modeling/roi_heads/recon_net.py", line 323, in nearest_decode
    indices = torch.topk(- distances, k)[1]
RuntimeError: invalid argument 5: k not in range for dimension at /pytorch/aten/src/THC/generic/THCTensorTopK.cu:23

I know this error pops up when torch.topk is called with k larger than the size of the dimension it selects over (a minimal reproduction is below), but honestly I have no idea where to look for the source of this in the model configs. Any hints?
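For reference, a tiny snippet with made-up shapes that triggers the same kind of RuntimeError outside the model:

import torch

# Made-up shapes: 5 "queries" compared against a codebook of only 3 entries.
distances = torch.randn(5, 3)

# topk selects over the last dimension (size 3), so k=4 is out of range and
# raises a RuntimeError like the "k not in range for dimension" one above.
indices = torch.topk(-distances, k=4)[1]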

YutingXiao commented 3 years ago

"indices = torch.topk(- distances, k)[1]" aims to find the top k nearest category-specific features in the codebook. It seems that the dimension of the codebook is not correct? I suggest checking the dimension of the "distances" and "codebook".

Noiredd commented 1 year ago

Just found this ancient issue, which, by the way, never occurred again after I reinstalled your repo. It must have been something with the dependencies.