Training Error

dlee640 commented 3 years ago

Hi team, first of all, thank you for open-sourcing such good work.

I have the pretrained model in trained_models/sunrgbd/r34_NBt1D.pth, and eval.py runs successfully, reproducing the mIoU reported in the paper.

My goal is to vary the parameters in args.py and compare the results to see what I can improve.

I have been trying to train the model using only --modality rgb. Running train.py gives the following error:

(rgbd_segmentation) dlee640@dlee640-lenovo:~/ESANet$ python train.py \
>     --dataset sunrgbd \
>     --dataset_dir ./datasets/sunrgbd \
>     --pretrained_dir ./trained_models/sunrgbd \
>     --results_dir ./results \
>     --modality rgb \
> 
Compute class weights
5285/5285
Saved class weights under /home/dlee640/ESANet/src/datasets/sunrgbd/weighting_median_frequency_1+37_train.pickle.
/home/dlee640/ESANet/src/build_model.py:29: UserWarning: Argument --channels_decoder is ignored when --decoder_chanels_mode decreasing is set.
  warnings.warn('Argument --channels_decoder is ignored when '
/home/dlee640/ESANet/src/models/resnet.py:101: UserWarning: parameters groups, base_width and norm_layer are ignored in NonBottleneck1D
  warnings.warn('parameters groups, base_width and norm_layer are '
Loaded r34 with encoder block NonBottleneck1D pretrained on ImageNet
./trained_models/sunrgbd/r34_NBt1D.pth
/home/dlee640/ESANet/src/models/model_one_modality.py:139: UserWarning: for the context module the learned upsampling is not possible as the feature maps are not upscaled by the factor 2. We will use nearest neighbor instead.
  warnings.warn('for the context module the learned upsampling is '
Traceback (most recent call last):
  File "train.py", line 552, in <module>
    train_main()
  File "train.py", line 92, in train_main
    model, device = build_model(args, n_classes=n_classes_without_void)
  File "/home/dlee640/ESANet/src/build_model.py", line 87, in build_model
    upsampling=args.upsampling
  File "/home/dlee640/ESANet/src/models/model_one_modality.py", line 164, in __init__
    num_classes=num_classes
TypeError: __init__() got an unexpected keyword argument 'height'

Why is it giving this error? How can I address this issue?

UPDATE: I fixed this issue by commenting out height and width parameters in model_one_modality.py. The part I commented out is shown below. But I have a new issue.

        # decoder
        self.decoder = Decoder(
            channels_in=channels_after_context_module,
            channels_decoder=channels_decoder,
            activation=self.activation,
            nr_decoder_blocks=nr_decoder_blocks,
            encoder_decoder_fusion=encoder_decoder_fusion,
            # commented out: Decoder.__init__() rejects these keyword arguments
            # height=height,
            # width=width,
            upsampling_mode=upsampling,
            num_classes=num_classes
        )
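For anyone hitting the same TypeError: it is raised whenever a call passes a keyword that the callee's __init__ does not declare. Below is a minimal sketch of that failure and of a way to list the offending keywords with inspect.signature. ToyDecoder and unexpected_kwargs are hypothetical names made up for illustration; they are not ESANet code, and the real Decoder signature may differ.

```python
import inspect


class ToyDecoder:
    """Hypothetical stand-in for a decoder whose signature lost two parameters."""

    def __init__(self, channels_in, num_classes):
        self.channels_in = channels_in
        self.num_classes = num_classes


def unexpected_kwargs(cls, kwargs):
    """Return the keyword names that cls.__init__ would reject."""
    params = inspect.signature(cls.__init__).parameters
    # A **kwargs parameter accepts any keyword, so nothing is unexpected.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in kwargs if name not in params]


# The caller still passes height and width, as in the traceback above.
call_kwargs = dict(channels_in=512, num_classes=37, height=480, width=640)

try:
    ToyDecoder(**call_kwargs)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

rejected = unexpected_kwargs(ToyDecoder, call_kwargs)
print(error_message)
print(rejected)
```

Checking the callee's signature this way shows which arguments to drop at the call site, instead of removing them by trial and error.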

When I run train.py, the process gets terminated at a seemingly random point with a "Killed" message:

(rgbd_segmentation) dlee640@dlee640-lenovo:~/ESANet$ python train.py \
>     --dataset sunrgbd \
>     --dataset_dir ./datasets/sunrgbd \
>     --pretrained_dir ./trained_models/sunrgbd \
>     --results_dir ./results \
>     --height 480 \
>     --width 640 \
>     --batch_size 8 \
>     --batch_size_valid 24 \
>     --lr 0.01 \
>     --optimizer SGD \
>     --class_weighting median_frequency \
>     --encoder resnet34 \
>     --encoder_block NonBottleneck1D \
>     --nr_decoder_blocks 3 \
>     --modality rgb \
>     --encoder_decoder_fusion add \
>     --context_module ppm \
>     --decoder_channels_mode decreasing \
>     --fuse_depth_in_rgb_encoder SE-add \
>     --upsampling learned-3x3-zeropad 

Compute class weights
5285/5285
Saved class weights under /home/dlee640/ESANet/src/datasets/sunrgbd/weighting_median_frequency_1+37_train.pickle.
/home/dlee640/ESANet/src/build_model.py:29: UserWarning: Argument --channels_decoder is ignored when --decoder_chanels_mode decreasing is set.
  warnings.warn('Argument --channels_decoder is ignored when '
/home/dlee640/ESANet/src/models/resnet.py:101: UserWarning: parameters groups, base_width and norm_layer are ignored in NonBottleneck1D
  warnings.warn('parameters groups, base_width and norm_layer are '
Loaded r34 with encoder block NonBottleneck1D pretrained on ImageNet
./trained_models/sunrgbd/r34_NBt1D.pth
/home/dlee640/ESANet/src/models/model_one_modality.py:139: UserWarning: for the context module the learned upsampling is not possible as the feature maps are not upscaled by the factor 2. We will use nearest neighbor instead.
  warnings.warn('for the context module the learned upsampling is '
Device: cpu
ESANetOneModality(
  (activation): ReLU(inplace=True)
  (encoder): ResNet(
    (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (act): ReLU(inplace=True)
    (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (layer1): Sequential(
      (0): NonBottleneck1D(
        (conv3x1_1): Conv2d(64, 64, kernel_size=(3, 1), stride=(1, 1), padding=(1, 0))
        (conv1x3_1): Conv2d(64, 64, kernel_size=(1, 3), stride=(1, 1), padding=(0, 1))
        (bn1): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
...
...
...
...
...
        (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128)
      )
      (side_output): Conv2d(128, 37, kernel_size=(1, 1), stride=(1, 1))
    )
    (conv_out): Conv2d(128, 37, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (upsample1): Upsample(
      (pad): Identity()
      (conv): Conv2d(37, 37, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=37)
    )
    (upsample2): Upsample(
      (pad): Identity()
      (conv): Conv2d(37, 37, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=37)
    )
  )
)
Compute class weights
5050/5050
Saved class weights under /home/dlee640/ESANet/src/datasets/sunrgbd/weighting_linear_1+37_test.pickle.
Using SGD as optimizer
Unfreezing
/home/dlee640/anaconda3/envs/rgbd_segmentation/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Killed

I get that "Killed" message at the end and the training terminates abruptly. I cannot find any useful logs about what actually happened, so my progress is halted. I tried both the default training settings and the RGB-only settings, and the same thing happens. What is going on?

dlee640 commented 3 years ago

Apologies for spamming. This issue occurred because my computer is not powerful enough. I am moving on to Google Colab.
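For context, the shell prints "Killed" with no Python traceback when the process receives SIGKILL. On Linux this is commonly the kernel's out-of-memory killer, which would fit the explanation above, although the logs in this thread do not confirm it. The sketch below, which assumes a POSIX system, shows how a wrapper script can tell a signal-terminated run from a normal exit. The child command and the describe_exit helper are illustrative only and are not part of ESANet.

```python
import signal
import subprocess
import sys

# Child script that sends SIGKILL to itself, mimicking what an external
# killer (such as the kernel's out-of-memory killer) does to a process.
CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # A negative return code means the child was terminated by a signal.
        return "terminated by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)


result = subprocess.run([sys.executable, "-c", CHILD])
summary = describe_exit(result.returncode)
print(summary)
```

If a training run ends this way, the kernel log (for example via dmesg) usually records whether the out-of-memory killer was responsible. Lowering --batch_size and --batch_size_valid is one way to reduce memory use on a CPU-only machine.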

mona0809 commented 3 years ago

Thanks for the hint. The bug came from some refactoring before releasing the code. I fixed it now.

danielS91 commented 3 years ago

See commit 08dff34c549020cb682c9801ed1788f25a7156eb.

TUI-NICR / ESANet

Training Error #8