JMGaljaard / fltk-testbed

ResNet not working? CNN works fine #11

Closed jeongwoopark0514 closed 3 years ago

jeongwoopark0514 commented 3 years ago

Hi,

I was running some experiments on Gcloud and, for some reason, ResNet and VGG do not work. At first I thought it was an issue with my Docker container, but when I switched to CNN, it worked perfectly fine.

  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/federation-lab/fltk/__main__.py", line 66, in <module>
    __main__()
  File "/opt/federation-lab/fltk/__main__.py", line 34, in __main__
    client_start(arguments, config)
  File "/opt/federation-lab/fltk/__main__.py", line 57, in client_start
    launch_client(task_id, config=configuration, learning_params=learning_params, namespace=args)
  File "/opt/federation-lab/fltk/launch.py", line 54, in launch_client
    epoch_data = client.run_epochs()
  File "/opt/federation-lab/fltk/client.py", line 208, in run_epochs
    train_loss = self.train(epoch)
  File "/opt/federation-lab/fltk/client.py", line 135, in train
    outputs = self.model(inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/federation-lab/fltk/nets/fashion_mnist_resnet.py", line 60, in forward
    y = self.block3(y)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/pooling.py", line 615, in forward
    return F.avg_pool2d(input, self.kernel_size, self.stride,
RuntimeError: Given input size: (512x1x1). Calculated output size: (512x0x0). Output size is too small

This is a log from FashionMNISTResNet.

I also get a "Backoff Restarting failed container" error when I check the Kubernetes dashboard.

Is there something I should modify before running the ResNet code? Could you look into this?

JMGaljaard commented 3 years ago

Hi @jeongwoopark0514, I ran the code locally and could reproduce your error with the following command:

MASTER_ADDR=localhost MASTER_PORT=5000 RANK=0 WORLD_SIZE=1 python3 -m fltk client configs/example_cloud_experiment.json ee232974-dcde-4977-8f3d-40bf1accabb2 --model FashionMNISTResNet --dataset MNIST --optimizer Adam --max_epoch 5 --batch_size 128 --learning_rate 0.01 --decay 0.0002 --loss CrossEntropy

The culprit is the AvgPool2d layer in block3, which is configured with a kernel size of 3 and no zero padding. By the time the input reaches that layer it is only 1x1 per channel (512x1x1, as the error shows), so a 3x3 window does not fit. The resulting output shape would be [batch_size, 512, 0, 0], which is not a valid tensor shape, hence the RuntimeError.
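To make the arithmetic concrete, here is a minimal sketch of the output-size formula that PyTorch documents for 2D pooling layers (with ceil_mode=False). The helper name `pool_output_size` is made up for illustration, and the stride is an assumption: the traceback does not show it, so the sketch uses PyTorch's default of stride = kernel_size.

```python
def pool_output_size(size, kernel_size, stride=None, padding=0):
    """Output length along one spatial dimension of a pooling layer.

    Mirrors floor((size + 2*padding - kernel_size) / stride) + 1,
    i.e. the ceil_mode=False case. Hypothetical helper, not part of fltk.
    """
    if stride is None:
        # Assumption: PyTorch defaults the stride to the kernel size.
        stride = kernel_size
    # Python's // rounds toward negative infinity, matching floor().
    return (size + 2 * padding - kernel_size) // stride + 1


# A 1x1 feature map pooled with a 3x3 kernel and no padding.
unpadded = pool_output_size(1, kernel_size=3)

# The same pooling with one pixel of zero padding on each side.
padded = pool_output_size(1, kernel_size=3, padding=1)

print(unpadded, padded)
```

Under these assumptions the unpadded case yields a length of 0 per spatial dimension, which corresponds to the 512x0x0 in the error message, while padding=1 yields a length of 1.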

One solution would be to give the AvgPool2d layer in block3 of fashion_mnist_resnet.py padding=1, to compensate for the pixels that we lose. Another is to use AdaptiveAvgPool2d, which is also what the torchvision ResNet implementation uses.

I will update the code to use AdaptiveAvgPool2d, as this seems to be the more common choice.
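For illustration, here is a rough sketch comparing the two options on a tensor with the shape from the traceback. It is not the code from fashion_mnist_resnet.py: the batch size of 4 and the layer arguments other than the kernel size of 3 are assumptions, and it requires PyTorch to be installed.

```python
import torch
import torch.nn as nn

# Dummy activations shaped like the input to the failing layer:
# [batch_size, 512, 1, 1]. The batch size of 4 is arbitrary.
x = torch.randn(4, 512, 1, 1)

# Kernel of 3 without padding: expected to fail as in the traceback.
try:
    nn.AvgPool2d(kernel_size=3)(x)
    unpadded_failed = False
except RuntimeError:
    unpadded_failed = True

# Option 1: zero padding of 1 so the 3x3 window fits the 1x1 input.
padded = nn.AvgPool2d(kernel_size=3, padding=1)(x)

# Option 2: adaptive pooling to a fixed 1x1 output, whatever the input size.
adaptive = nn.AdaptiveAvgPool2d((1, 1))(x)

print(unpadded_failed, tuple(padded.shape), tuple(adaptive.shape))
```

Both options produce a [4, 512, 1, 1] output here. One difference worth keeping in mind: with the default count_include_pad=True, the padded variant averages the zero padding into the result, whereas AdaptiveAvgPool2d only averages real input values and keeps working if the spatial size of the input changes.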

JMGaljaard commented 3 years ago

The incorporated changes should address the exceptions that you ran into. I'll close the issue, but feel free to re-open it or create a new one!