apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Error when training PSPNet on Cityscapes dataset using GluonCV #17439

Open KuangHaofei opened 4 years ago

KuangHaofei commented 4 years ago

Description

The problem is that when I train a PSPNet with GluonCV's semantic segmentation library on the Cityscapes dataset, training hangs right after it starts. Checking the GPU status shows that memory is allocated but no program is running and GPU utilization is 0%. A screenshot is attached below.

gpu_status

By the way, I use iTerm2's keep-alive feature ("when idle, send ASCII code 0 every 60 seconds") to keep the connection to the server alive, so each @ in the output means 60 seconds of waiting. The figure below shows the hang more clearly.

error

This hang occurs with both MXNet 1.5.1 and 1.6.0, but training works fine with MXNet 1.4.1.

Environment

To Reproduce

  1. Setup environment

    source activate mxnet_p36
    pip install mxnet-cu101==1.5.1  # or 1.6.0b20191122
    pip install gluoncv
  2. Download cityscapes dataset following: https://gluon-cv.mxnet.io/build/examples_datasets/cityscapes.html

  3. Run training command,

    cd ~/
    git clone https://github.com/dmlc/gluon-cv.git
    cd gluon-cv/scripts/segmentation/
    python train.py --dataset citys --model psp --aux --backbone resnet101 --syncbn --ngpus 8 --lr 0.01 --epochs 240 --base-size 2048 --crop-size 768 --workers 32
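To tell a genuine hang apart from a slow startup, the training command can be wrapped with a timeout. This is a hedged sketch of my own: `run_with_timeout` and the time limit are illustrations, not part of the GluonCV scripts.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Run cmd, killing it if it exceeds timeout_s seconds.

    Returns (timed_out, returncode); returncode is None when the
    process was killed for exceeding the limit.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        return True, None

# Illustration with a trivial command; for the real check, substitute
# the train.py invocation above and a generous limit (e.g. 30 * 60).
timed_out, rc = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=60)
print("hung" if timed_out else "completed")
```

`subprocess.run` kills the child process before raising `TimeoutExpired`, so a hung training run does not leave a zombie behind.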

What have you tried to solve it?

  • I trained the same network on other datasets such as ADE20K; there was no problem.
  • I reduced the crop size from 768 to 480 in the command line above with MXNet 1.5.1; there was no problem.
  • I tried MXNet 1.4.1; there was no problem.

So I suspect that some change introduced starting from MXNet 1.5.0 makes it unable to handle large inputs/tensors: a 768x768x3 input hangs, but a 480x480x3 input goes through.
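For scale, a quick back-of-the-envelope comparison of the two crop sizes (a sketch only; it quantifies the input-size jump between the failing and working cases, not the cause of the hang, and real memory use depends on the network and batch size):

```python
def tensor_bytes(h, w, c, batch=1, dtype_bytes=4):
    """Bytes occupied by one float32 activation of shape (batch, c, h, w)."""
    return batch * c * h * w * dtype_bytes

big = tensor_bytes(768, 768, 3)    # the crop size that hangs
small = tensor_bytes(480, 480, 3)  # the crop size that works
print(big, small, big / small)     # the 768 crop is 2.56x larger per image
```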

zhreshold commented 4 years ago

If you can bisect the date of the failure, then we can better locate the PR that introduced the problem.
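The bisection over nightly builds amounts to a binary search on the build date. A minimal helper sketch (the `mxnet-cu101` wheel name and `b<date>` version format follow the versions mentioned above, but the good/bad dates here are placeholders to be replaced with the actual last-working and first-failing nightlies):

```python
from datetime import date

def next_bisect_date(good, bad):
    """Midpoint between the last-known-good and first-known-bad nightly dates."""
    assert good < bad
    return good + (bad - good) // 2

# Placeholder window: substitute the real last-good / first-bad nightly dates.
good, bad = date(2019, 3, 1), date(2019, 6, 1)
while (bad - good).days > 1:
    mid = next_bisect_date(good, bad)
    print(f"pip install mxnet-cu101==1.5.0b{mid:%Y%m%d}")
    # Run the repro with that build, then narrow the window:
    # if it hangs, set bad = mid; if it trains, set good = mid.
    bad = mid  # (here we pretend every tested build hangs)
```

With a three-month window this converges in about seven installs, each narrowing the suspect range of merged PRs by half.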