Description
The problem is that when I train a PSPNet with the GluonCV semantic segmentation library on the Cityscapes dataset, the training hangs right after it starts. Checking the GPU status shows that memory is in use but no program is running, and GPU utilization is 0%. The screenshot is attached below.
By the way, I use iTerm2's keep-alive feature (when idle, send ASCII code 0 every 60 seconds) to keep the connection to the server, so each "@" in the output means another 60 seconds of waiting. The figure below shows the hang more clearly.
This hang happens with both MXNet 1.5.1 and 1.6.0, but training is fine with MXNet 1.4.1.
Environment
To Reproduce
1. Set up the environment.
2. Download the Cityscapes dataset following: https://gluon-cv.mxnet.io/build/examples_datasets/cityscapes.html
3. Run the training command (a rough sketch of the equivalent setup is shown below).
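The exact training command is not included above, so the following is only an illustrative Python sketch of the data/model setup that GluonCV's semantic segmentation training performs for this configuration. The crop size of 768 is the failing value from the report; the base size, batch size, normalization constants, and worker count are assumptions, not values taken from the issue.

```python
# Illustrative sketch only -- not the author's actual training command.
# Assumes the Cityscapes data was prepared as in the linked GluonCV tutorial.
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.data.vision import transforms
from gluoncv.data import CitySegmentation
from gluoncv.model_zoo import get_model

# Standard ImageNet normalization, as used by the GluonCV segmentation presets.
input_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([.485, .456, .406], [.229, .224, .225]),
])

# crop_size=768 is the setting that hangs; base_size=2048 is an assumption
# (Cityscapes frames are 2048x1024).
trainset = CitySegmentation(split='train', mode='train',
                            transform=input_transform,
                            base_size=2048, crop_size=768)
train_data = gluon.data.DataLoader(trainset, batch_size=2, shuffle=True,
                                   last_batch='rollover', num_workers=4)

# PSPNet with a ResNet-101 backbone from the GluonCV model zoo; crop_size is
# passed so the head's internal upsampling matches the 768x768 crops.
net = get_model('psp_resnet101_citys', pretrained=False,
                base_size=2048, crop_size=768, ctx=mx.gpu(0))
```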
What have you tried to solve it?
- I tried to train the same network on other datasets such as ADE20K; there is no problem.
- I tried reducing the crop size from 768 to 480 in the command line above with MXNet 1.5.1; there is no problem.
- I tried MXNet 1.4.1; there is no problem.

So I suspect that some change introduced in MXNet 1.5.0 leads to this situation: large inputs/tensors can no longer be handled. An input of 768x768x3 will hang, but an input of 480x480 will go through.
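To isolate whether the input size alone is the trigger, something like the following minimal sketch (my own illustration, not a script from the issue) compares one forward/backward pass at 480x480 against 768x768. Under the suspicion above, mx.nd.waitall() should return for 480 but block indefinitely for 768 on MXNet 1.5.x/1.6.0.

```python
# Minimal sketch (illustrative, not taken from the issue) to compare crop sizes in isolation.
import mxnet as mx
from mxnet import autograd
from gluoncv.model_zoo import get_model

ctx = mx.gpu(0)

for size in (480, 768):
    # crop_size controls PSPNet's internal fixed-size upsampling, so rebuild the
    # model per size; pretrained=True is used only to get initialized weights.
    net = get_model('psp_resnet101_citys', pretrained=True, crop_size=size, ctx=ctx)
    x = mx.nd.random.uniform(shape=(2, 3, size, size), ctx=ctx)
    with autograd.record():
        outputs = net(x)                      # PSPNet returns main and auxiliary logits
        loss = sum(o.sum() for o in outputs)  # dummy scalar loss, enough to run backward
    loss.backward()
    mx.nd.waitall()   # block until the async engine drains; per the report, 768 hangs here
    print('crop size %d completed' % size)
```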