[Bug] To Fix the Hang Problem in Training PSPNet

apache / mxnet

[Bug] To Fix the Hang Problem in Training PSPNet #19056

Open barry-jin opened 4 years ago

barry-jin commented 4 years ago

Error when training PSPNet on Cityscapes dataset using GluonCV #17439

Problem Description

When I train a PSPNet with the GluonCV semantic segmentation library on the Cityscapes dataset, training hangs right after it starts.

Debugging

After bisecting by date of failure, I found that the first bad commit is PR 13896, which introduced this problem.
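For anyone who wants to repeat the bisection, the search itself can be sketched as below. This is only an illustration and not the script used here: `first_bad`, `builds`, and `is_bad` are hypothetical names, and the sketch assumes the builds are sorted oldest to newest, the oldest one trains fine, the newest one hangs, and the hang never disappears once it has been introduced.

```python
from typing import Callable, Sequence


def first_bad(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest build for which is_bad() is true.

    Assumes builds[0] is good, builds[-1] is bad, and every build
    after the first bad one is also bad.
    """
    lo, hi = 0, len(builds) - 1  # builds[lo] is known good, builds[hi] known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # failure is already present at mid
        else:
            lo = mid  # failure was introduced after mid
    return builds[hi]


# Placeholder build labels; a real run would use nightly build dates or commits.
builds = [f"day-{i}" for i in range(10)]
print(first_bad(builds, lambda b: int(b.split("-")[1]) >= 6))
```

In practice `is_bad` would install the given build, start training, and report whether it hung, so each probe is expensive and the logarithmic number of probes matters.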

Proposed solutions

More investigation is needed.

References

Issue #17439, PR #13896

leezu commented 4 years ago

Why not fix the hang instead of disabling the feature?

sxjscience commented 4 years ago

This does not sound like a solution. Problems related to cuDNN Dropout have a very long history, and we should try to:

Fix cuDNN Dropout
Consider dropping cuDNN Dropout if we can accelerate our native dropout

In fact, we haven't used CUDA calls like curand4(curandStatePhilox4_32_10_t *state) when implementing the random operators.

sxjscience commented 4 years ago

In addition, my guess is that the root cause is related to multiprocessing + cuDNN dropout. Thus, we will need a minimal reproducible code snippet first.
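While narrowing down a reproducer, it may help to run each candidate snippet under a watchdog so that a hang is reported instead of blocking the terminal. The sketch below uses only the Python standard library; `run_with_watchdog` and the timeout values are hypothetical choices, and the snippets passed in are stand-ins for a real MXNet script.

```python
import subprocess
import sys


def run_with_watchdog(code: str, timeout_s: float) -> str:
    """Run `code` in a fresh Python process; return 'ok', 'error', or 'hang'."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising.
        return "hang"
    return "ok" if proc.returncode == 0 else "error"


# Stand-in snippets: one finishes quickly, one sleeps past the timeout.
print(run_with_watchdog("print('done')", timeout_s=30))
print(run_with_watchdog("import time; time.sleep(60)", timeout_s=1))
```

Note that only the direct child is killed on timeout. If the snippet starts its own worker processes, those may need separate cleanup.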

zhreshold commented 4 years ago

+1 to @sxjscience. The segmentation model training adopts the DataParallel pipeline (https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L138), but it uses multithreading instead of multiprocessing.
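To make the distinction concrete, a thread-per-shard pattern looks roughly like the sketch below. This is not the GluonCV code linked above; `parallel_apply`, `fn`, and `shards` are hypothetical names, and a plain Python function stands in for the per-device forward pass. The point is only that all workers run inside one process, so no multiprocessing is involved.

```python
import threading


def parallel_apply(fn, shards):
    """Call fn on each shard in its own thread; return results in shard order."""
    results = [None] * len(shards)
    errors = [None] * len(shards)

    def worker(i, shard):
        try:
            results[i] = fn(shard)
        except Exception as exc:  # keep the error so the caller can re-raise it
            errors[i] = exc

    threads = [
        threading.Thread(target=worker, args=(i, shard))
        for i, shard in enumerate(shards)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # a worker that never returns would block here indefinitely
    for exc in errors:
        if exc is not None:
            raise exc
    return results


# Stand-in workload: sum each shard instead of running a network on a device.
print(parallel_apply(sum, [[1, 2], [3, 4], [5]]))
```

A reproducer built on this pattern would replace `sum` with a forward pass through a dropout layer on each GPU.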