apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[Bug] To Fix the Hang Problem in Training PSPNet #19056

Open barry-jin opened 4 years ago

barry-jin commented 4 years ago

Error when training PSPNet on Cityscapes dataset using GluonCV #17439

Problem Description

When I train a PSPNet on the Cityscapes dataset using the GluonCV semantic segmentation library, training hangs right after it starts.

Debugging

After bisecting by date of failure, I found that the first bad commit is PR 13896, which introduced this problem.
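For anyone who wants to repeat the search, a bisection by build date can be sketched roughly as below. This is only an illustration of the technique, not the script used for this issue: `first_bad_build`, the `hangs` predicate, and the dates are all hypothetical. In practice `hangs` would install the build for that date and run the training script with a timeout.

```python
from datetime import date, timedelta
from typing import Callable, Sequence


def first_bad_build(builds: Sequence[date], hangs: Callable[[date], bool]) -> date:
    """Return the earliest build for which `hangs` returns True.

    Assumes `builds` is sorted ascending, builds[0] is known good,
    builds[-1] is known bad, and every build after the first bad one
    is also bad.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] is good, builds[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hangs(builds[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return builds[hi]


# Made-up dates for illustration only; they are not taken from this issue.
start = date(2020, 1, 1)
builds = [start + timedelta(days=i) for i in range(30)]
pretend_regression = date(2020, 1, 18)

# Stand-in predicate; a real one would run training and detect the hang.
result = first_bad_build(builds, lambda d: d >= pretend_regression)
```

With N candidate builds this needs about log2(N) training runs, which matters when each run has to wait for a timeout to confirm a hang.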

Proposed solutions

More investigation is needed.

References

leezu commented 4 years ago

Why not fix the hang instead of disabling the feature?

sxjscience commented 4 years ago

This does not sound like a solution. Problems related to CUDNN Dropout have a very long history and we should try to

In fact, we haven't used CUDA calls like curand4(curandStatePhilox4_32_10_t *state) when implementing the random operators.

sxjscience commented 4 years ago

In addition, my guess is that the root cause is related to multiprocessing + cudnn dropout. Thus, we will need a minimal reproducible code snippet first.

zhreshold commented 4 years ago

+1 to @sxjscience. The segmentation model training adopts the DataParallel pipeline (https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L138), but it uses multithreading instead of multiprocessing (mp).
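To make the threading pattern concrete, here is a minimal standard-library sketch of "one thread per shard" with a join timeout, which could serve as a starting point for a small reproducer. It is not the GluonCV implementation: `parallel_apply` and its parameters are hypothetical names, and the per-shard function is a placeholder for a real forward/backward pass on one device.

```python
import threading
from typing import Any, Callable, List, Sequence


def parallel_apply(fn: Callable[[Any], Any],
                   shards: Sequence[Any],
                   timeout: float = 5.0) -> List[Any]:
    """Run `fn` on every shard in its own thread and collect the results.

    Raises TimeoutError if any worker is still running after `timeout`
    seconds, so a hang shows up as an error instead of blocking forever.
    """
    results: List[Any] = [None] * len(shards)

    def worker(index: int, shard: Any) -> None:
        # Each thread writes only to its own slot, so no lock is needed.
        results[index] = fn(shard)

    threads = [
        threading.Thread(target=worker, args=(i, shard), daemon=True)
        for i, shard in enumerate(shards)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout)

    stuck = [i for i, t in enumerate(threads) if t.is_alive()]
    if stuck:
        raise TimeoutError(f"workers for shards {stuck} did not finish")
    return results


# Placeholder workload: sum each shard instead of running a network.
outputs = parallel_apply(sum, [[1, 2], [3, 4]])
```

Swapping the placeholder for a dropout forward pass on each GPU would test whether threads alone are enough to trigger the hang, without the rest of the training script.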