Open barry-jin opened 4 years ago
Why not fix the hang instead of disabling the feature?
This does not sound like a solution. Problems related to cuDNN Dropout have a very long history and we should try to
In fact, we haven't used CUDA calls like curand4(curandStatePhilox4_32_10_t *state) when implementing the random operators.
In addition, my guess is that the root cause is related to multiprocessing + cuDNN dropout. Thus, we will need a minimal reproducible code snippet first.
+1 to @sxjscience. The segmentation model training adopts the DataParallel pipeline (https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L138), but it uses multithreading instead of multiprocessing.
Error when training PSPNet on Cityscapes dataset using GluonCV #17439
Problem Description
The problem is that when I train a PSPNet on the Cityscapes dataset using the GluonCV semantic segmentation library, training gets stuck (hangs) right after it starts.
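Since a hang never returns an exit code, it helps to turn "training is stuck" into something a script can check. Below is a minimal sketch, assuming the training run can be launched as a separate command and that a fixed time budget is enough to tell a hang from a slow start. The helper name `run_with_timeout` and the budget are hypothetical and are not part of GluonCV or MXNet.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process and classify the outcome.

    Returns "ok" if it exits with code 0, "error" if it exits with a
    non-zero code, and "hang" if it is still running after timeout_s seconds.
    """
    try:
        # subprocess.run kills the child itself when the timeout expires.
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if proc.returncode == 0 else "error"


if __name__ == "__main__":
    # Stand-in command; replace it with the real training invocation.
    print(run_with_timeout([sys.executable, "-c", "print('step done')"], 60))
```

Running only a few iterations under such a wrapper keeps each check short, which matters when the same check has to be repeated across many builds.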
Debugging
After bisecting by date of failure, I found that the first bad commit is PR 13896, which introduced this problem.
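The date bisection can be sketched as a binary search over candidate build dates. This is only an illustration under stated assumptions: the candidates are sorted, the earliest is known good, the latest is known bad, and every build after the first bad one is also bad. The `is_bad` predicate is hypothetical; in practice it would install the build for that date and check whether training hangs.

```python
from datetime import date
from typing import Callable, List


def first_bad(candidates: List[date], is_bad: Callable[[date], bool]) -> date:
    """Return the earliest date in candidates for which is_bad is True.

    Assumes candidates is sorted, candidates[0] is good, candidates[-1]
    is bad, and builds stay bad once the regression is introduced.
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return candidates[hi]
```

With N candidate dates this needs about log2(N) checks, so a month of nightly builds takes roughly five runs. Once the first bad date is known, the commits merged on that date can be bisected the same way.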
Proposed solutions
More effort is needed.
References