Open karan6181 opened 4 years ago
Initially, I created an issue (https://github.com/dmlc/gluon-cv/issues/1415) in GluonCV, thinking this might be a problem with the script. But while root-causing it, I found that adding mx.nd.waitall() at the end of the script makes the crash go away. From my understanding (correct me if I am wrong), one shouldn't need to call mx.nd.waitall() explicitly; the MXNet engine should be able to release tensors once the operations have finished.
Is this a bug in MXNet, or am I missing something here?
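For reference, the workaround described above can be sketched roughly as follows. This is only an illustration, not the actual GluonCV training script: train() and main() are hypothetical placeholders, and the tiny NDArray computation stands in for the real training loop. The only part taken from the report is the call to mx.nd.waitall() before the script exits.

```python
def train():
    """Hypothetical stand-in for the real training loop."""
    import mxnet as mx

    x = mx.nd.ones((2, 2))
    # NDArray operations are pushed to the MXNet engine and run asynchronously.
    return (x * 2).sum()


def main():
    try:
        import mxnet as mx
    except ImportError:
        # Lets the sketch run on machines without MXNet installed.
        print("mxnet is not installed; nothing to synchronize")
        return False

    try:
        train()
    finally:
        # Workaround from this report: block until all pending engine
        # operations have finished before the interpreter shuts down.
        mx.nd.waitall()
    return True


if __name__ == "__main__":
    main()
```

Placing the call in a finally block is just one choice for this sketch; the report only says that the call was added at the end of the script.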
It's probably fixed by https://github.com/apache/incubator-mxnet/pull/18768. You can apply that commit to the 1.6 branch and check whether the issue persists.
Thanks @leezu. I will try that patch and let you know.
@karan6181 any update?
Hi, I'm getting "corrupted size vs. prev_size" with:
horovodrun -np 4 -H localhost:4 python train_faster_rcnn.py --dataset coco --horovod --disable-hybridization --batch-size 4
free(): invalid pointer or corrupted size vs. prev_size
1. Without Horovod:
Cmd:
Failure:
Output log file: https://gist.github.com/karan6181/2ce3d8c68406aae5cd1e208aaf7dd5fd#file-mxnet_ssd-log
2. With Horovod:
Cmd:
Failure:
Output log file: https://gist.github.com/karan6181/2ce3d8c68406aae5cd1e208aaf7dd5fd#file-mxnet_ssd_horovod_single_node-log
GluonCV: 0.8.0 (built from source)
Horovod:
MXNet Diagnosis: