Open larroy opened 4 years ago
@zachgk assign @szha @Jerryzcn @zhreshold is this a GluonCV issue?
@larroy So are you fixing the version of GluonCV, only comparing the mxnet versions?
@zhreshold comparing MXNet versions. I think we should add training tests to Gluon CV CI, at least run a quick test to see that the model trains. Where is gluon cv CI hosted?
@larroy CI for GluonCV is hosted separately, alongside GluonNLP and GluonTS for example. So far we don't have nightly tests, and per-PR training tests are too expensive.
I suggested to @Jerryzcn that training can be done for a few minutes to collect throughput and see that it works. You don't need to train a full model.
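A minimal sketch of what such a time-boxed check could look like, assuming the CI job can wrap one training iteration in a callable. The names `measure_throughput`, `check_throughput`, `train_step`, and `min_samples_per_sec` are hypothetical and not part of GluonCV or MXNet; the threshold would have to come from a known-good run.

```python
import time


def measure_throughput(train_step, batch_size, max_seconds=120.0,
                       warmup_steps=2, clock=time.perf_counter):
    """Run train_step repeatedly for about max_seconds; return samples/sec.

    train_step is a hypothetical zero-argument callable that performs one
    forward/backward/update iteration on a batch of batch_size samples.
    """
    # Warm-up iterations are not timed (lazy initialization, autotuning).
    for _ in range(warmup_steps):
        train_step()

    start = clock()
    steps = 0
    while True:
        train_step()
        steps += 1
        elapsed = clock() - start
        if elapsed >= max_seconds:
            break
    return steps * batch_size / elapsed


def check_throughput(train_step, batch_size, min_samples_per_sec, **kwargs):
    """Raise RuntimeError if the measured throughput is below the threshold."""
    throughput = measure_throughput(train_step, batch_size, **kwargs)
    if throughput < min_samples_per_sec:
        raise RuntimeError(
            "throughput regression: %.2f samples/sec, expected at least %.2f"
            % (throughput, min_samples_per_sec)
        )
    return throughput
```

Note that this only catches slow training. If an iteration hangs completely, `train_step` never returns, so the CI job would still need its own job-level timeout.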
Description
I can't train Mask R-CNN with the latest revisions of MXNet:
https://gluon-cv.mxnet.io/build/examples_instance/train_mask_rcnn_coco.html
This revision works:
e9e267ef7 - (Sat, 14 Sep 2019 09:33:08 -0700) reminisce - Fix remaining errors reported by D2L (#16157)
This doesn't:
86ed5f5c0 - (Mon, 28 Oct 2019 01:24:05 -0700) Huang, Gua.. - [NumPy][Operator] NumPy operator may_share_memory and shares_memory (#16533) (upstream/v1.6.x)
I see very low throughput, high CPU usage, and low GPU usage, or training gets stuck completely.
This can be reproduced either from source or from the latest pip builds, so I don't think it's my environment or my build options.
This is my build environment:
Diagnose