MaskRCNN unable to train with master, works with previous revisions

larroy commented 4 years ago

Description

I can't train Mask R-CNN with the latest revisions of MXNet, using this GluonCV example:

https://gluon-cv.mxnet.io/build/examples_instance/train_mask_rcnn_coco.html

This revision works:

e9e267ef7 - (Sat, 14 Sep 2019 09:33:08 -0700) reminisce - Fix remaining errors reported by D2L (#16157)

This doesn't:

86ed5f5c0 - (Mon, 28 Oct 2019 01:24:05 -0700) Huang, Gua.. - [NumPy][Operator] NumPy operator may_share_memory and shares_memory (#16533) (upstream/v1.6.x)
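With one known-good and one known-bad revision, the offending commit can usually be found by bisection, which is what `git bisect start 86ed5f5c0 e9e267ef7` automates. The sketch below only illustrates the search logic. The intermediate commit names and the `is_good` check are hypothetical placeholders. In practice `is_good` would build that revision and run a short training job.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first bad commit by binary search.

    Assumes `commits` is ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


if __name__ == "__main__":
    # Hypothetical history: only the two endpoints come from this report.
    history = ["e9e267ef7", "c1", "c2", "c3", "86ed5f5c0"]
    # Placeholder check; a real one would build the commit and train briefly.
    known_good = {"e9e267ef7", "c1"}
    print(first_bad_commit(history, lambda c: c in known_good))
```

The search needs about log2(N) builds for N commits between the two revisions, so the main cost is how long each build and test run takes.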

I see very low throughput with high CPU usage and low GPU usage, or training gets stuck completely.

This can be reproduced either from source or from the latest pip builds, so I don't think it's my environment or my build options.

This is my build environment:

USE_CUDA: "ON" # Build with CUDA support
USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
USE_NCCL: "ON" # Use NVidia NCCL with CUDA
USE_OPENCV: "ON" # Build with OpenCV support
USE_OPENMP: "PLATFORM" # Build with Openmp support
USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for search path
USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects support if "ON"
USE_LAPACK: "ON" # Build with lapack support
USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
USE_JEMALLOC: "ON" # Build with Jemalloc support
USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
USE_CPP_PACKAGE: "OFF" # Build C++ Package
USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
USE_GPROF: "OFF" # Compile with gprof (profiling) flag
USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set VTUNE_ROOT for search path
ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
INSTALL_EXAMPLES: "OFF" # Install the example source files.
USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
USE_TENSORRT: "OFF" # Enable inference optimization with TensorRT.
USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric output
CMAKE_BUILD_TYPE: "Release"
CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
CMAKE_C_COMPILER_LAUNCHER: "ccache"
CMAKE_CXX_COMPILER_LAUNCHER: "ccache"

Diagnose

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
Stepping:            4
CPU MHz:             3134.070
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            33792K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
----------Python Info----------
Version      : 3.6.8
Compiler     : GCC 8.3.0
Build        : ('default', 'Oct  7 2019 12:59:55')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.3.1
Directory    : /home/piotr/mxnet/py3_venv/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/piotr/mxnet/python/mxnet
Commit hash file "/home/piotr/mxnet/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/piotr/mxnet/python/mxnet/../../build/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✔ NCCL
✔ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✔ F16C
✔ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✖ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✖ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-4.15.0-1052-aws-x86_64-with-Ubuntu-18.04-bionic
system       : Linux
node         : 18-232-106-45
release      : 4.15.0-1052-aws
version      : #54-Ubuntu SMP Tue Oct 1 15:43:26 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 0.4104 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0190 sec, LOAD: 0.0444 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0222 sec, LOAD: 0.3929 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0184 sec, LOAD: 0.3812 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0017 sec, LOAD: 0.0803 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0063 sec, LOAD: 0.0893 sec.
----------Environment----------

samskalicky commented 4 years ago

@zachgk assign @szha @Jerryzcn @zhreshold is this a GluonCV issue?

zhreshold commented 4 years ago

@larroy So are you keeping the GluonCV version fixed and only comparing the MXNet versions?

larroy commented 4 years ago

@zhreshold Yes, I'm only comparing MXNet versions. I think we should add training tests to the GluonCV CI, or at least run a quick test to check that the model trains. Where is the GluonCV CI hosted?

zhreshold commented 4 years ago

@larroy CI for GluonCV is hosted separately, alongside GluonNLP and GluonTS, for example. So far we don't have nightly tests, and per-PR training tests are too expensive.

larroy commented 4 years ago

I suggested to @Jerryzcn that training could run for just a few minutes to measure throughput and confirm that it works. You don't need to train a full model.
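A time-boxed check along these lines could look roughly like the sketch below. It is not taken from the GluonCV CI. `measure_throughput` and `fake_train_step` are hypothetical names, and the fake step only sleeps to stand in for one real training iteration that returns the number of samples it processed.

```python
import time
from typing import Callable


def measure_throughput(train_step: Callable[[], int],
                       time_budget_s: float,
                       warmup_steps: int = 2) -> float:
    """Run `train_step` for about `time_budget_s` seconds; return samples/sec.

    `train_step` must run one iteration and return the number of samples
    it processed. Warm-up iterations are excluded from the measurement.
    """
    if time_budget_s <= 0:
        raise ValueError("time_budget_s must be positive")
    for _ in range(warmup_steps):
        train_step()
    samples = 0
    start = time.perf_counter()
    while True:
        samples += train_step()
        elapsed = time.perf_counter() - start
        if elapsed >= time_budget_s:
            return samples / elapsed


def fake_train_step() -> int:
    """Stand-in for a real iteration (assumption: batch size of 8)."""
    time.sleep(0.01)
    return 8


if __name__ == "__main__":
    rate = measure_throughput(fake_train_step, time_budget_s=0.2)
    # The threshold is an arbitrary placeholder; a real job would compare
    # against a baseline recorded on a known-good revision.
    assert rate > 100, f"throughput regression: {rate:.1f} samples/sec"
    print(f"{rate:.1f} samples/sec")
```

A run that gets stuck completely would never return from `train_step`, so a real job would also need an external timeout around the whole script.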

apache / mxnet

MaskRCNN unable to train with master, works with previous revisions #16675
