apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

SSD Training fails with free pointer issue during end of training #19024

Open karan6181 opened 4 years ago

karan6181 commented 4 years ago

1. Without Horovod:

Cmd:

python gluon-cv/scripts/detection/ssd/train_ssd.py --gpus 0,1,2,3,4,5,6,7 -j 32 --network resnet50_v1 --data-shape 512 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 1 --batch-size 64 --log-interval 100 --val-interval 20 --save-interval 20

Failure:

free(): invalid pointer

Output log file: https://gist.github.com/karan6181/2ce3d8c68406aae5cd1e208aaf7dd5fd#file-mxnet_ssd-log

2. With Horovod:

Cmd:

horovodrun -np 8 python gluon-cv/scripts/detection/ssd/train_ssd.py -j 32 --network resnet50_v1 --data-shape 512 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 1 --horovod --batch-size 64 --log-interval 100 --val-interval 20 --save-interval 20

Failure:

[1,1]<stderr>:corrupted size vs. prev_size
[1,1]<stderr>:[ip-100-64-13-241:09515] *** Process received signal ***
[1,1]<stderr>:[ip-100-64-13-241:09515] Signal: Aborted (6)
[1,1]<stderr>:[ip-100-64-13-241:09515] Signal code:  (-6)
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7fb2d87948a0]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fb2d83cff47]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fb2d83d18b1]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89907)[0x7fb2d841a907]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9097a)[0x7fb2d842197a]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x90b7c)[0x7fb2d8421b7c]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x94848)[0x7fb2d8425848]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x27d)[0x7fb2d842835d]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 8] /home/ubuntu/anaconda3/envs/mxnet_p36/bin/../lib/libstdc++.so.6(_Znwm+0x15)[0x7fb269b344e5]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 9] /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38b43cd)[0x7fb28d8dd3cd]
[1,1]<stderr>:[ip-100-64-13-241:09515] [10] /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38ba8c6)[0x7fb28d8e38c6]
[1,1]<stderr>:[ip-100-64-13-241:09515] [11] /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38bac16)[0x7fb28d8e3c16]
[1,1]<stderr>:[ip-100-64-13-241:09515] [12] /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38bfe60)[0x7fb28d8e8e60]

Output log file: https://gist.github.com/karan6181/2ce3d8c68406aae5cd1e208aaf7dd5fd#file-mxnet_ssd_horovod_single_node-log
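
The libmxnet.so frames in the backtrace above carry only offsets, not symbol names. A minimal sketch of how those offsets could be pulled out of the log and turned into addr2line invocations (addr2line is the GNU binutils tool; -f, -C and -e are its standard options). The names FRAME_RE, unresolved_frames and addr2line_commands are hypothetical helpers written for this illustration, and the output is only informative if the installed library still carries symbols, which a release wheel may not.

```python
import re

# Matches one backtrace frame, e.g.
#   [ 9] .../libmxnet.so(+0x38b43cd)[0x7fb28d8dd3cd]
# The optional "[rank,rank]<stderr>:" tag covers the interleaved mpirun prefixes.
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s*(?:\[\d+,\d+\]<stderr>:)?"
    r"(?P<library>/\S+?)\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<address>0x[0-9a-f]+)\]"
)


def unresolved_frames(log_text):
    """Yield (index, library, offset) for frames that have no symbol name."""
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match and not match.group("symbol"):
            yield int(match.group("index")), match.group("library"), match.group("offset")


def addr2line_commands(log_text):
    """Build one addr2line command line per unresolved frame."""
    return [
        f"addr2line -f -C -e {library} {offset}"
        for _, library, offset in unresolved_frames(log_text)
    ]


# Three frames copied from the log above; frame 7 already has a symbol and is skipped.
SAMPLE = """\
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 7] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x27d)[0x7fb2d842835d]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 9] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38b43cd)[0x7fb28d8dd3cd]
[1,1]<stderr>:[ip-100-64-13-241:09515] [10] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38ba8c6)[0x7fb28d8e38c6]
"""

if __name__ == "__main__":
    for command in addr2line_commands(SAMPLE):
        print(command)
```

The printed commands would have to be run on the machine that produced the crash, since the offsets are only meaningful for that exact libmxnet.so build.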

GluonCV: 0.8.0 (built from source)

Horovod:

Horovod v0.19.5:

Available Frameworks:
    [ ] TensorFlow
    [ ] PyTorch
    [X] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo

MXNet Diagnosis:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
Stepping:            4
CPU MHz:             1200.041
BogoMIPS:            4999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            33792K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
----------Python Info----------
Version      : 3.6.10
Compiler     : GCC 7.3.0
Build        : ('default', 'Mar 25 2020 23:51:54')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash   : 6de57440b792dca716f1214a81edf557c345fddb
Library      : ['/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✔ NCCL
✔ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✔ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✔ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-5.3.0-1032-aws-x86_64-with-debian-buster-sid
system       : Linux
node         : ip-100-64-13-241
release      : 5.3.0-1032-aws
version      : #34~18.04.2-Ubuntu SMP Fri Jul 24 10:06:28 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0014 sec, LOAD: 0.3844 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0012 sec, LOAD: 0.0220 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0005 sec, LOAD: 0.0184 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0004 sec, LOAD: 0.1442 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0035 sec, LOAD: 0.0546 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0004246234893798828 sec.
----------Environment----------
KMP_DUPLICATE_LIB_OK="True"
KMP_INIT_AT_FORK="FALSE"
karan6181 commented 4 years ago

Initially, I created an issue (https://github.com/dmlc/gluon-cv/issues/1415) in GluonCV, thinking this might be a problem with the script. While root-causing it, I found that adding mx.nd.waitall() at the end of the script makes the crash go away. From my understanding (correct me if I am wrong), one shouldn't need to call mx.nd.waitall() explicitly; the MXNet engine should release tensors on its own once the operations have finished.

Is this a bug in MXNet, or am I missing something here?
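
For reference, the workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual change to train_ssd.py: drain_mxnet_engine and run_training are hypothetical names, and the only MXNet call used is mx.nd.waitall(), the one mentioned in this comment. The import is guarded so the sketch also runs on a machine without MXNet.

```python
import importlib.util


def drain_mxnet_engine():
    """Block until MXNet's asynchronous engine has finished all queued work.

    Returns True if MXNet was found and waitall() was called, False otherwise.
    """
    if importlib.util.find_spec("mxnet") is None:
        return False  # MXNet is not installed, so there is nothing to drain
    import mxnet as mx

    mx.nd.waitall()  # the call reported above to avoid the crash at exit
    return True


def run_training(train_fn):
    """Run train_fn (a hypothetical stand-in for the script's training loop),
    then drain the engine before the interpreter starts tearing down."""
    try:
        return train_fn()
    finally:
        drain_mxnet_engine()


if __name__ == "__main__":
    run_training(lambda: None)
```

Putting the call in a finally block is a design choice of this sketch, so the engine is drained even when training raises. With horovodrun, every worker process runs the script, so each one would execute the same call on its way out.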

leezu commented 4 years ago

It's probably fixed by https://github.com/apache/incubator-mxnet/pull/18768. You can apply that commit to the 1.6 branch and check whether the issue persists.

karan6181 commented 4 years ago

Thanks @leezu. I will try that patch and let you know.

szha commented 3 years ago

@karan6181 any update?

austinmw commented 3 years ago

Hi, I'm getting "corrupted size vs. prev_size" with this command:

horovodrun -np 4 -H localhost:4 python train_faster_rcnn.py --dataset coco --horovod --disable-hybridization --batch-size 4