lorenzob opened this issue 4 years ago
Should have already been fixed in #16194. Could you try a nightly build to verify?
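(For reference, nightly GPU builds at the time were published as pre-release pip wheels, so installing one looks roughly like the command below; assuming CUDA 10.0 to match the mxnet-cu100 package that appears later in this thread, and depending on the date it may additionally need a -f option pointing at the nightly wheel index.)
pip install --pre --upgrade mxnet-cu100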
With the nightly, I first got this error:
https://github.com/apache/incubator-mxnet/issues/16785
and I solved(?) it by setting MXNET_USE_FUSION=0.
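(For reference, a minimal sketch of setting that variable from Python before MXNet builds its first GPU graph; exporting MXNET_USE_FUSION=0 in the shell before launching the script works just as well.)
import os
os.environ["MXNET_USE_FUSION"] = "0"   # disable pointwise fusion before any graph is created
import mxnet as mx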
After that, on the first call following the OOM, I get this (from another model, not the one that threw the OOM error):
Traceback (most recent call last):
File "core/facerec.py", line 676, in <module>
img_result, distances = compare(list(images_data), "work/")
File "core/facerec.py", line 142, in compare
images, aligned_images, detection_boxes = load_and_align_data(image_files, image_size, margin, gpu_memory_fraction, work_dir)
File "core/facerec.py", line 368, in load_and_align_data
result = run_detection(i, factor, image_size, img_data, margin, minsize, onet, pnet, rnet, threshold, options, work_dir)
File "core/facerec.py", line 434, in run_detection
ret = model.detector.detect_face(img, det_type=0)
File "/home/trz/progetti/zzzz/git_repo/ai-face-matching/arcface/mtcnn_detector.py", line 365, in detect_face
total_boxes.extend(local_boxes)
File "/home/trz/progetti/zzzz/git_repo/ai-face-matching/arcface/helper.py", line 168, in detect_first_stage_warpper
return detect_first_stage(*args)
File "/home/trz/progetti/zzzz/git_repo/ai-face-matching/arcface/helper.py", line 156, in detect_first_stage
output = net.predict(input_buf)
File "/home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/model.py", line 736, in predict
o_list.append(o_nd[0:real_size].asnumpy())
File "/home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2532, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/base.py", line 255, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:21:31] /home/ubuntu/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory
Stack trace:
[bt] (0) /home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6a6deb) [0x7fbc5fe28deb]
[bt] (1) /home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6c91b27) [0x7fbc66413b27]
[bt] (2) /home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x39edffe) [0x7fbc6316fffe]
[bt] (3) /home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x37ca914) [0x7fbc62f4c914]
[bt] (4) /home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x37d8941) [0x7fbc62f5a941]
[bt] (5) /home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x37dbf60) [0x7fbc62f5df60]
[bt] (6) /home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x37dc1f6) [0x7fbc62f5e1f6]
[bt] (7) /home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x37d7094) [0x7fbc62f59094]
[bt] (8) /home/trz/miniconda3/envs/facerec/bin/../lib/libstdc++.so.6(+0xb8678) [0x7fbcd2a55678]
The problem here is that the call in mshadow isn't requesting memory from our memory pool. We are in the process of deprecating mshadow.
@lorenzob which date is your nightly build from? Could you build https://github.com/apache/incubator-mxnet/pull/17114 from source and check whether the issue still persists?
CC @ptrendx, it seems there may be another fusion issue here.
@leezu I'm willing to, but I'm stuck on a gtest error during the build, and I cannot find a complete build guide covering all the prerequisites. Can you help me with this?
I have gtest in /usr/src/gtest and /usr/local/lib/gtest/
I tried:
cmake -DBLAS=open -DUSE_CUDA=1 -DUSE_CUDA_PATH=/usr/local/cuda -DUSE_CUDNN=1 -DUSE_MKL_IF_AVAILABLE=ON -DGTEST_ROOT=/usr/local/lib/gtest/ -DCMAKE_BUILD_TYPE=Release -GNinja ..
but I have this error:
[CMakeError.log](https://github.com/apache/incubator-mxnet/files/3987836/CMakeError.log)
[CMakeOutput.log](https://github.com/apache/incubator-mxnet/files/3987837/CMakeOutput.log)
[CMakeError.log](https://github.com/apache/incubator-mxnet/files/3987838/CMakeError.log)
[CMakeOutput.log](https://github.com/apache/incubator-mxnet/files/3987839/CMakeOutput.log)
-- CMAKE_CROSSCOMPILING FALSE
-- CMAKE_HOST_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_NAME Linux
-- CMake version '3.10.2' using generator 'Ninja'
-- Determining F16C support
-- F16C enabled
CMake Error at CMakeLists.txt:281 (add_subdirectory):
The source directory
/home/trz/github/incubator-mxnet/3rdparty/mkldnn
does not contain a CMakeLists.txt file.
-- Found NVTX (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libnvToolsExt.so)
-- Found MKL (include: /opt/intel/mkl/include, lib: /opt/intel/mkl/lib/intel64/libmkl_rt.so
-- CUDA detected: 10.0
-- Found cuDNN (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)
-- Running GPU architecture autodetection
-- Found CUDA arch 6.1
-- Added CUDA NVCC flags for: sm_61
-- Using JEMalloc malloc
-- OpenCV 3.2.0 found (/usr/share/OpenCV)
-- OpenCV_LIBS=opencv_core;opencv_highgui;opencv_imgproc;opencv_imgcodecs
USE_LAPACK is ON
CMake Error at CMakeLists.txt:516 (add_subdirectory):
add_subdirectory given source
"/home/trz/github/incubator-mxnet/3rdparty/googletest/googletest" which is
not an existing directory.
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_cublas_device_LIBRARY (ADVANCED)
linked by target "im2rec" in directory /home/trz/github/incubator-mxnet
linked by target "im2rec" in directory /home/trz/github/incubator-mxnet
linked by target "mxnet" in directory /home/trz/github/incubator-mxnet
linked by target "mxnet" in directory /home/trz/github/incubator-mxnet
linked by target "mxnet_unit_tests" in directory /home/trz/github/incubator-mxnet/tests
linked by target "mxnet_unit_tests" in directory /home/trz/github/incubator-mxnet/tests
linked by target "image-classification-predict" in directory /home/trz/github/incubator-mxnet/example/image-classification/predict-cpp
linked by target "image-classification-predict" in directory /home/trz/github/incubator-mxnet/example/image-classification/predict-cpp
-- Configuring incomplete, errors occurred!
See also "/home/trz/github/incubator-mxnet/build/CMakeFiles/CMakeOutput.log".
See also "/home/trz/github/incubator-mxnet/build/CMakeFiles/CMakeError.log".
@lorenzob thank you. Actually, the 3rdparty/googletest
submodule is included in MXNet. You need to run git submodule update --init --recursive
or clone with git clone --recursive https://github.com/apache/incubator-mxnet/
to make sure all submodules are initialized correctly.
A better error message should indeed be provided.
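(For a copy-pasteable summary of the two options just described:)
cd incubator-mxnet
git submodule update --init --recursive
or, for a fresh clone:
git clone --recursive https://github.com/apache/incubator-mxnet/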
@lorenzob Can you provide some repro instructions for the error where you needed to disable fusion?
@leezu Thanks, I just did a straight checkout from github to get started and missed the recursive part from the docs.
I still need a little more help with the install:
I upgraded CMake to 3.14 (3.10.2 with CUDA 10 does not work, see: https://root-forum.cern.ch/t/intallation-cuda-cublas-device-library-advanced-set-to-notfound/33206/9 )
I had to manually link liblapack.so too:
sudo ln -s /usr/lib/x86_64-linux-gnu/liblapack.so.3 /usr/lib/liblapack.so
to fix the "/usr/bin/ld: cannot find -llapack" error from ninja (I also had to install ninja).
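(On Debian/Ubuntu-derived systems such as Mint, installing the LAPACK development package is the usual alternative to creating the symlink by hand; this is an editorial assumption, not something verified in this thread.)
sudo apt-get install liblapack-dev   # provides the unversioned liblapack.so that the linker looks for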
I'm following this doc:
https://mxnet.apache.org/get_started/build_from_source
I've seen ubuntu_core.sh and ubuntu_python.sh, but I prefer to do it step by step, and I'm also on Mint 19.2.
Now, to add the module to my conda env, I ran:
python python/setup.py install
This added an mxnet entry to the pip list, but not an mxnet-cu100 one.
With only the mxnet module it does not work; if I ask for mxnet.__version__ I get:
AttributeError: module 'mxnet' has no attribute '__version__'
I manually installed
mxnet-cu100 1.6.0b20191102
and it works (still with the broken OOM behavior), but I think this is not the right thing to do.
What are the final steps? Is there a doc I missed?
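(A quick, hypothetical sanity check, not part of the original report, for which MXNet build Python is actually picking up; if the second command fails with an AttributeError, Python is resolving a stale or partially installed copy rather than the freshly built one.)
pip list | grep -i mxnet
python -c "import mxnet; print(mxnet.__version__)"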
@ptrendx Clone this project:
https://github.com/deepinsight/insightface
Download model-r100-ii from here:
https://www.dropbox.com/s/tj96fsm6t6rq8ye/model-r100-arcface-ms1m-refine-v2.zip?dl=0
and extract it into models, keeping the subfolder.
Copy the attached test into deploy and run it from that folder.
I've noticed that I get the error only if the batch size is greater than one.
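(For readers without the attached test, a rough, hypothetical sketch of batched inference against the extracted checkpoint using the Module API; the checkpoint prefix, epoch, fc1_output layer name, and 112x112 input size are assumptions based on this thread and insightface's deploy code, so the real test script may differ.)
import mxnet as mx
import numpy as np

ctx = mx.gpu(0)
# assumed layout: models/model-r100-ii/model-symbol.json and model-0000.params
sym, arg_params, aux_params = mx.model.load_checkpoint('models/model-r100-ii/model', 0)
sym = sym.get_internals()['fc1_output']          # embedding layer, as in insightface's deploy code
mod = mx.mod.Module(symbol=sym, context=ctx, data_names=['data'], label_names=None)

for batch_size in (1, 16):                        # the error reportedly appears only for batch sizes > 1
    mod.bind(for_training=False,
             data_shapes=[('data', (batch_size, 3, 112, 112))],
             force_rebind=True)
    mod.set_params(arg_params, aux_params)
    data = mx.nd.array(np.random.rand(batch_size, 3, 112, 112).astype(np.float32), ctx=ctx)
    mod.forward(mx.io.DataBatch([data]), is_train=False)
    print(batch_size, mod.get_outputs()[0].asnumpy().shape)   # the OOM surfaces on the copy to host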
@lorenzob you need to uninstall all mxnet packages first (i.e. uninstall mxnet-cu100), and then install the source-compiled version. It is expected that only the mxnet package (and no mxnet-cuX package) is installed afterwards.
The attribute error you experienced is due to having two versions installed (in my experience).
Sorry to hear you ran into issues with the cmake & cuda setup. The requirement for a recent CMake version will be properly declared once https://github.com/apache/incubator-mxnet/pull/17031 is reviewed and merged.
@leezu I added the mxnet-cuX after I saw this error:
AttributeError: module 'mxnet' has no attribute 'cpu'
Right after the build and install this is the situation:
Python 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 19:16:44)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> dir(mx)
['__doc__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
>>> mx.__path__
_NamespacePath(['/home/trz/miniconda3/envs/facerec/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet'])
The module is found inside the conda env but is completely empty.
The last line from the ninja build is this:
750/750] : && /usr/bin/c++ -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -std=c++11 -mf16c -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -fopenmp -std=c++0x -O3 -DNDEBUG -rdynamic tests/CMakeFiles/mxnet_unit_tests.dir/cpp/engine/engine_shutdown_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/engine/omp_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/engine/thread_local_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/engine/threaded_engine_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/kvstore/gpu_topology_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/misc/base.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/misc/libinfo_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/activation_perf.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/batchnorm_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/coreop_perf.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/dropout_perf.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/fully_conn_perf.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/krprod_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/mkldnn_operator_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/mkldnn_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/runner/core_op_runner_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/slice_channel_perf.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/operator/tune/operator_tune_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/storage/storage_test.cc.o tests/CMakeFiles/mxnet_unit_tests.dir/cpp/test_main.cc.o -o tests/mxnet_unit_tests -L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64:/home/trz/github/incubator-mxnet/build/3rdparty/openmp/runtime/src lib/libgtest.a -Wl,--whole-archive libmxnet.a -Wl,--no-whole-archive 3rdparty/dmlc-core/libdmlc.a 3rdparty/mkldnn/src/libdnnl.a /usr/local/cuda/lib64/libnvToolsExt.so -lmkl_rt /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libcurand.so /usr/local/cuda/lib64/libcublas.so /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libcurand.so /usr/local/cuda/lib64/libcublas.so /usr/local/cuda/lib64/libcudnn.so -lrt -ljemalloc /usr/lib/x86_64-linux-gnu/libopencv_highgui.so.3.2.0 3rdparty/openmp/runtime/src/libomp.so -lpthread -llapack -ljemalloc /usr/local/cuda/lib64/libcudnn.so -lcufft -lcusolver -lnvrtc -lcuda -lrt -lrt -lpthread -llapack -lcufft -lcusolver -lnvrtc -lcuda /usr/lib/x86_64-linux-gnu/libopencv_imgcodecs.so.3.2.0 /usr/lib/x86_64-linux-gnu/libopencv_imgproc.so.3.2.0 /usr/lib/x86_64-linux-gnu/libopencv_core.so.3.2.0 -ldl -lpthread && :
Did the build complete correctly?
@lorenzob I think the problem is that you didn't uninstall all mxnet versions obtained via pip before installing the self-compiled version.
I.e., first pip uninstall mxnet, pip uninstall mxnet-cuX, etc.
Then cd ~/path/to/mxnet-incubator/python; pip install --user -e . to install the self-compiled version.
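(Putting those steps together; the checkout path below is just an example.)
pip uninstall mxnet mxnet-cu100          # remove every pip-installed MXNet first
cd ~/github/incubator-mxnet/python
pip install --user -e .                  # editable install of the self-compiled build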
@leezu I did remove all the mxnet packages and also created a new conda env to be sure.
But I used the "setup.py install" script to install the module rather than pip directly. Now it works, thanks, but the error is still there:
[...]
File "/home/trz/progetti/zzzz/git_repo/ai-face-matching/arcface/helper.py", line 156, in detect_first_stage
output = net.predict(input_buf)
File "/home/trz/github/incubator-mxnet/python/mxnet/model.py", line 750, in predict
o_list.append(o_nd[0:real_size].asnumpy())
File "/home/trz/github/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2552, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/trz/github/incubator-mxnet/python/mxnet/base.py", line 278, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:48:23] /home/trz/github/incubator-mxnet/include/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory
Thanks for confirming @lorenzob. So you confirmed the issue disappears when setting MXNET_USE_FUSION=0 with the code from #17114, correct?
How close to out of memory are you with MXNET_USE_FUSION=0? (i.e., what's the memory usage?)
CC @ptrendx
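(One way to read the device memory numbers directly from MXNet instead of nvidia-smi; a sketch, assuming the build under test is recent enough to expose gpu_memory_info.)
import mxnet as mx
free, total = mx.context.gpu_memory_info(0)   # bytes free/total as reported by the CUDA driver
print('GPU0: %.0f MiB free of %.0f MiB' % (free / 2**20, total / 2**20))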
@leezu @ptrendx I used commit d000c3, which I think includes #17114.
I no longer get the "Operator is non-differentiable" error if I do not set MXNET_USE_FUSION. Setting MXNET_USE_FUSION=0 never solved the OOM error.
I still get the OOM if I use more than 10 112x112 images in one batch (for inference). When I get the first OOM, I'm actually filling all the available free memory (about 3 GB), and memory usage remains near 100% after the first OOM exception (I added a time.sleep and checked this with nvidia-smi).
The first OOM and the second one are different:
First:
File "core/facerec.py", line 184, in compare
embedding = model.model.get_outputs()[0].asnumpy()
File "/home/trz/github/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2552, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/trz/github/incubator-mxnet/python/mxnet/base.py", line 278, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [12:42:39] ../src/storage/./pooled_storage_manager.h:161: cudaMalloc retry failed: out of memory
Second:
File "/home/trz/progetti/zzzz/git_repo/ai-face-matching/arcface/helper.py", line 156, in detect_first_stage
output = net.predict(input_buf)
File "/home/trz/github/incubator-mxnet/python/mxnet/model.py", line 750, in predict
o_list.append(o_nd[0:real_size].asnumpy())
File "/home/trz/github/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2552, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/trz/github/incubator-mxnet/python/mxnet/base.py", line 278, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [12:39:32] /home/trz/github/incubator-mxnet/include/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory
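(For completeness, the cleanup pattern that is usually attempted after such an exception looks roughly like the sketch below; per this thread it does not actually recover the memory, which is the bug being reported. run_inference is a hypothetical stand-in for the forward + asnumpy call.)
import mxnet as mx

def try_batch(run_inference, batch):
    # run_inference is a hypothetical callable wrapping mod.forward(...) + .asnumpy()
    try:
        return run_inference(batch)
    except mx.base.MXNetError:
        mx.nd.waitall()            # drain any pending asynchronous GPU work
        mx.gpu(0).empty_cache()    # ask the pooled allocator to release cached blocks
        raise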
I seem to be having the same problem.
File ..., line 225, in detect
scores = net_out[idx].asnumpy()
File "/opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
ctypes.c_size_t(data.size)))
File "/opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [05:22:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory
Stack trace:
[bt] (0) /opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f8f0701e4cb]
[bt] (1) /opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f8f09ac7431]
[bt] (2) /opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f8f09d241ee]
[bt] (3) /opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f8f09d27a16]
[bt] (4) /opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f8f091497a9]
[bt] (5) /opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f8f0914fa1a]
[bt] (6) /opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f8f0912fcd1]
[bt] (7) /opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f8f091331e0]
[bt] (8) /opt/conda/envs/mlaas/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f8f09133476]
@lorenzob were you able to solve the out of memory errors? @szha, if the problem is in the call to mshadow, is there a way of working around mshadow to avoid it? Thanks.
@ballcue No, but I have not done further tests with the latest versions since the last post. As an obvious workaround I reduced the batch size, and since my process runs under gunicorn, I kill the process in case of OOM so that a new one is respawned.
@lorenzob Got it, thank you.
Description
After a "cudaMalloc failed: out of memory" error is raised, everything becomes unusable, with more out of memory errors even when a smaller batch is provided.
If I load two models, both become unusable after the OOM.
Error Message
To Reproduce
Invoke the model with a large enough batch to get an OOM. Now call it again with a small batch that would not throw an OOM if called on its own. This small batch throws an OOM too.
It looks like the memory from the previous call is not completely released.
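(A self-contained, hypothetical way to illustrate the same two-call pattern with bare allocations; the real report goes through model inference rather than a single large allocation, so this may not reproduce the pooled-allocator state exactly.)
import mxnet as mx

ctx = mx.gpu(0)
try:
    big = mx.nd.zeros((1 << 30,), ctx=ctx)   # deliberately larger than the free memory on a ~3 GB card
    big.wait_to_read()                       # force the allocation to actually happen
except mx.base.MXNetError:
    print('first OOM, as expected')

small = mx.nd.zeros((1024,), ctx=ctx)        # per the report, this also fails with OOM
small.wait_to_read()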
What have you tried to solve it?
mxnet.context.current_context().empty_cache()
mxnet.gpu(0).empty_cache()
Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below: