apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Large host memory usage when using GPU with mobilenets #9574

Open larroy opened 6 years ago

larroy commented 6 years ago

Description

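Running MobileNet 1.0 inference with a GPU context uses a large amount of host (CPU) memory. In the memory_profiler output below, the mod.bind() call alone raises the process's memory usage by about 1.1 GiB (1168.586 MiB).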

Environment info (Required)

----------Python Info----------
Version      : 3.6.2
Compiler     : GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)
Build        : ('default', 'Jul 17 2017 16:44:45')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /Users/pllarroy/devel/mxnet/mxnet/mxnet_py3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
objc[26349]: Class CaptureDelegate is implemented in both /usr/local/Cellar/opencv/3.3.0_3/lib/libopencv_videoio.3.3.dylib (0x1128c35d8) and /Users/pllarroy/devel/mxnet/mxnet/mxnet_py3/lib/python3.6/site-packages/cv2/cv2.cpython-36m-darwin.so (0x128f66030). One of the two will be used. Which one is undefined.
Version      : 1.0.1
Directory    : /Users/pllarroy/devel/mxnet/mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Darwin-16.7.0-x86_64-i386-64bit
system       : Darwin
node         : 186590d670bd.ant.amazon.com
release      : 16.7.0
version      : Darwin Kernel Version 16.7.0: Thu Jan 11 22:59:40 PST 2018; root:xnu-3789.73.8~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0427 sec, LOAD: 0.8934 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0383 sec, LOAD: 0.1173 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0396 sec, LOAD: 0.8027 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0500 sec, LOAD: 0.3847 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0010 sec, LOAD: 0.1996 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0408 sec, LOAD: 0.3463 sec.

Build info (Required if built from source)

Compiler: clang

MXNet commit hash:

20253d5ce821ac012e2483b5dfb15bb5b7202f6d (Update test_gluon_model_zoo.py (#9539))

Minimum reproducible example

run_mobilenet.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'''
profile mxnet cpu memory consumption when loading a super-thin network.
requires https://pypi.python.org/pypi/memory_profiler
'''

import mxnet as mx
from collections import namedtuple
import numpy as np
import cv2
from scipy.misc import imread

Batch = namedtuple('Batch', ['data'])

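def get_mobilenet():
    '''
    Fetch the pre-trained mobilenet1.0 from Gluon's model zoo and export it
    as a checkpoint (mobilenet1.0-symbol.json, mobilenet1.0-0000.params).
    '''
    from mxnet.gluon.model_zoo import vision
    mobilenet = vision.mobilenet1_0(pretrained=True)
    image = mx.nd.zeros((1, 3, 512, 512))
    mobilenet.hybridize()
    # Run one forward pass so the hybridized graph is built before exporting.
    mobilenet.forward(image)
    mobilenet.export("mobilenet1.0")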

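# Input edge length used when binding the module. The example output below
# was recorded with 512 and context=mx.gpu().
N = 2048

# `profile` is injected by memory_profiler when the script is run with
# `python -m memory_profiler`; it is not defined otherwise.
@profile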
def my_func():
    sym, arg_params, aux_params = mx.model.load_checkpoint('mobilenet1.0', 0)
    #mod = mx.mod.Module(symbol=sym, context=mx.gpu(), label_names=None)
    mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
    mod.bind(for_training=False, data_shapes=[('data', (1,3,N,N))], label_shapes=mod._label_shapes)
    mod.set_params(arg_params, aux_params, allow_missing=False)
    img = imread('dachshund.jpg')
    img = cv2.resize(img, (512, 512))
    img = np.swapaxes(img, 0, 2)
    img = np.swapaxes(img, 1, 2)
    img = img[np.newaxis, :]
    mod.forward(Batch([mx.nd.array(img)]))
    prob = mod.get_outputs()[0].asnumpy()
    prob = np.squeeze(prob)
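    a = np.argsort(prob)[::-1]
    # Note: `a` is sorted by descending probability, so a[-10:] lists the ten
    # least likely classes; a[:10] would give the top ten.
    print(a[-10::])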

#run with: python -m memory_profiler run_mobilenet.py

if __name__ == '__main__':
    get_mobilenet()
    my_func()

# example output:
# ../e/mobilenet_memory_profiling.txt
#[745 752 307  58 968 751  74 127 947 685]
#Filename: run_mobilenet.py

#Line #    Mem usage    Increment   Line Contents
#================================================
#     8  148.652 MiB  148.652 MiB   @profile
#     9                             def my_func():
#    10  165.703 MiB   17.051 MiB       sym, arg_params, aux_params = mx.model.load_checkpoint('mobilenet1.0', 0)
#    11  165.711 MiB    0.008 MiB       mod = mx.mod.Module(symbol=sym, context=mx.gpu(), label_names=None)
#    12 1334.297 MiB 1168.586 MiB       mod.bind(for_training=False, data_shapes=[('data', (1,3,512,512))], label_shapes=mod._label_shapes)
#    13 1338.840 MiB    4.543 MiB       mod.set_params(arg_params, aux_params, allow_missing=False)
#    14 1340.129 MiB    1.289 MiB       img = imread('dachshund.jpg')
#    15 1344.844 MiB    4.715 MiB       img = cv2.resize(img, (512, 512))
#    16 1344.844 MiB    0.000 MiB       img = np.swapaxes(img, 0, 2)
#    17 1344.844 MiB    0.000 MiB       img = np.swapaxes(img, 1, 2)
#    18 1344.844 MiB    0.000 MiB       img = img[np.newaxis, :]
#    19                             #img = np.random.rand(1,3,512,512)
#    20 1346.281 MiB    1.438 MiB       mod.forward(Batch([mx.nd.array(img)]))
#    21 1346.344 MiB    0.062 MiB       prob = mod.get_outputs()[0].asnumpy()
#    22 1346.344 MiB    0.000 MiB       prob = np.squeeze(prob)
#    23 1346.402 MiB    0.059 MiB       a = np.argsort(prob)[::-1]
#    24 1346.418 MiB    0.016 MiB       print(a[-10::])
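To make the hot spot in a report like the one above easy to pick out, the table can be parsed and the row with the largest increment extracted. This is only a minimal sketch: `largest_increment` and `ROW` are hypothetical helper names, not part of memory_profiler, and the regular expression assumes the column layout shown in this issue (with or without the leading `#`).

```python
import re

# One row of the memory_profiler table: line number, total usage, increment, source.
# The optional leading '#' matches the commented-out form used in this issue.
# (Hypothetical helper; the layout is assumed from the output above.)
ROW = re.compile(r"^#?\s*(\d+)\s+(-?[\d.]+) MiB\s+(-?[\d.]+) MiB\s+(.*)$")


def largest_increment(report):
    """Return (line_no, increment_mib, source) for the row with the biggest increment."""
    best = None
    for line in report.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header, separator, or a row without measurements
        row = (int(m.group(1)), float(m.group(3)), m.group(4).strip())
        if best is None or row[1] > best[1]:
            best = row
    return best


# A few rows copied from the report above.
SAMPLE = """\
#    10  165.703 MiB   17.051 MiB       sym, arg_params, aux_params = mx.model.load_checkpoint('mobilenet1.0', 0)
#    11  165.711 MiB    0.008 MiB       mod = mx.mod.Module(symbol=sym, context=mx.gpu(), label_names=None)
#    12 1334.297 MiB 1168.586 MiB       mod.bind(for_training=False, data_shapes=[('data', (1,3,512,512))], label_shapes=mod._label_shapes)
#    13 1338.840 MiB    4.543 MiB       mod.set_params(arg_params, aux_params, allow_missing=False)
"""

line_no, increment, source = largest_increment(SAMPLE)
print(line_no, increment, source.split("(")[0])
```

On these rows the helper picks line 12, the `mod.bind` call. The first row of a full report (the `@profile` line) carries the baseline as its increment, so it may need to be skipped when parsing a complete table.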

Steps to reproduce

pip install memory_profiler
pip install opencv-python
pip install scipy
pip install Pillow
wget -O dachshund.jpg https://upload.wikimedia.org/wikipedia/commons/b/b9/Dachshund_brown_puppy.jpg
python -m memory_profiler ./run_mobilenet.py
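As a cross-check on the memory_profiler numbers, the peak resident set size can also be read from the operating system using only the standard library. The following is a rough sketch, not part of the original script: `peak_rss_mib` and `peak_growth` are hypothetical helper names, and the `resource` module is only available on Unix-like systems.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def peak_growth(step):
    """Run step() and return how much the peak RSS grew, in MiB."""
    before = peak_rss_mib()
    step()
    return peak_rss_mib() - before


if __name__ == "__main__":
    # Stand-in workload; in run_mobilenet.py this would wrap the mod.bind(...) call.
    grown = peak_growth(lambda: bytes(bytearray(b"x") * (64 * 1024 * 1024)))
    print("peak RSS grew by %.1f MiB" % grown)
```

Because `ru_maxrss` is a high-water mark, this only shows growth in the peak; memory that is allocated and then freed below an earlier peak will not appear.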

What have you tried to solve it?

  1. We will try running Massif to see where the memory is going.

larroy commented 6 years ago

It seems the allocations happen inside libcuda: image_class_massif image_class_massif_gpu
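For anyone who wants to repeat the Massif measurement, a run could be scripted roughly as below. The exact options used for the results above are not stated in this issue, so the flags here are an assumption; `massif_command` and `run_massif` are hypothetical helper names. The sketch also assumes the `@profile` decorator has been removed from run_mobilenet.py (it is only defined under memory_profiler) and that Valgrind is installed on a platform it supports.

```python
import subprocess
import sys


def massif_command(script, out_file="massif.out"):
    """Build a Valgrind Massif command line for profiling a Python script."""
    return [
        "valgrind",
        "--tool=massif",
        "--pages-as-heap=yes",            # count page-level (mmap) allocations, not only malloc
        "--massif-out-file=" + out_file,
        sys.executable,
        script,
    ]


def run_massif(script, out_file="massif.out"):
    """Profile the script under Massif, then render the snapshots as text."""
    subprocess.run(massif_command(script, out_file), check=True)
    # ms_print ships with Valgrind and formats the Massif output file.
    subprocess.run(["ms_print", out_file], check=True)


if __name__ == "__main__":
    # Only print the command here; call run_massif("run_mobilenet.py") to execute it.
    print(" ".join(massif_command("run_mobilenet.py")))
```

`--pages-as-heap=yes` is chosen here because driver libraries may obtain memory with mmap rather than malloc, which the default heap-only mode would not attribute.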