apache / mxnet

gluon.utils.split_and_load(even_split=True) is much slower than even_split=False #8645

Open eric-haibin-lin opened 6 years ago

eric-haibin-lin commented 6 years ago


Description

import mxnet as mx
from mxnet import gluon
import time
ctx_list = [mx.cpu(0), mx.cpu(1)]
label = mx.nd.ones((128,240000))
mx.nd.waitall()
start = time.time()
for i in range(10):
    labels = gluon.utils.split_and_load(label, ctx_list, batch_axis=1, even_split=True)
mx.nd.waitall()
end = time.time()
print(end - start)

With even_split=True the loop takes 2.236 seconds; changing even_split to False brings it down to 0.68 seconds.
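For reference, splitting along batch_axis=1 divides the array on its second axis, one slice per context. The semantics can be illustrated in NumPy (this is only an illustration of the expected slice shapes, not MXNet's implementation):

```python
import numpy as np

# Same shape as the benchmark above; float32 to keep memory modest
label = np.ones((128, 240000), dtype=np.float32)

# An even split along axis 1 into two halves, one per context
halves = np.array_split(label, 2, axis=1)
print([h.shape for h in halves])  # [(128, 120000), (128, 120000)]
```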

Environment info (Required)

Deep Learning AMI CUDA 9 Ubuntu, p2.8xlarge.


Build info (Required if built from source):


MXNet commit hash: 399ac038da885ff3dce8e43bcbdf76bb62522e73

Build config (config.mk excerpt):

USE_BLAS=openblas
ADD_CFLAGS += -I/usr/include/openblas
USE_CUDA=1
USE_CUDA_PATH=/usr/local/cuda
USE_CUDNN=1


anirudhacharya commented 5 years ago

The difference in performance appears only when batch_axis=1; with batch_axis=0 the difference is negligible.

utils.split_and_load internally calls utils.split_data, and the real performance difference is in split_data when batch_axis=1.
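The branching in split_data can be sketched as follows. This is a simplified NumPy sketch of the control flow, not the real MXNet code: the function name split_data_sketch is hypothetical, and np.split / np.take stand in for MXNet's split and slice_axis operators.

```python
import numpy as np

def split_data_sketch(data, num_slice, batch_axis=0, even_split=True):
    """Simplified sketch of gluon.utils.split_data's branching (illustration only)."""
    size = data.shape[batch_axis]
    step = size // num_slice
    if batch_axis == 0:
        # Axis-0 splitting is plain slicing along the first axis
        return [data[i * step:(i + 1) * step] for i in range(num_slice)]
    if even_split:
        # even_split=True goes through a single split operator
        # (np.split here stands in for MXNet's split)
        return np.split(data, num_slice, axis=batch_axis)
    # even_split=False takes one slice per output instead
    # (np.take here stands in for MXNet's slice_axis)
    return [np.take(data, range(i * step, (i + 1) * step), axis=batch_axis)
            for i in range(num_slice)]

parts = split_data_sketch(np.ones((4, 6)), 2, batch_axis=1, even_split=True)
print([p.shape for p in parts])  # [(4, 3), (4, 3)]
```

The reported slowdown is thus a difference between the two batch_axis=1 branches, not between different output shapes: both paths produce the same slices.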

anirudhacharya commented 5 years ago

The difference in performance arises here: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/utils.py#L83-L89

For now, I think the only way to fix it would be to implement an operator like numpy.array_split.
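For context, numpy.array_split differs from numpy.split in that it tolerates sizes that do not divide evenly, handing out the remainder one element at a time to the leading slices:

```python
import numpy as np

x = np.arange(10)
parts = np.array_split(x, 3)  # 10 does not divide by 3; np.split would raise
print([len(p) for p in parts])  # [4, 3, 3]
```

An MXNet operator with these semantics would let even_split=True fall back gracefully instead of forcing users onto the even_split=False path.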

anirudhacharya commented 5 years ago

@mxnet-label-bot add [Operator, Feature Request]