@Slyforce has reported slow performance of argmax compared to max. I tried it on an EC2 machine and can confirm the finding: at large reduction dimensions the gap between max and argmax looks suspiciously high. Haibin suspects the argmax code is not parallelized well.
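For context, a tiny example of the two operators being compared (not part of the benchmark; expected outputs in the comments):

import mxnet as mx

x = mx.nd.array([[3, 1, 7],
                 [2, 9, 5]])
print(mx.nd.max(x, axis=1))     # [7. 9.] - the maximum value per row
print(mx.nd.argmax(x, axis=1))  # [2. 1.] - the position of that maximum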
Environment info (Required)
----------Python Info----------
Version : 3.6.4
Compiler : GCC 7.2.0
Build : ('default', 'Jan 16 2018 18:10:19')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 10.0.1
Directory : /home/ubuntu/.virtualenvs/so_question2/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version : 1.2.0
Directory : /home/ubuntu/.virtualenvs/so_question2/lib/python3.6/site-packages/mxnet
Commit Hash : 297c64fd2ee404612aa3ecc880b940fb2538039c
----------System Info----------
Platform : Linux-4.4.0-1054-aws-x86_64-with-debian-stretch-sid
system : Linux
node : ip-172-31-84-4
release : 4.4.0-1054-aws
version : #63-Ubuntu SMP Wed Mar 28 19:42:42 UTC 2018
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2699.984
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.16
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single retpoline kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0017 sec, LOAD: 0.4570 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0665 sec, LOAD: 0.0495 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 1.3137 sec, LOAD: 0.3615 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0214 sec, LOAD: 0.1381 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0029 sec, LOAD: 0.1154 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0025 sec, LOAD: 0.0361 sec.
Package used (Python/R/Scala/Julia):
Python 3
Minimum reproducible example
import time
import mxnet as mx

def max(x, ctx):  # shadows the built-in max; name kept as in the original report
    return mx.nd.max(x, axis=1)

def argmax(x, ctx):
    return mx.nd.argmax(x, axis=1)

def measure_time(func, iters, inputs, ctx):
    # Time `iters` calls; wait_to_read() forces each asynchronous op to
    # finish before the next iteration, so we measure real execution time.
    begin = time.time()
    for i in range(iters):
        result = func(inputs[i, :, :], ctx=ctx)
        result.wait_to_read()
    return time.time() - begin

ctx = mx.gpu()
batch_size = 32
iterations = 500

for reduction_dimension in [25, 50, 100, 1000, 10000, 100000]:
    print('reduction dimension: {}'.format(reduction_dimension))
    # One (batch_size, reduction_dimension) slice per timed iteration.
    inputs = mx.nd.random_uniform(0, 100,
                                  shape=(iterations, batch_size, reduction_dimension),
                                  ctx=ctx)
    t = measure_time(argmax, iterations, inputs, ctx)
    print("argmax took {} seconds".format(t))
    t = measure_time(max, iterations, inputs, ctx)
    print("max took {} seconds".format(t))
    print('')
If I run it I get:
reduction dimension: 25
argmax took 0.15082168579101562 seconds
max took 0.13338756561279297 seconds
reduction dimension: 50
argmax took 0.17458558082580566 seconds
max took 0.15340065956115723 seconds
reduction dimension: 100
argmax took 0.26195740699768066 seconds
max took 0.19835686683654785 seconds
reduction dimension: 1000
argmax took 1.2869455814361572 seconds
max took 0.7969081401824951 seconds
reduction dimension: 10000
argmax took 11.152163982391357 seconds
max took 7.157193422317505 seconds
reduction dimension: 100000
argmax took 114.18031907081604 seconds
max took 70.90202450752258 seconds
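Per call, that is roughly 0.23 s for argmax vs. 0.14 s for max at reduction dimension 100000, i.e. argmax is consistently about 1.6x slower once the reduction dimension is large. This supports the parallelization suspicion: argmax has the same reduction structure as max, just carrying (value, index) pairs, so in principle it should parallelize equally well. A minimal NumPy sketch of that pairwise tree reduction (tree_argmax is a hypothetical helper for illustration, not MXNet's implementation):

import numpy as np

def tree_argmax(x):
    """Pairwise (value, index) reduction along the last axis, mirroring
    the tree structure of a parallel max reduction."""
    vals = x.copy()
    idxs = np.broadcast_to(np.arange(x.shape[-1]), x.shape).copy()
    while vals.shape[-1] > 1:
        n = vals.shape[-1]
        half = n // 2
        left_v, right_v = vals[..., :half], vals[..., half:2 * half]
        left_i, right_i = idxs[..., :half], idxs[..., half:2 * half]
        take_right = right_v > left_v          # ties go to the left element
        vals_m = np.where(take_right, right_v, left_v)
        idxs_m = np.where(take_right, right_i, left_i)
        if n % 2:                              # carry the odd element forward
            vals_m = np.concatenate([vals_m, vals[..., -1:]], axis=-1)
            idxs_m = np.concatenate([idxs_m, idxs[..., -1:]], axis=-1)
        vals, idxs = vals_m, idxs_m
    return idxs[..., 0]

x = np.random.uniform(0, 100, size=(32, 1000))
# Agrees with np.argmax for distinct values (tie-breaking may differ).
assert (tree_argmax(x) == np.argmax(x, axis=1)).all()

Each round halves the data, exactly as a parallel max reduction does; the only extra work is one np.where per round to carry the indices along, which would not explain a 1.6x slowdown.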