Slow performance of argmax compared to max on GPU

Description

Slyforce@ has reported that argmax is slow compared to max. I tried it on an EC2 machine and can confirm the finding: for large reduction dimensions, the gap between max and argmax looks suspiciously large. Haibin suspects the code is not parallelized well.
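For context on why a large gap is surprising: argmax can be written as the same kind of pairwise reduction as max, just over (value, index) pairs instead of bare values, so in principle it parallelizes the same way. The sketch below is plain Python and purely illustrative. It says nothing about how MXNet's kernel is actually written, and `combine` and `tree_argmax` are hypothetical names.

```python
def combine(a, b):
    """Associative combiner on (value, index) pairs; ties keep the lower index."""
    if b[0] > a[0] or (b[0] == a[0] and b[1] < a[1]):
        return b
    return a


def tree_argmax(values):
    """Argmax as a tree reduction: each level halves the number of candidates."""
    pairs = [(v, i) for i, v in enumerate(values)]
    if not pairs:
        raise ValueError("tree_argmax() of an empty sequence")
    while len(pairs) > 1:
        # Every combine() call within one level is independent of the others,
        # which is what a parallel implementation would exploit.
        nxt = [combine(pairs[k], pairs[k + 1]) for k in range(0, len(pairs) - 1, 2)]
        if len(pairs) % 2:
            nxt.append(pairs[-1])  # odd element carries over to the next level
        pairs = nxt
    return pairs[0][1]


print(tree_argmax([3.0, 9.0, 1.0, 9.0, 2.0]))
```

The only extra work relative to max is carrying an index alongside each value, so a well-parallelized argmax would be expected to stay within a small constant factor of max.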

Environment info (Required)

----------Python Info----------
Version      : 3.6.4
Compiler     : GCC 7.2.0
Build        : ('default', 'Jan 16 2018 18:10:19')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ubuntu/.virtualenvs/so_question2/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.2.0
Directory    : /home/ubuntu/.virtualenvs/so_question2/lib/python3.6/site-packages/mxnet
Commit Hash   : 297c64fd2ee404612aa3ecc880b940fb2538039c
----------System Info----------
Platform     : Linux-4.4.0-1054-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-84-4
release      : 4.4.0-1054-aws
version      : #63-Ubuntu SMP Wed Mar 28 19:42:42 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.984
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.16
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single retpoline kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0017 sec, LOAD: 0.4570 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0665 sec, LOAD: 0.0495 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 1.3137 sec, LOAD: 0.3615 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0214 sec, LOAD: 0.1381 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0029 sec, LOAD: 0.1154 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0025 sec, LOAD: 0.0361 sec.

Package used (Python/R/Scala/Julia): Python 3

Minimum reproducible example

import time
import mxnet as mx

def max(x, ctx):
    return mx.nd.max(x, axis=1)

def argmax(x, ctx):
    return mx.nd.argmax(x, axis=1)

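def measure_time(func, iters, inputs, ctx):
    # Time `iters` calls of `func`, each on one (batch, reduction) slice.
    begin = time.time()
    for i in range(iters):
        result = func(inputs[i,:,:], ctx=ctx)
        # MXNet executes asynchronously; block until the result is computed.
        result.wait_to_read()

    return time.time() - begin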

ctx = mx.gpu()
batch_size = 32
iterations = 500
for reduction_dimension in [25, 50, 100, 1000, 10000, 100000]:
    print('reduction dimension: {}'.format(reduction_dimension))
    inputs = mx.nd.random_uniform(0, 100,
                                  shape=(iterations, batch_size, reduction_dimension),
                                  ctx=ctx)

    t = measure_time(argmax, iterations, inputs, ctx)
    print("argmax took {} seconds".format(t))

    t = measure_time(max, iterations, inputs, ctx)
    print("max took {} seconds".format(t))

    print('')

If I run it I get:

reduction dimension: 25
argmax took 0.15082168579101562 seconds
max took 0.13338756561279297 seconds

reduction dimension: 50
argmax took 0.17458558082580566 seconds
max took 0.15340065956115723 seconds

reduction dimension: 100
argmax took 0.26195740699768066 seconds
max took 0.19835686683654785 seconds

reduction dimension: 1000
argmax took 1.2869455814361572 seconds
max took 0.7969081401824951 seconds

reduction dimension: 10000
argmax took 11.152163982391357 seconds
max took 7.157193422317505 seconds

reduction dimension: 100000
argmax took 114.18031907081604 seconds
max took 70.90202450752258 seconds
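To make the gap easier to read, the argmax/max ratio can be computed directly from the timings above. This is only arithmetic on the numbers already reported; `slowdown_ratios` is a helper name introduced here for illustration.

```python
# Timings copied from the output above: (argmax, max) in seconds for 500 iterations.
timings = {
    25: (0.15082168579101562, 0.13338756561279297),
    50: (0.17458558082580566, 0.15340065956115723),
    100: (0.26195740699768066, 0.19835686683654785),
    1000: (1.2869455814361572, 0.7969081401824951),
    10000: (11.152163982391357, 7.157193422317505),
    100000: (114.18031907081604, 70.90202450752258),
}


def slowdown_ratios(timings):
    """Return {reduction_dimension: argmax_time / max_time}."""
    return {dim: a / m for dim, (a, m) in timings.items()}


ratios = slowdown_ratios(timings)
for dim, ratio in ratios.items():
    print("reduction dimension {:>6}: argmax/max = {:.2f}".format(dim, ratio))
```

The ratio grows from about 1.1 at the smallest dimensions to about 1.6 at the largest ones.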

Steps to reproduce

Run the script above.
Compare the timings: argmax takes roughly 1.6 times as long as max at reduction dimensions of 1000 and above.
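When reproducing, it may help to separate operator time from everything else in the loop, since the script above also times the `inputs[i,:,:]` slicing on every iteration. The following is a minimal sketch of a timing helper with warm-up runs and per-call samples. `benchmark` and its `sync` and `warmup` parameters are hypothetical names, and the demo at the bottom uses plain Python lists so that it runs without a GPU.

```python
import statistics
import time


def benchmark(func, inputs, sync=lambda result: None, warmup=3):
    """Return (median_seconds_per_call, total_seconds) for func over inputs.

    `sync` is called on each result to force completion. With MXNet one
    would presumably pass `lambda r: r.wait_to_read()` (an assumption about
    usage, not something tested here).
    """
    # Warm-up calls are excluded from the measurement.
    for x in inputs[:warmup]:
        sync(func(x))

    samples = []
    for x in inputs:
        start = time.perf_counter()
        sync(func(x))
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), sum(samples)


# Stand-in demo with the built-in max on Python lists.
median, total = benchmark(max, [[1.0, 5.0, 2.0]] * 10)
print("median per call: {:.2e} s, total: {:.2e} s".format(median, total))
```

For the MXNet case, the inputs would be sliced into a list before the call so that slicing is not part of the timed region. Whether that changes the argmax/max ratio is not something this report has measured.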

apache / mxnet, issue #11337