apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Wrong gradients on Windows-GPU #20471

Open matteosal opened 3 years ago

matteosal commented 3 years ago

I only see this on Windows. Download the symbol file (sym.zip) and run this script:

import mxnet as mx

json_path = 'sym.json'
sym = mx.sym.load(json_path)

def run_example(ctx, reqs):
    ex = sym._bind(
        ctx,
        {
            '.Inputs.Input': mx.ndarray.array([[1, 2, 3]], ctx=ctx),
            '.Inputs.Target': mx.ndarray.array([[4, 5, 6]], ctx=ctx),
            'seq_715248120': mx.ndarray.array([3], ctx=ctx)
        },
        args_grad={
            '.Inputs.Input': mx.ndarray.zeros([1, 3], ctx=ctx),
            '.Inputs.Target': mx.ndarray.zeros([1, 3], ctx=ctx),
            'seq_715248120': mx.ndarray.zeros([1], ctx=ctx)
        },
        grad_req=dict(zip(['.Inputs.Input', '.Inputs.Target', 'seq_715248120'], reqs))
    )

    ex.forward()
    ex.backward(out_grads=[mx.ndarray.array([1], ctx=ctx), mx.ndarray.array([1], ctx=ctx)])

    print(ex.grad_dict)

print('Input + Target gradient, CPU (OK):')
run_example(mx.cpu(), ['write', 'write', 'null'])
print('\n')
print('Input + Target gradient, GPU (OK):')
run_example(mx.gpu(), ['write', 'write', 'null'])
print('\n')
print('Target gradient only, CPU (OK):')
run_example(mx.cpu(), ['null', 'write', 'null'])
print('\n')
print('Target gradient only, GPU (WRONG):')
run_example(mx.gpu(), ['null', 'write', 'null'])

Output is:

Input + Target gradient, CPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @cpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}

Input + Target gradient, GPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

Target gradient only, CPU (OK):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}

Target gradient only, GPU (WRONG):
{'.Inputs.Input': None, '.Inputs.Target':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

The Target gradient has the sign flipped in the last example.
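As a sanity check on the expected signs: the printed values are consistent with a loss of the form mean(Target - Input) (an assumption; the actual graph ships only in sym.json), whose analytic gradients are -1/3 per element w.r.t. Input and +1/3 w.r.t. Target. A pure-numpy central-difference check of the Target gradient confirms the positive sign the CPU run reports:

```python
import numpy as np

# Hypothetical reconstruction of the loss: the printed gradients
# (-1/3 for Input, +1/3 for Target) match loss = mean(target - input).
def loss(x, t):
    return np.mean(t - x)

x = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])

# Central finite differences w.r.t. each element of Target.
eps = 1e-6
grad_t = np.zeros_like(t)
for i in range(t.size):
    dt = np.zeros_like(t)
    dt[i] = eps
    grad_t[i] = (loss(x, t + dt) - loss(x, t - dt)) / (2 * eps)

print(grad_t)  # each entry ~ +1/3, matching the correct CPU output
```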

matteosal commented 3 years ago

I see the same sign flip with this other symbol, which can be fed to the same script above: sym2.zip

And with this one, sym3.zip, which goes with this script:

import numpy as np
import mxnet as mx

json_path = 'sym3.json'
sym = mx.sym.load(json_path)

input_1 = np.random.rand(1, 2, 3, 4).tolist()
input_2 = np.random.rand(1, 2, 4).tolist()
input_3 = np.random.rand(1, 2).tolist()

def run_example(ctx, reqs):
    ex = sym._bind(
        ctx,
        {
            '.Inputs.Input1': mx.ndarray.array(input_1, ctx=ctx),
            '.Inputs.Input2': mx.ndarray.array(input_2, ctx=ctx),
            '.Inputs.Input3': mx.ndarray.array(input_3, ctx=ctx)
        },
        args_grad={
            '.Inputs.Input1': mx.ndarray.zeros([1, 2, 3, 4], ctx=ctx),
            '.Inputs.Input2': mx.ndarray.zeros([1, 2, 4], ctx=ctx),
            '.Inputs.Input3': mx.ndarray.zeros([1, 2], ctx=ctx)
        },
        grad_req=dict(zip(['.Inputs.Input1', '.Inputs.Input2', '.Inputs.Input3'], reqs))
    )

    ex.forward()
    ex.backward(out_grads=[mx.ndarray.ones([1, 2, 3, 4], ctx=ctx)])

    print(ex.grad_dict['.Inputs.Input2'])

print('Input1 + Input2 gradient, CPU (OK):')
run_example(mx.cpu(), ['write', 'write', 'null'])
print('\n')
print('Input1 + Input2 gradient, GPU (OK):')
run_example(mx.gpu(), ['write', 'write', 'null'])
print('\n')
print('Input2 gradient only, CPU (OK):')
run_example(mx.cpu(), ['null', 'write', 'null'])
print('\n')
print('Input2 gradient only, GPU (WRONG):')
run_example(mx.gpu(), ['null', 'write', 'null'])

Output is:

Input1 + Input2 gradient, CPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>

Input1 + Input2 gradient, GPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @gpu(0)>

Input2 gradient only, CPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>

Input2 gradient only, GPU (WRONG):

[[[3. 2. 3. 2.]
  [0. 2. 2. 3.]]]
<NDArray 1x2x4 @gpu(0)>

TristonC commented 3 years ago

@matteosal Which version of MXNet did you use?

TristonC commented 3 years ago

With your sym3 example, here is what I got with MXNet 1.9 on Linux. I'm not sure whether this issue only occurs on Windows. @matteosal, did you try it on Linux?


Input1 + Input2 gradient, CPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:37:54] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>

Input1 + Input2 gradient, GPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:38:01] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>

Input2 gradient only, CPU (OK):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>

Input2 gradient only, GPU (WRONG):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>

matteosal commented 3 years ago

I am using version 2.0, built from source at commit fabcd145cd496628791f9f2ea813048360ac33ca. I have tried the same example on Linux (building from the same commit) and the results are correct there. This issue only affects Windows.

TristonC commented 3 years ago

@matteosal Thanks for the update. @leezu, do you have a Windows platform to help triage the problem?

leezu commented 3 years ago

I'm not a Windows user, so it's very hard for me to get MXNet running on Windows. @yajiedesign is a Windows expert; maybe he can help.

chinakook commented 3 years ago

I've tested with a 2.0 version I modified myself on Windows, and it's OK:

Input + Target gradient, CPU (OK):
{'.Inputs.Input': 
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @cpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}

Input + Target gradient, GPU (OK):
{'.Inputs.Input': 
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

Target gradient only, CPU (OK):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}

Target gradient only, GPU (WRONG):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

TristonC commented 3 years ago

@chinakook What did you modify? Is it related to this gradient issue? Could you share it with @matteosal?

matteosal commented 3 years ago

A ping on this. @chinakook, what modification are you talking about? Can you reproduce the problem on a plain v2.0 build?

matteosal commented 2 years ago

A ping on this. Can anyone please investigate?

matteosal commented 2 years ago

@szha @leezu another ping on this :)

barry-jin commented 2 years ago

@matteosal What build settings should we use to reproduce this issue?

matteosal commented 2 years ago

@barry-jin here they are:

cmake -G"Visual Studio 15 2017 Win64" -T host=x64 ^
 %= GENERAL FLAGS =% ^
 -DCMAKE_INSTALL_PREFIX=%output_dir% ^
 -DCMAKE_BUILD_TYPE=Release ^
 -DCMAKE_SKIP_BUILD_RPATH=On ^
 -DUSE_OPENCV=OFF ^
 -DUSE_F16C=Off %= float16 support =%^
 -DUSE_INT64_TENSOR_SIZE=ON ^
 -DCMAKE_C_FLAGS="-D_WIN32" ^
 -DCMAKE_CXX_FLAGS="-D_WIN32" ^
 -DCMAKE_C_FLAGS_RELEASE="/MT -DNDEBUG" ^
 -DCMAKE_CXX_FLAGS_RELEASE="/MT -DNDEBUG" ^
 -DMXNET_FORCE_SHARED_CRT=OFF %= link statically to C runtime =%^
 -DCMAKE_SHARED_LINKER_FLAGS="/DELAYLOAD:nvcuda.dll delayimp.lib" ^
 -DUSE_MXNET_LIB_NAMING=OFF ^
 %= MATH BACKENDS =% ^
 -DBLAS=MKL ^
 -DUSE_LAPACK=OFF ^
 -DUSE_ONEDNN=OFF ^
 -DBLA_VENDOR="Intel10_64ilp" ^
 -DBLA_STATIC=OFF ^
 -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=OFF ^
 -DMKL_INCLUDE_DIR=%mkl_dir% ^
 -DBLAS_LIBRARIES="%mkl_dir%/libiomp5md.lib;%mkl_dir%/mkl_core_dll.lib;%mkl_dir%/mkl_intel_ilp64_dll.lib;%mkl_dir%/mkl_intel_thread_dll.lib" ^
 %= OPENMP =% ^
 -DUSE_OPENMP=ON ^
 -DOpenMP_C_FLAGS="-I%mkl_dir%" ^
 -DOpenMP_C_LIB_NAMES="libiomp5" ^
 -DOpenMP_CXX_FLAGS="-I%mkl_dir%" ^
 -DOpenMP_CXX_LIB_NAMES="libiomp5" ^
 -DOpenMP_libiomp5_LIBRARY="%mkl_dir%/libiomp5md.lib" ^
 %= CUDA =% ^
 -DUSE_CUDA=ON ^
 -DUSE_CUDNN=ON ^
 -DCUDNN_LIBRARY=%home_dir:\=/%cuDNN/lib/cudnn64_8.lib ^
 -DCUDNN_INCLUDE=%home_dir:\=/%cuDNN/include ^
 -DUSE_NCCL=OFF ^
 -DUSE_NVML=OFF ^
 -DCUDNN_ROOT=%home_dir:\=/%cuDNN ^
 -DMXNET_CUDA_ARCH="3.7"\;"5.0"\;"6.0"\;"7.0"\;"8.0+PTX" %= see Readme =%^
 -DCUDAToolkit_ROOT=%cuda_dir% ^
 -DCMAKE_CUDA_COMPILER="%cuda_dir%/bin/nvcc.exe" -I"%cuda_dir%/include" -L"%cuda_dir%/lib/x64"  ^
 -DUSE_SPLIT_ARCH_DLL=OFF ^
 %mxnet_dir%

The MKL version is 2019.4 and the CUDA version is 11.4.0.

matteosal commented 2 years ago

@barry-jin, any news on this? I have rebuilt with VC2019, but I still see this problem.

barry-jin commented 2 years ago

Sorry, I'm still triaging this issue. I built with the settings in build_windows.py and can also reproduce it.

barry-jin commented 2 years ago

@matteosal The current workaround is to replace 'elemwise_sub' with '_npi_subtract'. There are probably some issues in the legacy subtract operator.
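If regenerating the network isn't convenient, the same swap can be applied to the saved symbol file before loading it. A sketch, assuming the standard MXNet symbol JSON layout (a top-level "nodes" list whose entries carry an "op" field); the toy graph below is a stand-in for the real sym.json:

```python
import json

def patch_symbol(sym_json_text):
    """Replace the legacy elemwise_sub op with _npi_subtract in a
    serialized MXNet symbol (assumes the usual 'nodes'/'op' layout)."""
    graph = json.loads(sym_json_text)
    for node in graph.get("nodes", []):
        if node.get("op") == "elemwise_sub":
            node["op"] = "_npi_subtract"
    return json.dumps(graph)

# Toy stand-in for sym.json, just to show the transformation:
toy = json.dumps({"nodes": [{"op": "null", "name": "a"},
                            {"op": "elemwise_sub", "name": "sub0"}]})
patched = json.loads(patch_symbol(toy))
print(patched["nodes"][1]["op"])  # _npi_subtract
```

The patched JSON string can then be written back to disk and loaded with mx.sym.load as in the original script.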

matteosal commented 2 years ago

@barry-jin, thank you. I have verified that swapping the operator fixes the problem.
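For verifying fixes like this one, comparing CPU and GPU gradients programmatically is less error-prone than eyeballing the printed arrays. A small pure-numpy helper; the dict-of-arrays shape mirrors ex.grad_dict after converting each entry with .asnumpy(), with None marking args bound with grad_req='null' (an illustrative sketch, not part of the MXNet API):

```python
import numpy as np

def compare_grads(cpu_grads, gpu_grads, atol=1e-6):
    """Return the names of args whose CPU and GPU gradients disagree.

    Both arguments map arg names to numpy arrays, with None for args
    bound with grad_req='null'.
    """
    mismatched = []
    for name, cpu in cpu_grads.items():
        gpu = gpu_grads[name]
        if (cpu is None) != (gpu is None):
            mismatched.append(name)
            continue
        if cpu is None:
            continue
        if not np.allclose(cpu, gpu, atol=atol):
            mismatched.append(name)
    return mismatched

# The values reported in this issue: the Target gradient comes back
# sign-flipped on GPU in the grad_req='null' case.
cpu = {".Inputs.Input": None, ".Inputs.Target": np.full((1, 3), 1 / 3)}
gpu = {".Inputs.Input": None, ".Inputs.Target": np.full((1, 3), -1 / 3)}
print(compare_grads(cpu, gpu))  # ['.Inputs.Target']
```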