matteosal opened 3 years ago
I see the same sign flip with this other symbol (which can be fed to the same script above): sym2.zip
And with this one: sym3.zip, which goes with this script:
import numpy as np
import mxnet as mx
json_path = 'sym3.json'
sym = mx.sym.load(json_path)
input_1 = np.random.rand(1, 2, 3, 4).tolist()
input_2 = np.random.rand(1, 2, 4).tolist()
input_3 = np.random.rand(1, 2).tolist()
def run_example(ctx, reqs):
    ex = sym._bind(
        ctx,
        {
            '.Inputs.Input1': mx.ndarray.array(input_1, ctx=ctx),
            '.Inputs.Input2': mx.ndarray.array(input_2, ctx=ctx),
            '.Inputs.Input3': mx.ndarray.array(input_3, ctx=ctx)
        },
        args_grad={
            '.Inputs.Input1': mx.ndarray.zeros([1, 2, 3, 4], ctx=ctx),
            '.Inputs.Input2': mx.ndarray.zeros([1, 2, 4], ctx=ctx),
            '.Inputs.Input3': mx.ndarray.zeros([1, 2], ctx=ctx)
        },
        grad_req=dict(zip(['.Inputs.Input1', '.Inputs.Input2', '.Inputs.Input3'], reqs))
    )
    ex.forward()
    ex.backward(out_grads=[mx.ndarray.ones([1, 2, 3, 4], ctx=ctx)])
    print(ex.grad_dict['.Inputs.Input2'])
print('Input1 + Input2 gradient, CPU (OK):')
run_example(mx.cpu(), ['write', 'write', 'null'])
print('\n')
print('Input1 + Input2 gradient, GPU (OK):')
run_example(mx.gpu(), ['write', 'write', 'null'])
print('\n')
print('Input2 gradient only, CPU (OK):')
run_example(mx.cpu(), ['null', 'write', 'null'])
print('\n')
print('Input2 gradient only, GPU (WRONG):')
run_example(mx.gpu(), ['null', 'write', 'null'])
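(For context: grad_req selects per-array gradient handling in MXNet's executor, where 'write' overwrites the gradient buffer, 'add' accumulates into it, and 'null' skips that gradient entirely; the dict(zip(...)) call in the script just pairs each input name with its request. A plain-Python illustration of the mapping being built, no MXNet required:)

```python
# Build the grad_req mapping exactly as the script does, in plain Python.
names = ['.Inputs.Input1', '.Inputs.Input2', '.Inputs.Input3']
reqs = ['null', 'write', 'null']  # the failing GPU case from the script

grad_req = dict(zip(names, reqs))
print(grad_req)
# {'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
```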
Output is
Input1 + Input2 gradient, CPU (OK):
[[[-3. -2. -3. -2.]
[ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>
Input1 + Input2 gradient, GPU (OK):
[[[-3. -2. -3. -2.]
[ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @gpu(0)>
Input2 gradient only, CPU (OK):
[[[-3. -2. -3. -2.]
[ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>
Input2 gradient only, GPU (WRONG):
[[[3. 2. 3. 2.]
[0. 2. 2. 3.]]]
<NDArray 1x2x4 @gpu(0)>
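(A sanity check on the expected sign, under the assumption that the graph's output effectively subtracts a branch depending on Input2, broadcast from shape (1, 2, 4) to (1, 2, 3, 4) over axis 2. This toy sketch only shows why the Input2 gradient should come out negative with an upstream gradient of ones; it does not reproduce the exact values, since the real symbol contains more ops:)

```python
import numpy as np

# Stand-ins for the two branches feeding a broadcast subtraction.
a = np.random.rand(1, 2, 3, 4)   # plays the role of the Input1 branch
b = np.random.rand(1, 2, 4)      # plays the role of the Input2 branch
out = a - b[:, :, None, :]       # out = a - broadcast(b)

# d out / d b = -1 per broadcast element, so with an upstream gradient of
# ones, grad_b is minus the sum over the broadcast axis: every entry is -3.
out_grad = np.ones_like(out)
grad_b = -(out_grad.sum(axis=2))
print(grad_b.shape)  # (1, 2, 4)
```

The point is just that a sign-flipped (positive) Input2 gradient, as in the last GPU run, cannot be right for a subtrahend.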
@matteosal Which version of MXNet did you use?
With your sym3 example, here is what I got with MXNet 1.9 on Linux (the "WRONG" label below comes from the script; the GPU values actually match the CPU ones here). I'm not sure if this issue only occurs on Windows. @matteosal, did you try it on Linux?
Input1 + Input2 gradient, CPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:37:54] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
[[[-3. -2. -3. -3.]
[ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>
Input1 + Input2 gradient, GPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:38:01] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[[[-3. -2. -3. -3.]
[ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>
Input2 gradient only, CPU (OK):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[[[-3. -2. -3. -3.]
[ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>
Input2 gradient only, GPU (WRONG):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[[[-3. -2. -3. -3.]
[ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>
I am using version 2.0, built from source at commit fabcd145cd496628791f9f2ea813048360ac33ca. I have tried the same example on Linux (building from the same commit) and the results are good there. This issue only affects Windows.
@matteosal Thanks for the update. @leezu Do you have a Windows platform to help triage the problem?
I'm not a Windows user, so it's very hard for me to get MXNet running on Windows. @yajiedesign is a Windows expert; maybe he can help.
I've tested with a 2.0 version modified by myself on Windows, and it's OK.
Input + Target gradient, CPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @cpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}
Input + Target gradient, GPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}
Target gradient only, CPU (OK):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}
Target gradient only, GPU (WRONG):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}
@chinakook What did you modify? Is it related to this gradient issue? Could you share it with @matteosal?
A ping on this. @chinakook, what modification are you talking about? Can you reproduce the problem on a plain v2.0 build?
A ping on this. Can anyone please investigate?
@szha @leezu another ping on this :)
@matteosal What build settings should we use to reproduce this issue?
@barry-jin here they are:
cmake -G"Visual Studio 15 2017 Win64" -T host=x64 ^
%= GENERAL FLAGS =% ^
-DCMAKE_INSTALL_PREFIX=%output_dir% ^
-DCMAKE_BUILD_TYPE=Release ^
-DCMAKE_SKIP_BUILD_RPATH=On ^
-DUSE_OPENCV=OFF ^
-DUSE_F16C=Off %= float16 support =%^
-DUSE_INT64_TENSOR_SIZE=ON ^
-DCMAKE_C_FLAGS="-D_WIN32" ^
-DCMAKE_CXX_FLAGS="-D_WIN32" ^
-DCMAKE_C_FLAGS_RELEASE="/MT -DNDEBUG" ^
-DCMAKE_CXX_FLAGS_RELEASE="/MT -DNDEBUG" ^
-DMXNET_FORCE_SHARED_CRT=OFF %= link statically to C runtime =%^
-DCMAKE_SHARED_LINKER_FLAGS="/DELAYLOAD:nvcuda.dll delayimp.lib" ^
-DUSE_MXNET_LIB_NAMING=OFF ^
%= MATH BACKENDS =% ^
-DBLAS=MKL ^
-DUSE_LAPACK=OFF ^
-DUSE_ONEDNN=OFF ^
-DBLA_VENDOR="Intel10_64ilp" ^
-DBLA_STATIC=OFF ^
-DMKL_USE_SINGLE_DYNAMIC_LIBRARY=OFF ^
-DMKL_INCLUDE_DIR=%mkl_dir% ^
-DBLAS_LIBRARIES="%mkl_dir%/libiomp5md.lib;%mkl_dir%/mkl_core_dll.lib;%mkl_dir%/mkl_intel_ilp64_dll.lib;%mkl_dir%/mkl_intel_thread_dll.lib" ^
%= OPENMP =% ^
-DUSE_OPENMP=ON ^
-DOpenMP_C_FLAGS="-I%mkl_dir%" ^
-DOpenMP_C_LIB_NAMES="libiomp5" ^
-DOpenMP_CXX_FLAGS="-I%mkl_dir%" ^
-DOpenMP_CXX_LIB_NAMES="libiomp5" ^
-DOpenMP_libiomp5_LIBRARY="%mkl_dir%/libiomp5md.lib" ^
%= CUDA =% ^
-DUSE_CUDA=ON ^
-DUSE_CUDNN=ON ^
-DCUDNN_LIBRARY=%home_dir:\=/%cuDNN/lib/cudnn64_8.lib ^
-DCUDNN_INCLUDE=%home_dir:\=/%cuDNN/include ^
-DUSE_NCCL=OFF ^
-DUSE_NVML=OFF ^
-DCUDNN_ROOT=%home_dir:\=/%cuDNN ^
-DMXNET_CUDA_ARCH="3.7"\;"5.0"\;"6.0"\;"7.0"\;"8.0+PTX" %= see Readme =%^
-DCUDAToolkit_ROOT=%cuda_dir% ^
-DCMAKE_CUDA_COMPILER="%cuda_dir%/bin/nvcc.exe" -I"%cuda_dir%/include" -L"%cuda_dir%/lib/x64" ^
-DUSE_SPLIT_ARCH_DLL=OFF ^
%mxnet_dir%
MKL version is 2019.4 and CUDA version is 11.4.0
@barry-jin any news on this? I have rebuilt with VC2019 in order to fix this issue, but I still see this problem here.
Sorry, I'm still triaging this issue. I built with the settings in build_windows.py and can also reproduce this issue.
@matteosal A current workaround is to replace 'elemwise_sub' with '_npi_subtract'. There are probably some issues in the legacy subtract operator.
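(One way to apply this swap without touching the model-exporting code is to rewrite the op name in the saved symbol JSON before loading it. A minimal sketch, assuming the file uses the standard MXNet symbol JSON layout where each entry of the top-level "nodes" list carries an "op" field; the helper name is mine, and whether the surrounding graph accepts the npi op unchanged should be verified against your symbol:)

```python
import json

def patch_subtract_ops(graph):
    """Rewrite legacy 'elemwise_sub' nodes to the '_npi_subtract' op."""
    for node in graph.get("nodes", []):
        if node.get("op") == "elemwise_sub":
            node["op"] = "_npi_subtract"
    return graph

# Demo on an in-memory graph standing in for json.load(open('sym3.json')):
graph = {"nodes": [
    {"op": "null", "name": ".Inputs.Input1", "inputs": []},
    {"op": "elemwise_sub", "name": "sub0", "inputs": [[0, 0, 0]]},
]}
patched = patch_subtract_ops(graph)
print(patched["nodes"][1]["op"])  # _npi_subtract
```

The patched dict can then be dumped with json.dumps and fed back to mx.sym.load.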
@barry-jin thank you, I have verified that swapping the operator fixes the problem
sym.zip I only see this on Windows. Download the symbol file and run this script:
Output is:
The Target gradient has the sign flipped in the last example.