apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Wrong gradients on Windows-GPU #20471

Open matteosal opened 3 years ago

matteosal commented 3 years ago

I only see this on Windows. Download the symbol file (sym.zip) and run this script:

import mxnet as mx

json_path = 'sym.json'
sym = mx.sym.load(json_path)

def run_example(ctx, reqs):
    ex = sym._bind(
        ctx,
        {
            '.Inputs.Input': mx.ndarray.array([[1, 2, 3]], ctx=ctx),
            '.Inputs.Target': mx.ndarray.array([[4, 5, 6]], ctx=ctx),
            'seq_715248120': mx.ndarray.array([3], ctx=ctx)
        },
        args_grad={
            '.Inputs.Input': mx.ndarray.zeros([1, 3], ctx=ctx),
            '.Inputs.Target': mx.ndarray.zeros([1, 3], ctx=ctx),
            'seq_715248120': mx.ndarray.zeros([1], ctx=ctx)
        },
        grad_req=dict(zip(['.Inputs.Input', '.Inputs.Target', 'seq_715248120'], reqs))
    )

    ex.forward()
    ex.backward(out_grads=[mx.ndarray.array([1], ctx=ctx), mx.ndarray.array([1], ctx=ctx)])

    print(ex.grad_dict)

print('Input + Target gradient, CPU (OK):')
run_example(mx.cpu(), ['write', 'write', 'null'])
print('\n')
print('Input + Target gradient, GPU (OK):')
run_example(mx.gpu(), ['write', 'write', 'null'])
print('\n')
print('Target gradient only, CPU (OK):')
run_example(mx.cpu(), ['null', 'write', 'null'])
print('\n')
print('Target gradient only, GPU (WRONG):')
run_example(mx.gpu(), ['null', 'write', 'null'])

Output is:

Input + Target gradient, CPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @cpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}

Input + Target gradient, GPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

Target gradient only, CPU (OK):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}

Target gradient only, GPU (WRONG):
{'.Inputs.Input': None, '.Inputs.Target':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

The Target gradient has the sign flipped in the last example.
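As a sanity check on the expected signs: the printed values are consistent with a loss of the form mean(Target - Input) (an assumption; the actual graph ships only in sym.json), whose analytic gradients are -1/3 per element w.r.t. Input and +1/3 w.r.t. Target. A pure-numpy central-difference check of the Target gradient confirms the positive sign the CPU run reports:

```python
import numpy as np

# Hypothetical reconstruction of the loss: the printed gradients
# (-1/3 for Input, +1/3 for Target) match loss = mean(target - input).
def loss(x, t):
    return np.mean(t - x)

x = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])

# Central finite differences w.r.t. each element of Target.
eps = 1e-6
grad_t = np.zeros_like(t)
for i in range(t.size):
    dt = np.zeros_like(t)
    dt[i] = eps
    grad_t[i] = (loss(x, t + dt) - loss(x, t - dt)) / (2 * eps)

print(grad_t)  # each entry ~ +1/3, matching the correct CPU output
```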

matteosal commented 3 years ago

I see the same sign flip with this other symbol, which can be fed to the same script above: sym2.zip

And with this one, sym3.zip, which goes with this script:

import numpy as np
import mxnet as mx

json_path = 'sym3.json'
sym = mx.sym.load(json_path)

input_1 = np.random.rand(1, 2, 3, 4).tolist()
input_2 = np.random.rand(1, 2, 4).tolist()
input_3 = np.random.rand(1, 2).tolist()

def run_example(ctx, reqs):
    ex = sym._bind(
        ctx,
        {
            '.Inputs.Input1': mx.ndarray.array(input_1, ctx=ctx),
            '.Inputs.Input2': mx.ndarray.array(input_2, ctx=ctx),
            '.Inputs.Input3': mx.ndarray.array(input_3, ctx=ctx)
        },
        args_grad={
            '.Inputs.Input1': mx.ndarray.zeros([1, 2, 3, 4], ctx=ctx),
            '.Inputs.Input2': mx.ndarray.zeros([1, 2, 4], ctx=ctx),
            '.Inputs.Input3': mx.ndarray.zeros([1, 2], ctx=ctx)
        },
        grad_req=dict(zip(['.Inputs.Input1', '.Inputs.Input2', '.Inputs.Input3'], reqs))
    )

    ex.forward()
    ex.backward(out_grads=[mx.ndarray.ones([1, 2, 3, 4], ctx=ctx)])

    print(ex.grad_dict['.Inputs.Input2'])

print('Input1 + Input2 gradient, CPU (OK):')
run_example(mx.cpu(), ['write', 'write', 'null'])
print('\n')
print('Input1 + Input2 gradient, GPU (OK):')
run_example(mx.gpu(), ['write', 'write', 'null'])
print('\n')
print('Input2 gradient only, CPU (OK):')
run_example(mx.cpu(), ['null', 'write', 'null'])
print('\n')
print('Input2 gradient only, GPU (WRONG):')
run_example(mx.gpu(), ['null', 'write', 'null'])

Output is:

Input1 + Input2 gradient, CPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>

Input1 + Input2 gradient, GPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @gpu(0)>

Input2 gradient only, CPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>

Input2 gradient only, GPU (WRONG):

[[[3. 2. 3. 2.]
  [0. 2. 2. 3.]]]
<NDArray 1x2x4 @gpu(0)>

TristonC commented 3 years ago

@matteosal Which version of MXNet did you use?

TristonC commented 3 years ago

With your sym3 example, here is what I got with MXNet 1.9 on Linux. I'm not sure whether this issue only occurs on Windows. @matteosal, did you try it on Linux?


Input1 + Input2 gradient, CPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:37:54] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>

Input1 + Input2 gradient, GPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:38:01] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>

Input2 gradient only, CPU (OK):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>

Input2 gradient only, GPU (WRONG):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>

matteosal commented 3 years ago

I am using version 2.0, built from source at commit fabcd145cd496628791f9f2ea813048360ac33ca. I have tried the same example on Linux (building from the same commit) and the results are correct there. This issue only affects Windows.

TristonC commented 3 years ago

@matteosal Thanks for the update. @leezu, do you have a Windows platform to help triage the problem?

leezu commented 3 years ago

I'm not a Windows user, so it's very hard for me to get MXNet running on Windows. @yajiedesign is a Windows expert; maybe he can help.

chinakook commented 3 years ago

I've tested with a 2.0 version I modified myself on Windows, and it's OK:

Input + Target gradient, CPU (OK):
{'.Inputs.Input': 
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @cpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}

Input + Target gradient, GPU (OK):
{'.Inputs.Input': 
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

Target gradient only, CPU (OK):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}

Target gradient only, GPU (WRONG):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

TristonC commented 3 years ago

@chinakook What did you modify? Is it related to this gradient issue? Could you share it with @matteosal?

matteosal commented 3 years ago

A ping on this. @chinakook, what modification are you talking about? Can you reproduce the problem on a plain v2.0 build?

matteosal commented 2 years ago

A ping on this. Can anyone please investigate?

matteosal commented 2 years ago

@szha @leezu another ping on this :)

barry-jin commented 2 years ago

@matteosal What build settings should we use to reproduce this issue?

matteosal commented 2 years ago

@barry-jin here they are:

cmake -G"Visual Studio 15 2017 Win64" -T host=x64 ^
 %= GENERAL FLAGS =% ^
 -DCMAKE_INSTALL_PREFIX=%output_dir% ^
 -DCMAKE_BUILD_TYPE=Release ^
 -DCMAKE_SKIP_BUILD_RPATH=On ^
 -DUSE_OPENCV=OFF ^
 -DUSE_F16C=Off %= float16 support =%^
 -DUSE_INT64_TENSOR_SIZE=ON ^
 -DCMAKE_C_FLAGS="-D_WIN32" ^
 -DCMAKE_CXX_FLAGS="-D_WIN32" ^
 -DCMAKE_C_FLAGS_RELEASE="/MT -DNDEBUG" ^
 -DCMAKE_CXX_FLAGS_RELEASE="/MT -DNDEBUG" ^
 -DMXNET_FORCE_SHARED_CRT=OFF %= link statically to C runtime =%^
 -DCMAKE_SHARED_LINKER_FLAGS="/DELAYLOAD:nvcuda.dll delayimp.lib" ^
 -DUSE_MXNET_LIB_NAMING=OFF ^
 %= MATH BACKENDS =% ^
 -DBLAS=MKL ^
 -DUSE_LAPACK=OFF ^
 -DUSE_ONEDNN=OFF ^
 -DBLA_VENDOR="Intel10_64ilp" ^
 -DBLA_STATIC=OFF ^
 -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=OFF ^
 -DMKL_INCLUDE_DIR=%mkl_dir% ^
 -DBLAS_LIBRARIES="%mkl_dir%/libiomp5md.lib;%mkl_dir%/mkl_core_dll.lib;%mkl_dir%/mkl_intel_ilp64_dll.lib;%mkl_dir%/mkl_intel_thread_dll.lib" ^
 %= OPENMP =% ^
 -DUSE_OPENMP=ON ^
 -DOpenMP_C_FLAGS="-I%mkl_dir%" ^
 -DOpenMP_C_LIB_NAMES="libiomp5" ^
 -DOpenMP_CXX_FLAGS="-I%mkl_dir%" ^
 -DOpenMP_CXX_LIB_NAMES="libiomp5" ^
 -DOpenMP_libiomp5_LIBRARY="%mkl_dir%/libiomp5md.lib" ^
 %= CUDA =% ^
 -DUSE_CUDA=ON ^
 -DUSE_CUDNN=ON ^
 -DCUDNN_LIBRARY=%home_dir:\=/%cuDNN/lib/cudnn64_8.lib ^
 -DCUDNN_INCLUDE=%home_dir:\=/%cuDNN/include ^
 -DUSE_NCCL=OFF ^
 -DUSE_NVML=OFF ^
 -DCUDNN_ROOT=%home_dir:\=/%cuDNN ^
 -DMXNET_CUDA_ARCH="3.7"\;"5.0"\;"6.0"\;"7.0"\;"8.0+PTX" %= see Readme =%^
 -DCUDAToolkit_ROOT=%cuda_dir% ^
 -DCMAKE_CUDA_COMPILER="%cuda_dir%/bin/nvcc.exe" -I"%cuda_dir%/include" -L"%cuda_dir%/lib/x64"  ^
 -DUSE_SPLIT_ARCH_DLL=OFF ^
 %mxnet_dir%

The MKL version is 2019.4 and the CUDA version is 11.4.0.

matteosal commented 2 years ago

@barry-jin, any news on this? I have rebuilt with VC2019, but I still see this problem.

barry-jin commented 2 years ago

Sorry, I'm still triaging this issue. I built with the settings in build_windows.py and can also reproduce it.

barry-jin commented 2 years ago

@matteosal The current workaround is to replace 'elemwise_sub' with '_npi_subtract'. There are probably some issues in the legacy subtract operator.
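If regenerating the network isn't convenient, the same swap can be applied to the saved symbol file before loading it. A sketch, assuming the standard MXNet symbol JSON layout (a top-level "nodes" list whose entries carry an "op" field); the toy graph below is a stand-in for the real sym.json:

```python
import json

def patch_symbol(sym_json_text):
    """Replace the legacy elemwise_sub op with _npi_subtract in a
    serialized MXNet symbol (assumes the usual 'nodes'/'op' layout)."""
    graph = json.loads(sym_json_text)
    for node in graph.get("nodes", []):
        if node.get("op") == "elemwise_sub":
            node["op"] = "_npi_subtract"
    return json.dumps(graph)

# Toy stand-in for sym.json, just to show the transformation:
toy = json.dumps({"nodes": [{"op": "null", "name": "a"},
                            {"op": "elemwise_sub", "name": "sub0"}]})
patched = json.loads(patch_symbol(toy))
print(patched["nodes"][1]["op"])  # _npi_subtract
```

The patched JSON string can then be written back to disk and loaded with mx.sym.load as in the original script.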

matteosal commented 2 years ago

@barry-jin, thank you. I have verified that swapping the operator fixes the problem.
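For verifying fixes like this one, comparing CPU and GPU gradients programmatically is less error-prone than eyeballing the printed arrays. A small pure-numpy helper; the dict-of-arrays shape mirrors ex.grad_dict after converting each entry with .asnumpy(), with None marking args bound with grad_req='null' (an illustrative sketch, not part of the MXNet API):

```python
import numpy as np

def compare_grads(cpu_grads, gpu_grads, atol=1e-6):
    """Return the names of args whose CPU and GPU gradients disagree.

    Both arguments map arg names to numpy arrays, with None for args
    bound with grad_req='null'.
    """
    mismatched = []
    for name, cpu in cpu_grads.items():
        gpu = gpu_grads[name]
        if (cpu is None) != (gpu is None):
            mismatched.append(name)
            continue
        if cpu is None:
            continue
        if not np.allclose(cpu, gpu, atol=atol):
            mismatched.append(name)
    return mismatched

# The values reported in this issue: the Target gradient comes back
# sign-flipped on GPU in the grad_req='null' case.
cpu = {".Inputs.Input": None, ".Inputs.Target": np.full((1, 3), 1 / 3)}
gpu = {".Inputs.Input": None, ".Inputs.Target": np.full((1, 3), -1 / 3)}
print(compare_grads(cpu, gpu))  # ['.Inputs.Target']
```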