apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

test_subgraph_exe1 fails on windows #19915

Open leezu opened 3 years ago

leezu commented 3 years ago

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-cpu/detail/PR-19908/2/pipeline

leezu commented 3 years ago

The first time I see a related error on master branch windows-cpu is https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-cpu/detail/master/2455/pipeline of https://github.com/apache/incubator-mxnet/commit/e164ceeb2c4b5fb8cacdac1f0cced683a80b70b0

[2021-02-16T21:06:05.273Z] _______________ test_subgraph_exe4[sym14-op_names14-default_v2] _______________
[2021-02-16T21:06:05.273Z] [gw0] win32 -- Python 3.7.0 C:\Python37\python.exe
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z] sym = <Symbol convolution38>, subgraph_backend = 'default_v2'
[2021-02-16T21:06:05.273Z] op_names = ['sin', 'Convolution']
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z]     @pytest.mark.parametrize('subgraph_backend', ['default', 'default_v2'])
[2021-02-16T21:06:05.273Z]     @pytest.mark.parametrize('sym,op_names', get_graphs())
[2021-02-16T21:06:05.273Z]     def test_subgraph_exe4(sym, subgraph_backend, op_names):
[2021-02-16T21:06:05.273Z]         """Use env var MXNET_SUBGRAPH_BACKEND=default to trigger graph partitioning in bind
[2021-02-16T21:06:05.273Z]         and compare results of the partitioned sym and the original sym."""
[2021-02-16T21:06:05.273Z]         def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None):
[2021-02-16T21:06:05.273Z]             arg_shapes, _, aux_shapes = sym.infer_shape()
[2021-02-16T21:06:05.273Z]             if subgraph_backend is None:
[2021-02-16T21:06:05.273Z]                 arg_array = [mx.nd.random.uniform(shape=shape) for shape in arg_shapes]
[2021-02-16T21:06:05.273Z]                 aux_array = [mx.nd.random.uniform(shape=shape) for shape in aux_shapes]
[2021-02-16T21:06:05.273Z]             else:
[2021-02-16T21:06:05.273Z]                 arg_array = None
[2021-02-16T21:06:05.273Z]                 aux_array = None
[2021-02-16T21:06:05.273Z]             exe = sym._bind(ctx=mx.current_context(),
[2021-02-16T21:06:05.273Z]                            args=arg_array if subgraph_backend is None else original_exec.arg_arrays,
[2021-02-16T21:06:05.273Z]                            aux_states=aux_array if subgraph_backend is None else original_exec.aux_arrays,
[2021-02-16T21:06:05.273Z]                            grad_req='null')
[2021-02-16T21:06:05.273Z]             exe.forward()
[2021-02-16T21:06:05.273Z]             return exe
[2021-02-16T21:06:05.273Z]     
[2021-02-16T21:06:05.273Z]         sym, _, _ = sym
[2021-02-16T21:06:05.273Z] >       original_exec = get_executor(sym)
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z] tests\python\unittest\test_subgraph_op.py:237: 
[2021-02-16T21:06:05.273Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2021-02-16T21:06:05.273Z] tests\python\unittest\test_subgraph_op.py:222: in get_executor
[2021-02-16T21:06:05.273Z]     arg_shapes, _, aux_shapes = sym.infer_shape()
[2021-02-16T21:06:05.273Z] windows_package\python\mxnet\symbol\symbol.py:1132: in infer_shape
[2021-02-16T21:06:05.273Z]     res = self._infer_shape_impl(False, *args, **kwargs)
[2021-02-16T21:06:05.273Z] windows_package\python\mxnet\symbol\symbol.py:1267: in _infer_shape_impl
[2021-02-16T21:06:05.273Z]     ctypes.byref(complete)))
[2021-02-16T21:06:05.273Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z] ret = -1
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z]     def check_call(ret):
[2021-02-16T21:06:05.273Z]         """Check the return value of C API call.
[2021-02-16T21:06:05.273Z]     
[2021-02-16T21:06:05.273Z]         This function will raise an exception when an error occurs.
[2021-02-16T21:06:05.273Z]         Wrap every API call with this function.
[2021-02-16T21:06:05.273Z]     
[2021-02-16T21:06:05.273Z]         Parameters
[2021-02-16T21:06:05.273Z]         ----------
[2021-02-16T21:06:05.273Z]         ret : int
[2021-02-16T21:06:05.273Z]             return value from API calls.
[2021-02-16T21:06:05.273Z]         """
[2021-02-16T21:06:05.273Z]         if ret != 0:
[2021-02-16T21:06:05.273Z] >           raise get_last_ffi_error()
[2021-02-16T21:06:05.273Z] E           mxnet.base.MXNetError: MXNetError: Error in operator convolution38: Shape inconsistent, Provided = [1,0,2,2], inferred shape=(1,3,2,2)
mseth10 commented 3 years ago

The error occurs for the network

    data1 = mx.sym.Variable('data1', shape=(3, 3, 10, 10), dtype=np.float32)
    data2 = mx.sym.Variable('data2', shape=(1, 0, 2, 2))
    data3 = mx.sym.sin(data2)
    conv = mx.sym.Convolution(data=data1, weight=data3, kernel=(2, 2), num_filter=1)
    return (conv, ['data1'], [(3, 3, 10, 10)])

when it is bound with simple_bind, during infer_shape, and the failure is flaky.

@samskalicky Do you think we can change the shape of data2 from (1,0,2,2) to (1,3,2,2)? Or is it intended to be inferred during shape inference?
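(For context on why the inferred shape is (1,3,2,2): a Convolution weight has layout (num_filter, in_channels, kh, kw), and in_channels comes from data1's shape (3, 3, 10, 10). A pure-Python sketch, independent of MXNet; the helper name is illustrative:)

```python
def conv_weight_shape(data_shape, kernel, num_filter):
    """Expected Convolution weight layout: (num_filter, in_channels, kh, kw).

    in_channels is taken from the data tensor's NCHW shape; stride/pad
    do not affect the weight shape.
    """
    n, in_channels, h, w = data_shape
    return (num_filter, in_channels) + tuple(kernel)

# data1 has shape (3, 3, 10, 10), kernel=(2, 2), num_filter=1,
# so the weight (data2 via sin) should infer to (1, 3, 2, 2).
print(conv_weight_shape((3, 3, 10, 10), (2, 2), 1))  # (1, 3, 2, 2)
```

So the 0 in data2's declared shape (1, 0, 2, 2) only makes sense if 0 means "infer this dimension", i.e. legacy shape semantics.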

samskalicky commented 3 years ago

No idea. If it's flaky, then it's working (sometimes), and we should figure out why it fails. Just changing the inputs is not a good way to "fix" this, but it might be a good place to start debugging, to see if that makes the problem go away consistently. That shouldn't be the final resolution, though; it would just hide the problem.

leezu commented 3 years ago

This essentially blocks the master CI. I marked more subgraph tests to be disabled on Windows in https://github.com/apache/incubator-mxnet/pull/19908

samskalicky commented 3 years ago

So these tests pass on Linux but are flaky on Windows? Is that the current state of things?

leezu commented 3 years ago

Yes. Maybe there was a change to the Windows CI infrastructure that triggered this. I'm not sure.

mseth10 commented 3 years ago

Are we still seeing this error? @leezu

leezu commented 3 years ago

The test is currently disabled on Windows:

https://github.com/apache/incubator-mxnet/blob/5722f8b38af58c5a296e46ca695bfaf7cff85040/tests/python/unittest/test_subgraph_op.py#L126-L127

If you think it has been fixed, let's re-enable it :)

DickJC123 commented 3 years ago

I recently set up master with an internal build/CI system and see the reported failure on Linux, but so far only on the CI machines when running the full test suite. The test_subgraph_exe* tests pass when run individually on a non-CI machine. The failure I'm seeing matches the reported one:

Shape inconsistent, Provided = [1,0,2,2], inferred shape=(1,3,2,2)

This error text comes from the macro SHAPE_ASSIGN_CHECK, which calls shape_assign(): https://github.com/apache/incubator-mxnet/blob/master/src/operator/operator_common.h#L157-L181

My confusion is in the interpretation of the shape [1,0,2,2]. It seems the test author wanted the C-dimension of this input weight tensor shape to be inferred. However, shape_assign() seems to be applying the 'np_shape' view of the shape, where a 0 represents a known size of 0, generally reserved for a scalar (and so incompatible with [1,3,2,2]). I wonder if a 'use_np_shape' mode is somehow being applied non-deterministically to this test. Thoughts, anyone?
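To illustrate the distinction (a minimal sketch, not MXNet's actual shape_assign implementation): under legacy shape semantics a 0 dimension is a wildcard to be filled in, whereas under np_shape semantics 0 is a concrete size-0 dimension (and -1 marks an unknown), so [1,0,2,2] can no longer unify with (1,3,2,2):

```python
def dim_assign(provided, inferred, np_shape):
    """Unify two dimension sizes; return the merged size, or None on conflict.

    Legacy semantics: 0 means 'unknown'. np_shape semantics: 0 is a real
    size-0 dimension and -1 means 'unknown'.
    """
    unknown = -1 if np_shape else 0
    if provided == unknown:
        return inferred
    if inferred == unknown or provided == inferred:
        return provided
    return None  # both known, and they disagree

def shape_assign(provided, inferred, np_shape=False):
    """Unify two shapes dimension-by-dimension, raising on any conflict."""
    merged = [dim_assign(p, i, np_shape) for p, i in zip(provided, inferred)]
    if None in merged:
        raise ValueError("Shape inconsistent, Provided = %s, inferred shape=%s"
                         % (list(provided), tuple(inferred)))
    return tuple(merged)

# Legacy view: the 0 is a wildcard, so unification succeeds.
print(shape_assign((1, 0, 2, 2), (1, 3, 2, 2), np_shape=False))  # (1, 3, 2, 2)

# np_shape view: 0 is a concrete size, so the same call reproduces the error.
try:
    shape_assign((1, 0, 2, 2), (1, 3, 2, 2), np_shape=True)
except ValueError as e:
    print(e)
```

If some test run in the same worker flips the global np_shape mode and it leaks across tests, that would explain the non-deterministic failures.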