samskalicky opened this issue 5 years ago
During graph partitioning, some nodes are added to subgraphs, thus potentially changing the order of the DFS traversal.
Previously, I thought graph partitioning with the default backend wouldn't change the input order. Can you explain more about the input order change? For example, which op in op_names causes this, or which kind of op connection causes this?
CC @reminisce as he may know the background here.
@mxnet-label-bot add [Bug]
@ZhennanQin I'm not sure why we think the input order should not change after partitioning. It makes sense to me that since we're changing the graph, it's possible the order may change. Maybe @reminisce can explain the rationale behind the assumption that it should not change. All I've been able to figure out is that the input order does change, hence the error message above about the shape mismatch.
@samskalicky The order should be preserved because that's how I implemented it. If it's changed in your case, there might be a bug. Can you draw a diagram to show which inputs' order is changed?
@reminisce can you explain how you implemented it in such a way that the order is preserved?
There is a failing example with sources in the description; I would really appreciate your help debugging this, given your expertise and familiarity with this code.
Here is an example showing the problem:
Assume the initial traversal order is:
A -> J
A -> J -> G
A -> J -> G -> D
A -> J -> G -> D -> E
A -> J -> G -> D -> F
A -> J -> G -> H
A -> J -> K
A -> B
A -> B -> C
And then assume that B, D, and G are included in the subgraph.
Then the post-partitioning traversal would be:
A -> J
A -> J -> S
A -> J -> S -> E
A -> J -> S -> F
A -> J -> S -> H
A -> J -> S -> C
A -> J -> K
Thus the order would change from: [E, F, H, K, C] to: [E, F, H, C, K]
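For illustration, here is a small, self-contained Python sketch (not MXNet code; the toy graph and DFS are purely illustrative) that reproduces the order change from the diagram above:

```python
# Toy DFS that collects graph inputs (leaf nodes) in visitation order,
# showing how merging B, D, G into a single subgraph node S changes the
# order in which the inputs are reached.

def collect_inputs(graph, root):
    """Return leaf nodes in DFS preorder, visiting each node once."""
    visited, inputs = set(), []

    def dfs(node):
        if node in visited:
            return
        visited.add(node)
        children = graph.get(node, [])
        if not children:          # no children -> graph input
            inputs.append(node)
        for child in children:
            dfs(child)

    dfs(root)
    return inputs

# Original graph (edges follow the traversal in the example above).
original = {
    "A": ["J", "B"],
    "J": ["G", "K"],
    "G": ["D", "H"],
    "D": ["E", "F"],
    "B": ["C"],
}

# After partitioning: B, D, G are collapsed into subgraph node S, whose
# inputs are the external inputs of the merged nodes.
partitioned = {
    "A": ["J", "S"],
    "J": ["S", "K"],
    "S": ["E", "F", "H", "C"],
}

print(collect_inputs(original, "A"))     # ['E', 'F', 'H', 'K', 'C']
print(collect_inputs(partitioned, "A"))  # ['E', 'F', 'H', 'C', 'K']
```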
@samskalicky Thanks for the diagram. This is indeed a case that breaks the order of inputs, and it's not handled in the implementation. We may need the solution in your PR to solve this.
So it's a common case across all backends that the input order may change after partitioning, and we should handle this in the overall partitioning flow rather than in a particular backend. @samskalicky Thanks for finding and fixing this.
I believe this issue is fixed in master, below is my result:
[09:17:25] src/executor/graph_executor.cc:1936: Subgraph backend default is activated.
[09:17:25] src/executor/graph_executor.cc:1735: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[
[[[ 0.]
[ 0.]
[ 0.]
...
[-1.]
[-1.]
[-1.]]]
<NDArray 1x80000x1 @cpu(0)>,
[[[ 0.05506234]
[ 0.05506234]
[ 0.05506234]
...
[-1. ]
[-1. ]
[-1. ]]]
<NDArray 1x80000x1 @cpu(0)>,
[[[ -9.554436 224. 236.60925 224. ]
[ -9.554436 224. 236.60925 224. ]
[ -9.554436 224. 236.60925 224. ]
...
[ -1. -1. -1. -1. ]
[ -1. -1. -1. -1. ]
[ -1. -1. -1. -1. ]]]
<NDArray 1x80000x4 @cpu(0)>]
The result is the same with os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default' commented out.
Description
The input names of a symbol are produced by a DFS traversal of the symbol's graph from the outputs back up to the inputs. During graph partitioning, some nodes are added to subgraphs, thus potentially changing the order of the DFS traversal. After graph partitioning, shape propagation occurs, and the inferred shapes for the inputs are returned in the order that they appear in a DFS traversal.
However, when graph partitioning happens and the DFS traversal order changes, the inferred shapes may be returned in a different order than expected. Since the original symbol is not modified, the caller is expecting the shapes in the same order as the original symbol.
Since DFS order is not guaranteed to be identical before and after partitioning, we need to map the names-to-shapes and ensure that the shapes are returned in the original order.
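As a rough sketch of that idea (the function and argument names below are hypothetical, not the actual executor internals), the inferred shapes can be keyed by argument name on the partitioned graph and then re-emitted in the original symbol's argument order:

```python
# Minimal sketch: key the inferred shapes by argument name on the
# partitioned graph, then return them in the argument order of the
# original, unpartitioned symbol that the caller holds.

def reorder_arg_shapes(orig_sym, partitioned_sym, partitioned_arg_shapes):
    # Shapes come back in the partitioned graph's DFS argument order.
    name_to_shape = dict(zip(partitioned_sym.list_arguments(),
                             partitioned_arg_shapes))
    # Return them in the order the caller expects: the original symbol's order.
    return [name_to_shape[name] for name in orig_sym.list_arguments()]
```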
Environment info (Required)
The error occurs on every release, and is reproducible on the master branch. I have built from source using the master branch and reproduced the problem.
Error Message:
Minimum reproducible example
This problem occurs on a few models; the one that I can share is the faster-rcnn model from the GluonCV package. Here is how to get the model:
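(The original snippet was not captured in this thread; the following is a sketch of one common way to obtain and export the model with GluonCV. The exact model name, epoch, and dummy input shape are assumptions.)

```python
import mxnet as mx
from gluoncv import model_zoo

# Download a pretrained Faster R-CNN model from the GluonCV model zoo.
net = model_zoo.get_model('faster_rcnn_resnet50_v1b_coco', pretrained=True)
net.hybridize()

# Run one forward pass so the cached graph exists, then export
# faster_rcnn-symbol.json / faster_rcnn-0000.params.
_ = net(mx.nd.zeros((1, 3, 600, 800)))
net.export('faster_rcnn', epoch=0)
```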
Once the model is exported, here is the code to reproduce the error using CPU context:
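(Again, the original repro code is not captured here; the sketch below shows the general pattern, assuming the exported prefix is faster_rcnn, the input name is data, and the input shape is 1x3x600x800.)

```python
import os
# Must be set before binding, since partitioning runs when the executor is created.
os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default'

import mxnet as mx

# Load the exported symbol and parameters.
sym, arg_params, aux_params = mx.model.load_checkpoint('faster_rcnn', 0)
mod = mx.mod.Module(symbol=sym, data_names=['data'], label_names=None,
                    context=mx.cpu())

# The shape-mismatch error in the original report surfaces around bind,
# because the inferred shapes come back in the partitioned graph's DFS order.
mod.bind(data_shapes=[('data', (1, 3, 600, 800))], for_training=False)
mod.set_params(arg_params, aux_params)

mod.forward(mx.io.DataBatch(data=[mx.nd.zeros((1, 3, 600, 800))]))
print(mod.get_outputs())
```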
What have you tried to solve it?
I've tested a fix in a private branch: https://github.com/samskalicky/incubator-mxnet/commit/517d29498059d081873d1bd160d95479a5c8cea9