Closed jperez999 closed 1 year ago
Getting an error about two subgraphs with the same name in the tests now that `postorder_iter_nodes` returns nodes within subgraphs.
One option seems to be to remove this line in `_find_subgraphs`, which is no longer required now that `postorder_iter_nodes` returns all the nodes:
https://github.com/NVIDIA-Merlin/core/blob/c5facda7d330aa090bbb9e8ae3daf60db358a3a4/merlin/dag/graph.py#L237
One other thing I noticed while looking at this relates to the way `iter_nodes` currently works. It behaves slightly differently from the post- and pre-order versions. Using the test `test_subgraph_with_summed_subgraphs` as an example:
`iter_nodes` doesn't return the Subgraph nodes (while the post- and pre-order versions do):
set(postorder_iter_nodes([graph.output_node])).difference(set(iter_nodes([graph.output_node])))
# => {<Node Subgraph>, <Node Subgraph>, <Node Subgraph>, <Node Subgraph>, <Node Subgraph>, <Node Subgraph output>}
`iter_nodes` returns duplicate nodes (while the post- and pre-order versions do not):
len(list(iter_nodes([graph.output_node])))
# => 101
len(list(postorder_iter_nodes([graph.output_node])))
# => 17
len(list(preorder_iter_nodes([graph.output_node])))
# => 17
Is that a bug in `iter_nodes`, or is there a reason it needs to return a different set and length of nodes compared to the others?
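To make the duplicate-count question concrete, here is a minimal sketch of how a traversal that does not track visited nodes yields a node once per path that reaches it, while a traversal with a visited set yields it once. The `Node` class and both functions are hypothetical stand-ins written for this illustration; they are not the Merlin implementations, and this is only one plausible cause of the difference.

```python
class Node:
    """Hypothetical stand-in for a DAG node (not the Merlin class)."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)


def naive_iter(nodes):
    """Walk parents without a visited set: shared ancestors repeat."""
    queue = list(nodes)
    while queue:
        current = queue.pop()
        yield current
        queue.extend(current.parents)


def dedup_postorder(nodes, visited=None):
    """Post-order walk that yields each node at most once."""
    visited = set() if visited is None else visited
    for node in nodes:
        if id(node) in visited:
            continue
        visited.add(id(node))
        yield from dedup_postorder(node.parents, visited)
        yield node


# Diamond-shaped graph: root feeds left and right, which both feed output.
root = Node("root")
left = Node("left", [root])
right = Node("right", [root])
output = Node("output", [left, right])

naive_names = [n.name for n in naive_iter([output])]
post_names = [n.name for n in dedup_postorder([output])]

print(naive_names)  # "root" appears twice, once per path
print(post_names)   # each node appears once
```

With deeper graphs the number of paths grows quickly, which would be consistent with a large gap such as 101 versus 17.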
OK, so thanks to this failing test, I went down quite a deep rabbit hole and found that some assumptions we were making were not correct. The new commits to this PR remedy the issues that were found:
1. Added a kwarg, `flatten_subgraphs`, to the iteration functions (`iter_nodes`, `postorder_iter_nodes`, `preorder_iter_nodes`) that lets you request flattening when you need it. The kwarg is set to False by default for backwards compatibility. Currently, this should only be necessary in merlin-systems, when dealing with node-specific actions like loading, exporting and saving artifacts. It is also used in the nvtabular test suite, in a function that retrieves categorify nodes to check the data within them. These changes will be reflected in merlin-systems and nvtabular, where we activate `flatten_subgraphs`. Also added `input_schemas`, to allow the user to specify whether they want to retrieve an input schema or an output schema.
2. When you are fitting a graph, you do not want to create a list that has both the subgraph node and the nodes contained in it. That would create a scenario where you fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit the subgraph and allow it to handle fitting the nodes within it.
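As a rough illustration of the opt-in flattening described above, the sketch below shows a post-order traversal with a `flatten_subgraphs` flag that defaults to False. The `Node` and `SubgraphNode` classes, the `inner_output` attribute, and the `postorder` function are all assumptions made for this example; the real Merlin signatures and internals may differ.

```python
class Node:
    """Hypothetical stand-in for a DAG node (not the Merlin class)."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)


class SubgraphNode(Node):
    """Hypothetical node that wraps an inner graph via its output node."""

    def __init__(self, name, inner_output, parents=()):
        super().__init__(name, parents)
        self.inner_output = inner_output


def postorder(nodes, flatten_subgraphs=False, _visited=None):
    """Post-order walk; optionally also yields nodes inside subgraphs."""
    visited = set() if _visited is None else _visited
    for node in nodes:
        if id(node) in visited:
            continue
        visited.add(id(node))
        yield from postorder(node.parents, flatten_subgraphs, visited)
        if flatten_subgraphs and isinstance(node, SubgraphNode):
            # Emit the inner nodes first, then the subgraph node itself.
            yield from postorder([node.inner_output], flatten_subgraphs, visited)
        yield node


inner_a = Node("inner_a")
inner_b = Node("inner_b", [inner_a])
source = Node("source")
subgraph = SubgraphNode("subgraph", inner_b, [source])
output = Node("output", [subgraph])

default_names = [n.name for n in postorder([output])]
flat_names = [n.name for n in postorder([output], flatten_subgraphs=True)]

print(default_names)  # subgraph treated as a single node
print(flat_names)     # inner nodes included alongside the subgraph node
```

Keeping the default at False means callers that fit a graph never see the inner nodes, while callers that need per-node actions (such as saving artifacts) can ask for them explicitly.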
@jperez999 Can you explain more about this point? What part requires that we call the fit method of the Subgraph operator instead of the nodes contained within it? In other words, would it work if we called fit on the subnodes but not the Subgraph (if Subgraph was not a StatOperator)?
Left some style suggestions, but approving as is. You can take or leave the suggestions and merge when you're ready.
I will make the updates in a subsequent PR.
> when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph).
>
> @jperez999 Can you explain more about this point. What part requires that we call the fit method of the Subgraph operator instead of the nodes contained within it? In other words would it work if we called fit on the subnodes but not the Subgraph (if Subgraph was not a StatOperator).
The Subgraph is a special type of operator: it is itself considered a graph, so it should have all the capabilities and responsibilities of a graph. We are working toward a position where subgraphs can be in different states, so that subgraphs that have already been fit might get connected to other subgraphs that have not. In that case, if subgraphs cannot be fit, each node would have to carry that information itself. Allowing the subgraph to fit and transform, instead of just being a placeholder or divider of nodes, gives us a way to keep the fit state, and possibly other metadata in the future, at the graph (subgraph) level. We have not yet completely built that feature, but we are taking incremental steps in this direction. If we just flattened everything out, we would lose the information about the subgraph as a whole.
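A small sketch of that idea, with fit state kept at the subgraph level, might look like the following. `StatOp`, `Subgraph`, the `is_fit` flag, and `fit_graph` are hypothetical names for this illustration and are not the Merlin classes; the point is only to show why a flattened list would double-fit the inner operators.

```python
class StatOp:
    """Hypothetical operator that must be fit before it can transform."""

    def __init__(self, name):
        self.name = name
        self.fit_count = 0

    def fit(self, data):
        self.fit_count += 1


class Subgraph:
    """Hypothetical subgraph that owns fitting of its inner operators."""

    def __init__(self, name, ops):
        self.name = name
        self.ops = list(ops)
        self.is_fit = False  # fit state lives on the subgraph, not each node

    def fit(self, data):
        if self.is_fit:
            return  # an already-fit subgraph can be reused as-is
        for op in self.ops:
            op.fit(data)
        self.is_fit = True


def fit_graph(ops, data):
    """Fit every entry in the list it is given, in order."""
    for op in ops:
        op.fit(data)


data = [1, 2, 3]

# Fit only the subgraph: inner operators are fit exactly once.
inner = [StatOp("a"), StatOp("b")]
sub = Subgraph("sub", inner)
fit_graph([sub], data)
fit_graph([sub], data)  # second call is a no-op because sub.is_fit is True
counts_subgraph_only = [op.fit_count for op in inner]

# Fit a flattened list: inner operators are fit directly and via the subgraph.
inner2 = [StatOp("a"), StatOp("b")]
sub2 = Subgraph("sub2", inner2)
fit_graph(inner2 + [sub2], data)
counts_flattened = [op.fit_count for op in inner2]

print(counts_subgraph_only, counts_flattened)
```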
This PR adds logic to the post- and pre-order traversal methods to handle subgraphs. It flattens the subgraph while still including the actual subgraph operator in the nodes. This is important because it ensures that the subgraph operator does not get skipped when we construct the schemas for all the operators. Otherwise, downstream operators would get their schemas from the operators before the subgraph or, if there are none, from the root (i.e. the dataset).
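The schema issue can be sketched with a toy linear pipeline in which schemas are plain lists of column names. Everything here (`compute_input_schemas`, the step names, and the column names) is invented for illustration and is not how Merlin represents schemas; it only shows what a downstream step receives when the subgraph step is or is not part of the traversal.

```python
def compute_input_schemas(ordered_steps, root_schema):
    """Return the input schema each step sees in a simple linear chain.

    ordered_steps is a list of (name, transform) pairs, where transform maps
    a list of column names to the step's output column names.
    """
    inputs = {}
    current = list(root_schema)
    for name, transform in ordered_steps:
        inputs[name] = list(current)
        current = transform(current)
    return inputs


def encode(cols):
    """Hypothetical subgraph step: renames every column."""
    return [c + "_encoded" for c in cols]


def score(cols):
    """Hypothetical downstream step: appends one column."""
    return cols + ["score"]


root_schema = ["user_id", "item_id"]

# Subgraph operator included in the traversal.
with_subgraph = compute_input_schemas(
    [("subgraph", encode), ("downstream", score)], root_schema
)

# Subgraph operator skipped: downstream falls back to the root schema.
without_subgraph = compute_input_schemas([("downstream", score)], root_schema)

print(with_subgraph["downstream"])
print(without_subgraph["downstream"])
```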