NVIDIA-Merlin / core

Core Utilities for NVIDIA Merlin
Apache License 2.0
19 stars 14 forks source link

add subgraph logic to post and pre order traversal #345

Closed jperez999 closed 1 year ago

jperez999 commented 1 year ago

This PR adds logic to post and pre order traversal methods to handle subgraphs. It flattens the subgraph while still including the actual subgraph operator in the nodes. This is important because it ensures that when we are constructing the schemas for all the operator the subgraph operator does not get skipped. Otherwise it will cause issues with downstream operators that will essentially get schemas from operators previous to the subgraph or if there are none it will get schema from root (i.e. dataset).

github-actions[bot] commented 1 year ago

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-345

oliverholworthy commented 1 year ago

Getting an error about two subgraphs with the same name in the tests now that postorder_iter_nodes returns nodes within subgraphs.

One option seems to be to remove this line in _find_subgraphs which is no longer required now that the postorder_iter_nodes returns all the nodes https://github.com/NVIDIA-Merlin/core/blob/c5facda7d330aa090bbb9e8ae3daf60db358a3a4/merlin/dag/graph.py#L237

oliverholworthy commented 1 year ago

One other thing I noticed while looking at this is related to the way iter_nodes is currently working. It behaves slightly differently to the post and pre order versions. Using the test test_subgraph_with_summed_subgraphs as an example.

len(list(iter_nodes([graph.output_node])))  
# => 101

len(list(postorder_iter_nodes([graph.output_node])))
# => 17

len(list(preorder_iter_nodes([graph.output_node])))
# => 17

Is that a bug in iter_nodes, or is there a reason it needs to return a different set and length of nodes compared to the others?

jperez999 commented 1 year ago

Ok, so thanks to this failing test, I went down quite a deep rabbit hole to find that some assumptions we were making were not correct. The new commits to this PR work to remedy those issues that were found:

  1. iter_nodes ended with a list that had WAY TOO MANY nodes in it. The reason for this was because we were doing LIFO instead of FIFO. In LIFO the first thing to get popped off the queue was the first thing in the graph. This led to a scenario where when you popped the next Node in the graph it would add its parent and dependencies (the first thing in the graph) to the queue again.
  2. Flattening out the nodes in a subgraph is not ALWAYS the correct behavior. For instance when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph). To handle this we added an optional keyword argument, flatten_subgraphs to the iteration function (iter_nodes, postorder_iter_nodes, preorder_iter_nodes) that allows you to designate a flattening when you need it. The kwarg is set to False by default to allow for backwards compatibility. Currently, this should only be necessary in merlin-systems, when dealing with node specific actions like loading, exporting and saving artifacts. It is used in the testing suite of nvtabular for a function that looks to retrieve categorify nodes to check the data within them. These changes will be reflected in merlin-systems and nvtabular, where we activate flatten_subgraphs.
  3. _combine_schemas in node was designed with the assumption that all leaf nodes in a graph would be selection operators. Selection operators have the unique characteristic of having the same input and output schemas. However, now that we have subgraphs, it is now possible to have a subgraph as a leaf node. Subgraphs do not share this characteristic so we must account for that. To do so, an optional kwarg was added to combine schemas, input_schemas, to allow the user to specify if they want to retrieve an input schema or an output schema.
oliverholworthy commented 1 year ago

2. when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph).

@jperez999 Can you explain more about this point. What part requires that we call the fit method of the Subgraph operator instead of the nodes contained within it? In other words would it work if we called fit on the subnodes but not the Subgraph (if Subgraph was not a StatOperator).

jperez999 commented 1 year ago

Left some style suggestions, but approving as is. You can take or leave the suggestions and merge when you're ready.

I will make the updates in a subsequent PR.

  1. when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph).

@jperez999 Can you explain more about this point. What part requires that we call the fit method of the Subgraph operator instead of the nodes contained within it? In other words would it work if we called fit on the subnodes but not the Subgraph (if Subgraph was not a StatOperator).

The Subgraph is a special type of operator, it is considered a graph. This means it should allow for all the capabilities/responsibilities of a graph. We are working toward a position where you could have certain subgraphs in different states, where subgraphs that have been fit already might get connected to other subgraphs that have not been fit. In this case, if you do not allow for subgraphs to fit then each node would have to carry that information. By allowing the subgraph to fit and transform instead of just being a placeholder/divider of nodes, it provides a way to keep that information about fitting and possibly other metadata in the future, at the graph (subgraph) level. We have not yet completely built that feature but we are taking incremental steps in this direction. If we just flattened everything out, we would lose the information about the subgraph as a whole.