PyWorkflowApp / visual-programming

A Python Visual Programming Workspace for Data Science
MIT License
31 stars 12 forks source link

Refined Node execution/validation; flow variable substitution #57

Closed reelmatt closed 4 years ago

reelmatt commented 4 years ago

Node execution/validation

The main Node class now implements a validation method that checks all Parameters, calling their option.validate() methods. The calls to validate within pyworkflow/workflow.py have also been removed. This addresses the use case mentioned where a Node could potentially be invalid when added to the workspace. Now, Nodes are validated upon saving the configuration, or the 'node update' endpoint.

Also changed is the order of operations for Node execution. The main logic in the Workflow class has been simplified to:

# Load predecessor data and FlowNode values
preceding_data = self.load_input_data(node_to_execute.node_id)
flow_nodes = self.load_flow_nodes(node_to_execute.option_replace)

# Validate input data, and replace flow variables
node_to_execute.validate_input_data(len(preceding_data))
execution_options = node_to_execute.get_execution_options(flow_nodes)

# Pass in data to current Node to use in execution
output = node_to_execute.execute(preceding_data, execution_options)

The thinking behind moving input data validation and flow variable substitution into the Workflow class, is anyone developing a custom Node would not have to duplicate this work. Even for the built-in Nodes there was a lot of duplication where each Node validated input and parameters.

The validate_input_data and newly-named get_execution_options were moved from the NodeUtils class into the main Node class. This approach seems cleaner to me, but there may be use cases I'm not considering.

Flow variable substitution

There are currently no front-end options to create or pass these flow variables around, but this PR should handle most of the back-end functionality and updates the endpoint responses for the front-end to use. Currently, flow variables work by adding the following attributes to a Node (for example ReadCsv, written here in JSON:

"option_replace": {
    "sep": {
        "node_id": "2",
        "is_global": true
    }
}

On execution, the back-end would substitute the default_value stored in the global FlowNode with ID "2", for whatever the current value is for "sep" in the ReadCsvNode. Should the user want to revert "sep" back to whatever stored value was there, they would remove the flow variable (perhaps a checkbox, or empty selection from a dropdown?).

To pull this information in to the "Node Configuration" pop-up in the front-end, the action would need to call the GET /node/{node_id} endpoint which has been updated to return the Node's information as well as any flow variable options (all global variables, and any local variables connected as predecessors).

TODO:

This PR updates all Nodes to use the Parameter classes @reddigari introduced in #53. Only Read/WriteCsv and the JoinNode have been updated to use these Parameters during execution; the rest still pass in **self.options) which may cause issues.

There is no validation to ensure a FlowNode matches the Parameter type of a given option. E.g., there are no checks to prevent a StringNode value from replacing a BooleanParameter value.

Integration with front-end. Nested-forms might be the solution to provide the JSON data the back-end currently is using, but both back/front might need some changes as we integrate.

reddigari commented 4 years ago

Am I understanding correctly that local flow vars are just regular nodes in the graph, which are caught when parsing the predecessor input? Whereas global ones aren't connected to anything, but are available from within the node config form to override a particular parameter?

If you could give me a high-level list of what needs to happen on the front end for this to be supported, that would help a ton! If I am understanding correctly, this includes:

reelmatt commented 4 years ago

The list you have hits most of the high-level needs of the front-end. I made a few in-depth comments to elaborate on these (and the endpoints to hit) in issues #64 and #65.

In general, your understanding of how the flow variables work is correct. A Workflow contains:

The differentiator between the two is whether a Node has the is_global attribute set to true or false. My thought was the only way to include this flag is where the Node is defined. A separate area for displaying/adding global flow variables could help with that.

Global variables are defined in a separate Graph() to prevent connections to specific nodes in the "main" graph. By hitting the GET /node/<node_id> endpoint, all potential flow variables (both local and global) are retrieved and returned in a JSON response.