Closed AlexanderGeiger closed 5 years ago
@AlexanderGeiger
I like the way the intermediate outputs are dumped like this.
One question about adding a general output field to the JSON: what is the difference between the "general output field" and the "intermediate output"? They seem to serve the same purpose. In that regard, why not just rename "intermediate_output" to "output"?
I like the approach of specifying the outputs in the JSON file, but I want to suggest a slightly different JSON structure.
The concepts would be:

- Keep the current output_ specification format: {primitive-name}#{counter}.{variable-name}. However, contrary to the fit and predict output_ argument, in this case the {variable-name} part is mandatory and cannot be skipped.
- Then, in the JSON I would add an output field. This field is optional, and can be a single string, a list of strings, or a dict. If it is a dict, it must contain an entry called "default", which will be considered the default pipeline output specification, just like in the previous steps. And, apart from the default, any other named output specifications can be added.

Some examples of possible specifications:
"output": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
"output": [
"mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
]
"output": {
"default": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y,
}
"output": {
"default": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y,
"debug": [
"mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.X",
"mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.y",
"mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.target_index",
"keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat",
]
}
Notice how the first three examples are completely equivalent, and only the last one introduces an alternative output.
Also, internally, all three options will end up represented in the dict format, with an entry called "default".
Finally, notice how the default output is NOT included in the list of debug outputs.
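The normalization into the dict format could be sketched roughly like this (a minimal illustration with a hypothetical function name, not the actual MLBlocks implementation):

```python
def normalize_outputs(output_spec):
    """Normalize an ``output`` entry into a dict with a ``default`` key.

    A single string or a list of strings is wrapped into a dict, so the
    three equivalent forms all end up in the same representation.
    """
    if isinstance(output_spec, str):
        output_spec = [output_spec]

    if isinstance(output_spec, list):
        output_spec = {"default": output_spec}

    # Also normalize single-string values inside the dict into lists
    return {
        name: [spec] if isinstance(spec, str) else spec
        for name, spec in output_spec.items()
    }
```

With this, all three equivalent specifications above normalize to the same dict.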
Now, the behavior will be: when executing the pipeline, the output_ argument will allow the user to either specify specific outputs (like in the current behavior) or give "named outputs". These named outputs can be "default" or any other name specified, like "debug".

And the internal behavior will be:

- If no output_ is given, "default" will be used.
- The output_ argument can receive a named output ("default", "debug", etc.), one individual output specification ("{primitive-name}#{counter}.{variable-name}"), or a list combining them.
- Finally, when returning, if the output specification ends up having a single element, that element will be returned alone. If more than one element exists in the output specification, all the elements will be returned as a tuple, in the exact same order, like in any multi-output method call.
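The resolution and return-shape rules could be sketched roughly as follows (hypothetical helper names, and a context simplified to a dict keyed by specification strings; this is an illustration, not the real implementation):

```python
def resolve_output_specs(output_, named_outputs):
    """Expand ``output_`` into a flat list of variable specifications.

    ``output_`` may be a named output (a key of ``named_outputs``), an
    individual "{primitive}#{counter}.{variable}" string, or a list
    combining both. ``None`` falls back to "default".
    """
    if output_ is None:
        output_ = "default"

    if isinstance(output_, str):
        output_ = [output_]

    specs = []
    for entry in output_:
        # Named outputs expand to their list of variables;
        # anything else is treated as an individual specification.
        specs.extend(named_outputs.get(entry, [entry]))

    return specs


def build_return_value(specs, context):
    """Return a single value alone, or multiple values as a tuple,
    preserving the order of the specifications."""
    values = tuple(context[spec] for spec in specs)
    return values[0] if len(values) == 1 else values
```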
Following this specification, if a pipeline is created using the last output example above, all these calls would be valid:
# return the default output, which is the y in the last primitive
anomalies = pipeline.predict(X)
anomalies = pipeline.predict(X, output_="default")
# return ONLY the debug outputs
X, y, target_index, y_hat = pipeline.predict(X, output_="debug")
# return BOTH the default and the debug outputs
anomalies, X, y, target_index, y_hat = pipeline.predict(X, output_=["default", "debug"])
# return ONLY one variable, y_hat
y_hat = pipeline.predict(X, output_="keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat")
# return BOTH the default output and one variable
anomalies, y_hat = pipeline.predict(X, output_=["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])
On a side note, the "get the whole context" behavior from the current implementation should be kept. This means that, even though the JSON specification will always require the {variable-name}, the output_ contents can point at a particular primitive context without a variable. In this case, a deep copy of that context will be returned in its place.
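That whole-context fallback could be handled at fetch time, roughly like this (a sketch with a hypothetical function name, and a context simplified to a dict keyed by specification strings):

```python
import copy


def fetch_output(spec, context):
    """Return a single variable, or a deep copy of the whole context
    when the specification names a primitive but no variable."""
    # After the "#{counter}" part, a "." indicates a variable name
    if "." in spec.split("#")[-1]:
        # "{primitive}#{counter}.{variable}": return that variable
        return context[spec]

    # "{primitive}#{counter}" with no variable: whole-context behavior.
    # A deep copy protects the internal context from external mutation.
    return copy.deepcopy(context)
```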
Here is an additional proposal on top of the previous one.
Apart from specifying the outputs in the JSON file as a single string, allow them to be specified as a dictionary with two entries:

- name: a final name for the output. For example, "anomalies".
- variable: the output specification from above: {primitive-name}#{counter}.{variable-name}.

On top of that, add these two methods to the MLPipeline object:

- get_outputs(outputs=None): Return the list of dictionaries with the specification of the outputs that will be returned. If no outputs are passed, return the default outputs. Otherwise, if some outputs specification is given, compute the outputs and return the list of their specifications.
- get_output_names(outputs=None): Just like get_outputs, but return the name of each output instead of the complete specification. If an output has no name because it was a single string, return the string.

For example, if the pipeline JSON specifies:
"output": {
"default": {
"name": "events"
"variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y",
},
"debug": [
{
"name": "X",
"variable": "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.X",
},
{
"name": "y",
"variable": "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.y",
},
{
"name": "index",
"variable: "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.target_index",
{
"name": "y_hat",
"variable": "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat",
}
]
}
One can do:
>>> pipeline.get_outputs()
[
    {
        "name": "events",
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
    }
]
>>> pipeline.get_output_names()
["events"]
>>> pipeline.get_outputs(["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])
[
    {
        "name": "events",
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
    },
    "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"
]
>>> pipeline.get_output_names(["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])
["events", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"]
>>> pipeline.get_output_names(["default", "debug"])
["events", "X", "y", "index", "y_hat"]
And, potentially:
>>> outputs = ["default", "debug"]
>>> output_names = pipeline.get_output_names(outputs)
>>> output_values = pipeline.predict(data, output_=outputs)
>>> output_dict = dict(zip(output_names, output_values))
>>> output_dict
{
    "events": ...,
    "X": ...,
    "y": ...,
    ...
}
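Under that scheme, resolving names only requires distinguishing dict entries from plain specification strings. A rough sketch (illustrative only, with hypothetical names; not the real MLPipeline code):

```python
def get_output_names(outputs, named_outputs):
    """Resolve ``outputs`` into a flat list of final names.

    Named outputs expand to their entries; dict entries contribute
    their ``name``; plain specification strings are returned as-is.
    """
    names = []
    for output in outputs:
        # Expand named outputs; unknown entries are individual specs
        entries = named_outputs.get(output, [output])
        if not isinstance(entries, list):
            entries = [entries]

        for entry in entries:
            names.append(entry["name"] if isinstance(entry, dict) else entry)

    return names
```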
Description
We want to introduce a way of specifying exactly which variable(s) from which primitive(s) should be returned. This way we would have the ability to get multiple intermediate outputs from the pipeline without needing to return the whole context.
Possible approach
We let the user define in the pipeline JSON which intermediate outputs they want to see. Using that information, MLBlocks keeps track of the outputs while iterating over the primitives and returns a dictionary containing all of them. The JSON could look like:
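As a sketch, such a specification might look something like this (the field names and structure are purely illustrative, not a settled format):

```json
{
    "primitives": [
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences",
        "keras.Sequential.LSTMTimeSeriesRegressor",
        "mlprimitives.custom.timeseries_anomalies.find_anomalies"
    ],
    "intermediate_outputs": [
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.X",
        "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"
    ]
}
```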
Also, we might want to add a general output field to the JSON, where the user can specify what the last output of the pipeline will be; that output would then be returned as an array. Then we would have both the general output of the pipeline and the intermediate outputs.
@csala you already had some specifics about the implementation in mind, so please let me know what you think about it and how you would do it.