MLBazaar / MLBlocks

A library for composing end-to-end tunable machine learning pipelines.
https://mlbazaar.github.io/MLBlocks

Return specified intermediate outputs #104

Closed AlexanderGeiger closed 5 years ago

AlexanderGeiger commented 5 years ago

Description

We want to introduce a way of specifying exactly which variable(s) from which primitive(s) should be returned. This way we could get multiple intermediate outputs from the pipeline without needing to return the whole context.

Possible approach

We let the user define in the pipeline JSON which intermediate outputs they want to see. Using that information, MLBlocks keeps track of those outputs while iterating over the primitives and returns a dictionary containing all of them. The JSON could look like:

{
    ...
    "intermediate_output": [
        "sklearn.preprocessing.MinMaxScaler#1.X",
        "keras.Sequential.LSTMTimeSeriesRegressor#1.y"
    ]
}
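For illustration, here is a rough sketch of how those outputs could be collected while the primitives run. This is not the actual MLBlocks internals; the helper name, the requested/collected arguments, and the block iteration shown in the comment are all hypothetical.

def collect_intermediate(block_name, context, requested, collected):
    """Copy any requested "<primitive>#<n>.<variable>" entries produced by this block."""
    for spec in requested:
        primitive, _, variable = spec.rpartition('.')
        if primitive == block_name and variable in context:
            collected[spec] = context[variable]
    return collected

# Hypothetical use inside the pipeline loop:
#     collected = {}
#     for block_name, block in blocks.items():
#         context.update(block.produce(**context))
#         collect_intermediate(block_name, context, requested, collected)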

Also, we might want to add a general output field to the JSON, where the user can specify what the final output of the pipeline should be; that output would be returned as an array. We would then have both the general output of the pipeline and the intermediate outputs.
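A rough sketch of what that could look like in the JSON, treating both field names as placeholders for this discussion:

{
    ...
    "output": "keras.Sequential.LSTMTimeSeriesRegressor#1.y",
    "intermediate_output": [
        "sklearn.preprocessing.MinMaxScaler#1.X"
    ]
}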

@csala you already had some specifics about the implementation in mind, so please let me know what you think about it and how you would do it.

dyuliu commented 5 years ago

@AlexanderGeiger

I like this way of dumping the intermediate outputs.

One question about adding a general output field to the JSON: what is the difference between the "general output" field and the "intermediate output"? They seem to serve the same purpose. If so, why not just rename "intermediate_output" to "output"?

csala commented 5 years ago

I like the approach of specifying the outputs in the JSON file, but I want to suggest a slightly different JSON structure.

The concepts would be:

Then, in the JSON I would do the following:

Some examples of possible specifications:

"output": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"

"output": [
    "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
]

"output": {
    "default": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y,
}

"output": {
    "default": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y,
    "debug": [
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.X",
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.y",
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.target_index",
        "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat",
    ]
}

Notice how the first three examples are completely equivalent, and only the last one introduces an alternative output. Also, internally, all three options will end up represented in the dict format, with an entry called "default". Finally, notice how the default output is NOT included in the list of debug outputs.
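As a sketch of that internal representation (a hypothetical helper, not the actual MLBlocks code), the equivalent forms could be reduced to the dict format like this:

def normalize_output_spec(spec):
    """Reduce a str, list or dict output spec to a dict whose values are lists."""
    if isinstance(spec, str):
        spec = [spec]
    if isinstance(spec, list):
        spec = {'default': spec}
    return {
        name: [value] if isinstance(value, str) else list(value)
        for name, value in spec.items()
    }

# The first three examples above all normalize to:
# {'default': ['mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y']}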

Now, the behavior will be: when executing the pipeline, the output_ argument will allow the user to either specify individual outputs (as in the current behavior) or request "named outputs". These named outputs can be "default" or any other name specified in the JSON, like "debug".

And the internal behavior will be:

Finally, when returning, if the output specification ends up having a single element, that element will be returned alone. If more than one element exists in the output specification, all the elements will be returned as a tuple, in the exact same order, like in any multi-output method call.
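Assuming a normalized dict like the one sketched above, the resolution and return logic could look roughly like this (hypothetical helpers, not the final implementation):

def resolve_outputs(output_, named_outputs):
    """Expand named outputs ("default", "debug", ...) into a flat list of variable specs."""
    if isinstance(output_, str):
        output_ = [output_]
    variables = []
    for entry in output_:
        # a known name expands to its variables; anything else is taken as a variable spec
        variables.extend(named_outputs.get(entry, [entry]))
    return variables

def format_return(values):
    """A single output is returned alone; multiple outputs come back as a tuple."""
    return values[0] if len(values) == 1 else tuple(values)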

Following this specification, if a pipeline is created using the last output example above, all these calls would be valid:

# return the default output, which is the y in the last primitive
anomalies = pipeline.predict(X)
anomalies = pipeline.predict(X, output_="default")

# return ONLY the debug outputs
X, y, target_index, y_hat = pipeline.predict(X, output_="debug")

# return BOTH the default and the debug outputs
anomalies, X, y, target_index, y_hat = pipeline.predict(X, output_=["default", "debug"])

# return ONLY one variable, y_hat
y_hat = pipeline.predict(X, output_="keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat")

# return the default output and also one variable
y_hat = pipeline.predict(X, output_=["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])

On a side note, the "get the whole context" behavior from the current implementation should be kept. This means that, even though the JSON specification will always require the {variable-name}, the output_ contents can point at a particular primitive context without a variable. In this case, a deep copy of that context will be returned in that place.
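A minimal sketch of that lookup, assuming the caller can tell which specs name a primitive rather than a variable (the helper name and arguments are hypothetical):

import copy

def fetch_output(spec, block_names, context):
    """Return the named variable, or a deep copy of the whole context when the
    spec only points at a primitive (no variable name)."""
    if spec in block_names:
        return copy.deepcopy(context)
    variable = spec.rsplit('.', 1)[-1]
    return context[variable]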

csala commented 5 years ago

Here is an additional proposal on top of the previous one.

Apart from specifying the outputs in the JSON file as a single string, allow them to be specified as a dictionary with two entries: "name" and "variable".

On top of that, add these two methods to the MLPipeline object: get_outputs and get_output_names.

For example, if the pipeline JSON specifies:

"output": {
    "default": {
        "name": "events"
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y",
    },
    "debug": [
        {
            "name": "X",
            "variable": "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.X",
        },
        {
            "name": "y",
            "variable": "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.y",
        },
        {
            "name": "index",
            "variable: "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.target_index",
        {
            "name": "y_hat",
            "variable": "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat",
        }
    ]
}

One can do:

>>> pipeline.get_outputs()
[
    {
        "name": "events",
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
    }
]
>>> pipeline.get_output_names()
["events"]
>>> pipeline.get_outputs(["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])
[
    {
        "name": "events",
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
    },
    "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"
]
>>> pipeline.get_output_names(["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])
["events", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"]
>>> pipeline.get_output_names(["default", "debug"])
["events", "X", "y", "index", "y_hat"]

And, potentially:

>>> outputs = ["default", "debug"]
>>> output_names = pipeline.get_output_names(outputs)
>>> output_values = pipeline.predict(data, output_=outputs)
>>> output_dict = dict(zip(output_names, output_values))
>>> output_dict
{
    "anomalies": ...,
    "X": ...
    "y": ...
    ...
}