MLBazaar / MLBlocks

A library for composing end-to-end tunable machine learning pipelines.
https://mlbazaar.github.io/MLBlocks

Return specified intermediate outputs #104

Closed AlexanderGeiger closed 5 years ago

AlexanderGeiger commented 5 years ago

Description

We want to introduce a way of specifying exactly which variable(s) from which primitive(s) should be returned. This way we could get multiple intermediate outputs from the pipeline without needing to return the whole context.

Possible approach

We let the user define in the pipeline JSON which intermediate outputs they want to see. Using that information, MLBlocks keeps track of those outputs while iterating over the primitives and returns a dictionary containing all of them. The JSON could look like:

{
    ...
    "intermediate_output": [
        "sklearn.preprocessing.MinMaxScaler#1.X",
        "keras.Sequential.LSTMTimeSeriesRegressor#1.y"
    ]
}
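For illustration, here is a rough sketch of how those outputs could be collected while the primitives run. This is not the actual MLBlocks internals; the helper name, the requested/collected arguments, and the block iteration shown in the comment are all hypothetical.

def collect_intermediate(block_name, context, requested, collected):
    """Copy any requested "<primitive>#<n>.<variable>" entries produced by this block."""
    for spec in requested:
        primitive, _, variable = spec.rpartition('.')
        if primitive == block_name and variable in context:
            collected[spec] = context[variable]
    return collected

# Hypothetical use inside the pipeline loop:
#     collected = {}
#     for block_name, block in blocks.items():
#         context.update(block.produce(**context))
#         collect_intermediate(block_name, context, requested, collected)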

Also, we might want to add a general output field to the JSON, where the user can specify what the final output of the pipeline should be; that output would be returned as an array. We would then have both the general output of the pipeline and the intermediate outputs.
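A rough sketch of what that could look like in the JSON, treating both field names as placeholders for this discussion:

{
    ...
    "output": "keras.Sequential.LSTMTimeSeriesRegressor#1.y",
    "intermediate_output": [
        "sklearn.preprocessing.MinMaxScaler#1.X"
    ]
}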

@csala you already had some specifics about the implementation in mind, so please let me know what you think about it and how you would do it.

dyuliu commented 5 years ago

@AlexanderGeiger

I like this way of dumping the intermediate outputs.

One question about adding a general output field to the JSON: what is the difference between the "general output" field and the "intermediate output"? They seem to serve the same purpose. If so, why not just rename "intermediate_output" to "output"?

csala commented 5 years ago

I like the approach of specifying the outputs in the JSON file, but I want to suggest a slightly different JSON structure.

The concepts would be:

Then, in the JSON I would do the following:

Some examples of possible specifications:

"output": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"

"output": [
    "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
]

"output": {
    "default": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y,
}

"output": {
    "default": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y,
    "debug": [
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.X",
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.y",
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.target_index",
        "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat",
    ]
}

Notice how the first three examples are completely equivalent, and only the last one introduces an alternative output. Also, internally, all three options will end up represented in the dict format, with an entry called "default". Finally, notice how the default output is NOT included in the list of debug outputs.
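As a sketch of that internal representation (a hypothetical helper, not the actual MLBlocks code), the equivalent forms could be reduced to the dict format like this:

def normalize_output_spec(spec):
    """Reduce a str, list or dict output spec to a dict whose values are lists."""
    if isinstance(spec, str):
        spec = [spec]
    if isinstance(spec, list):
        spec = {'default': spec}
    return {
        name: [value] if isinstance(value, str) else list(value)
        for name, value in spec.items()
    }

# The first three examples above all normalize to:
# {'default': ['mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y']}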

Now, the behavior will be: when executing the pipeline, the output_ argument will allow the user to either specify individual outputs (as in the current behavior) or request "named outputs". These named outputs can be "default" or any other name specified in the JSON, like "debug".

And the internal behavior will be:

Finally, when returning, if the output specification ends up having a single element, that element will be returned alone. If more than one element exists in the output specification, all the elements will be returned as a tuple, in the exact same order, like in any multi-output method call.
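Assuming a normalized dict like the one sketched above, the resolution and return logic could look roughly like this (hypothetical helpers, not the final implementation):

def resolve_outputs(output_, named_outputs):
    """Expand named outputs ("default", "debug", ...) into a flat list of variable specs."""
    if isinstance(output_, str):
        output_ = [output_]
    variables = []
    for entry in output_:
        # a known name expands to its variables; anything else is taken as a variable spec
        variables.extend(named_outputs.get(entry, [entry]))
    return variables

def format_return(values):
    """A single output is returned alone; multiple outputs come back as a tuple."""
    return values[0] if len(values) == 1 else tuple(values)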

Following this specification, if a pipeline is created using the last output example above, all these calls would be valid:

# return the default output, which is the y in the last primitive
anomalies = pipeline.predict(X)
anomalies = pipeline.predict(X, output_="default")

# return ONLY the debug outputs
X, y, target_index, y_hat = pipeline.predict(X, output_="debug")

# return BOTH the default and the debug outputs
anomalies, X, y, target_index, y_hat = pipeline.predict(X, output_=["default", "debug"])

# return ONLY one variable, y_hat
y_hat = pipeline.predict(X, output_="keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat")

# return the default output and also one variable
y_hat = pipeline.predict(X, output_=["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])

On a side note, the "get the whole context" behavior from the current implementation should be kept. This means that, even though the JSON specification will always require the {variable-name}, the output_ contents can point at a particular primitive context without a variable. In this case, a deep copy of that context will be returned in that place.
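A minimal sketch of that lookup, assuming the caller can tell which specs name a primitive rather than a variable (the helper name and arguments are hypothetical):

import copy

def fetch_output(spec, block_names, context):
    """Return the named variable, or a deep copy of the whole context when the
    spec only points at a primitive (no variable name)."""
    if spec in block_names:
        return copy.deepcopy(context)
    variable = spec.rsplit('.', 1)[-1]
    return context[variable]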

csala commented 5 years ago

Here is an additional proposal on top of the previous one.

Apart from specifying the outputs in the JSON file as a single string, allow them to be specified as a dictionary with two entries: "name" and "variable".

On top of that, add these two methods to the MLPipeline object: get_outputs and get_output_names.

For example, if the pipeline JSON specifies:

"output": {
    "default": {
        "name": "events"
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y",
    },
    "debug": [
        {
            "name": "X",
            "variable": "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.X",
        },
        {
            "name": "y",
            "variable": "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.y",
        },
        {
            "name": "index",
            "variable: "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.target_index",
        {
            "name": "y_hat",
            "variable": "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat",
        }
    ]
}

One can do:

>>> pipeline.get_outputs()
[
    {
        "name": "events",
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
    }
]
>>> pipeline.get_output_names()
["events"]
>>> pipeline.get_outputs(["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])
[
    {
        "name": "events",
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
    },
    "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"
]
>>> pipeline.get_output_names(["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])
["events", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"]
>>> pipeline.get_output_names(["default", "debug"])
["events", "X", "y", "index", "y_hat"]

And, potentially:

>>> outputs = ["default", "debug"]
>>> output_names = pipeline.get_output_names(outputs)
>>> output_values = pipeline.predict(data, output_=outputs)
>>> output_dict = dict(zip(output_names, output_values))
>>> output_dict
{
    "anomalies": ...,
    "X": ...
    "y": ...
    ...
}