digiLab-ai / twinLab-Interface

Augmenting engineering workflows with Probabilistic Machine Learning
https://twinlab.ai/

Feature Request: Using `emulator.learn()` with Simulation Uncertainty Data #4

Open LucasPigott opened 2 months ago

LucasPigott commented 2 months ago

Is your feature request related to a problem? If so, please describe.

I am trying to run emulator.learn() with a simulation that outputs two datasets: the simulation results and the standard deviations of those results. When I try this, I get the error: ValueError: The specified estimator type, fixed_noise_gp, is not currently supported. Please check the EstimatorParams documentation for more available estimator types. When I try a different estimator type, I get the error: ValueError: Must pass 2-d input. shape=(2, 1, 10).

Describe the solution you'd like

I’d like to be able to use emulator.learn() on a simulation with noise and to train the emulator on this output noise, with the noise dataset updated alongside the results dataset inside the emulator.learn() function. Below, I've included an example of the code I would like to be able to run using emulator.learn().

Describe alternatives you've considered

Emulator.learn() works when I return only the simulation results and don’t train the emulator on the simulation noise.

Possible workarounds

I’ve been able to use emulator.recommend() in a function that performs active learning on my dataset with a fixed noise GP. The function recommends a new data point, uploads the simulation results and simulation noise for that point, and retrains the emulator.

Here is an example of a simulation function that generates a results dataset and a noise dataset, which I would like to run with a fixed noise GP inside emulator.learn():

[tool.poetry.dependencies]
python = ">=3.10,<3.13"
twinlab = "2.11.0"
scikit-learn = "^1.5.1"

Code making example initial datasets of simulation results and corresponding uncertainty:

import pandas as pd
import twinlab as tl
import numpy as np

num_inputs = 3
num_outputs = 10
num_rows = 10

# Create a simple dataset with 3 input columns and 10 output columns
input_columns = [f"x_{i}" for i in range(num_inputs)]
output_columns = [f'y_{i}' for i in range(num_outputs)]

# Create random input data (10 samples)
inputs = np.random.rand(num_rows, len(input_columns)) * 10  # Random values between 0 and 10

# Create correlated output data with some noise
weights = np.random.rand(num_outputs, num_inputs)  # Random weights for linear combination
outputs = np.dot(inputs, weights.T) + np.random.randn(num_rows, num_outputs) * 0.5  # Linear relationship with noise

# Create corresponding standard deviations (can be a function of the outputs, or fixed)
std_devs = np.abs(np.random.randn(num_rows, len(output_columns)) * 0.2)  # Smaller standard deviations

# Convert to DataFrames
input_df = pd.DataFrame(inputs, columns=input_columns)
output_df = pd.DataFrame(outputs, columns=output_columns)
std_df = pd.DataFrame(std_devs, columns=output_columns)

# Make sim_results_df
simulation_results = pd.concat([input_df, output_df], axis=1)

# Display the results
simulation_results
# Display the noise dataframe
std_df

Make an example function that returns simulation outputs and simulation uncertainty for a set of inputs:

def run_simulation(input_params):
    num_outputs = 10  # Number of output columns

    # If input_params is a DataFrame, convert it to a NumPy array
    if isinstance(input_params, pd.DataFrame):
        input_params = input_params.values.flatten()  # Flatten to 1D array

    # Ensure input_params is now a 1D NumPy array
    input_params = np.asarray(input_params).flatten()

    # Example simple model: outputs are linear combinations of inputs with some noise
    outputs = np.dot(input_params, np.random.rand(len(input_params), num_outputs)) + np.random.randn(num_outputs)

    # Example standard deviations (simulated here as random noise)
    std_devs = np.abs(np.random.randn(num_outputs) * 0.2)  # Smaller standard deviations

    # Convert outputs and standard deviations to DataFrames
    output_columns = [f'y_{i}' for i in range(num_outputs)]
    output_df = pd.DataFrame([outputs], columns=output_columns)
    std_df = pd.DataFrame([std_devs], columns=output_columns)

    return output_df, std_df
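
The function above returns a tuple of two one-row DataFrames rather than a single table. A minimal sketch of why that could produce the "Must pass 2-d input. shape=(2, 1, 10)" error follows. It assumes (this is not confirmed by the twinLab source) that the return value is converted to an array and then passed to a DataFrame constructor somewhere inside emulator.learn(); the zero and one placeholder values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Two one-row frames shaped like the return value of run_simulation
# (placeholder values, not real simulation output)
output_columns = [f"y_{i}" for i in range(10)]
output_df = pd.DataFrame([np.zeros(10)], columns=output_columns)
std_df = pd.DataFrame([np.ones(10)], columns=output_columns)

# Stacking the pair gives a 3-D array: (2 frames, 1 row, 10 outputs)
stacked = np.stack([output_df.to_numpy(), std_df.to_numpy()])
print(stacked.shape)

# pandas only builds DataFrames from 1-D or 2-D data, so this raises ValueError
try:
    pd.DataFrame(stacked)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```

If that assumption holds, the shape in the error is simply the (results, noise) pair being treated as one block of data, which matches the observation that returning only the results works.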

Uploading the initial datasets:

dataset = tl.Dataset('feature_request_dataset')

# Upload the dataset, passing in the dataframe
dataset.upload(simulation_results)
uncertainty_dataset = tl.Dataset('feature_request_uncertainty_dataset')

# Upload the dataset, passing in the uncertainty dataframe
uncertainty_dataset.upload(std_df)

Initialising and training the emulator:

# Initialise emulator
emulator_id = "feature_request_emulator"

emulator = tl.Emulator(id=emulator_id)
estimator_params = tl.EstimatorParams(
    estimator_type="fixed_noise_gp"
)
train_params = tl.TrainParams(
    estimator='gaussian_process_regression',
    dataset_std=uncertainty_dataset,
    estimator_params=estimator_params,
)
# Train the emulator using the train method
emulator.train(
    dataset=dataset,
    inputs=input_columns,
    outputs=output_columns,
    params=train_params,
)

Using emulator.learn():

emulator.learn(
    dataset=dataset, 
    inputs=input_columns, 
    outputs=output_columns, 
    num_loops=1, 
    num_points_per_loop=1, 
    acq_func="LogEI", 
    simulation=run_simulation, 
    train_params=train_params, 
)

ValueError                                Traceback (most recent call last)
in <cell line: 1>()
----> 1 emulator.learn(
      2     dataset=dataset,
      3     inputs=input_columns,
      4     outputs=output_columns,
      5     num_loops=1,

/usr/local/lib/python3.10/dist-packages/twinlab/emulator.py in learn(self, dataset, inputs, outputs, num_loops, num_points_per_loop, acq_func, simulation, train_params, recommend_params, verbose)
   1319         ]
   1320         if train_params.estimator_params.estimator_type in invalid_GP_estimators:
-> 1321             raise ValueError(
   1322                 f"The specified estimator type, {train_params.estimator_params.estimator_type}, is not currently supported. Please check the EstimatorParams documentation for more available estimator types."
   1323             )

ValueError: The specified estimator type, fixed_noise_gp, is not currently supported. Please check the EstimatorParams documentation for more available estimator types.

alexander-mead commented 2 months ago

Thanks for this issue. The error is a bit misleading: it means that fixed_noise_gp is not supported by Emulator.learn(), not that it is unsupported altogether.

We are working on adding support for this, but it won't be available for a few weeks.

In the meantime, a workaround is to write the active learning out yourself in long form (have a look at the code to see how to do this). It is essentially a loop over a series of calls to "train"/"recommend". That way you will be able to feed the noise predictions back into the emulator, which is what is currently missing from the implementation of Emulator.learn().
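
A minimal sketch of that long-form loop, written against injected callables rather than twinLab itself so the control flow can be read and tested in isolation. The names manual_active_learning, train, recommend and simulate are hypothetical. In practice, train would presumably wrap dataset.upload() plus emulator.train() with TrainParams(dataset_std=...), and recommend would wrap emulator.recommend(); the exact signatures and return types of those calls are assumptions here and should be checked against the twinLab documentation.

```python
from typing import Callable, Tuple

import pandas as pd


def manual_active_learning(
    train: Callable[[pd.DataFrame, pd.DataFrame], None],
    recommend: Callable[[int], pd.DataFrame],
    simulate: Callable[[pd.DataFrame], Tuple[pd.DataFrame, pd.DataFrame]],
    results: pd.DataFrame,
    noise: pd.DataFrame,
    num_loops: int,
    num_points_per_loop: int = 1,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical train/recommend loop that keeps results and noise row-aligned."""
    # Initial fit on the starting data and its per-point standard deviations
    train(results, noise)
    for _ in range(num_loops):
        # Ask the emulator where to sample next (one row per recommended point)
        candidates = recommend(num_points_per_loop)
        new_results, new_noise = [], []
        for i in range(len(candidates)):
            point = candidates.iloc[[i]].reset_index(drop=True)
            # The simulation returns outputs and their standard deviations
            out_df, std_df = simulate(point)
            new_results.append(
                pd.concat([point, out_df.reset_index(drop=True)], axis=1)
            )
            new_noise.append(std_df)
        # Grow both tables together so row k of noise matches row k of results
        results = pd.concat([results, *new_results], ignore_index=True)
        noise = pd.concat([noise, *new_noise], ignore_index=True)
        # Retrain with the updated noise, the step Emulator.learn() currently lacks
        train(results, noise)
    return results, noise
```

With the run_simulation function from the issue passed as simulate, each loop appends one row to both the results table and the noise table before retraining.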