BAMresearch / probeye

A general framework for setting up parameter estimation problems.
MIT License

Vectorization with emcee solver #83

Open JanKoune opened 2 years ago

JanKoune commented 2 years ago

Emcee allows for vectorized calls to the log-likelihood function by setting vectorize=True in the EnsembleSampler. With this option enabled, the input to the log-likelihood function is a 2D array of theta values of size (n_walkers x n_theta). Doing this in probeye results in an error in check_parameter_domains (see minimal example below).

Having this functionality could be useful.

JanKoune commented 2 years ago

A minimal example based on the linear regression integration test:

import numpy as np
from probeye.definition.inverse_problem import InverseProblem
from probeye.definition.forward_model import ForwardModelBase
from probeye.definition.sensor import Sensor
from probeye.definition.likelihood_model import GaussianLikelihoodModel
from probeye.inference.emcee.solver import EmceeSolver

# ============================================================================ #
#                              Set numeric values                              #
# ============================================================================ #
# Options
n_steps = 200
n_initial_steps = 100
n_walkers = 20

# 'true' value of a, and its normal prior parameters
a_true = 2.5
mean_a = 2.0
std_a = 1.0

# 'true' value of b, and its normal prior parameters
b_true = 1.7
mean_b = 1.0
std_b = 1.0

# 'true' value of additive error sd, and its uniform prior parameters
sigma = 0.5
low_sigma = 0.0
high_sigma = 0.8

# the number of generated experiment_names and seed for random numbers
n_tests = 50
seed = 1

# ============================================================================ #
#                           Define the Forward Model                           #
# ============================================================================ #

class LinearModel(ForwardModelBase):
    def interface(self):
        self.parameters = [{"a": "m"}, "b"]
        self.input_sensors = Sensor("x")
        self.output_sensors = Sensor("y", std_model="sigma")

    def response(self, inp: dict) -> dict:
        # this method *must* be provided by the user
        x = inp["x"]
        m = inp["m"]
        b = inp["b"]
        return {"y": m * x + b}

    def jacobian(self, inp: dict) -> dict:
        x = inp["x"]  # vector
        one = np.ones((len(x), 1))
        return {"y": {"x": None, "m": x.reshape(-1, 1), "b": one}}

# ============================================================================ #
#                         Define the Inference Problem                         #
# ============================================================================ #
problem = InverseProblem("Linear regression (AME)")

problem.add_parameter(
    "a",
    "model",
    tex="$a$",
    info="Slope of the graph",
    prior=("normal", {"mean": mean_a, "std": std_a}),
)
problem.add_parameter(
    "b",
    "model",
    info="Intersection of graph with y-axis",
    tex="$b$",
    prior=("normal", {"mean": mean_b, "std": std_b}),
)
problem.add_parameter(
    "sigma",
    "likelihood",
    domain="(0, +oo)",
    tex=r"$\sigma$",
    info="Standard deviation of the zero-mean additive model error",
    prior=("uniform", {"low": low_sigma, "high": high_sigma}),
)

# add the forward model to the problem
linear_model = LinearModel("LinearModel")
problem.add_forward_model(linear_model)

# ============================================================================ #
#                    Add test data to the Inference Problem                    #
# ============================================================================ #
# data-generation; normal likelihood with constant variance around each point
np.random.seed(seed)
x_test = np.linspace(0.0, 1.0, n_tests)
y_true = linear_model.response(
    {linear_model.input_sensor.name: x_test, "m": a_true, "b": b_true}
)[linear_model.output_sensor.name]
y_test = np.random.normal(loc=y_true, scale=sigma)

# add the experimental data
problem.add_experiment(
    "TestSeries_1",
    fwd_model_name="LinearModel",
    sensor_values={
        linear_model.input_sensor.name: x_test,
        linear_model.output_sensor.name: y_test,
    },
)

# ============================================================================ #
#                           Add likelihood model(s)                            #
# ============================================================================ #
# add the likelihood model to the problem
problem.add_likelihood_model(
    GaussianLikelihoodModel(
        prms_def="sigma",
        experiment_name="TestSeries_1",
        model_error="additive",
        name="SimpleLikelihoodModel",
    )
)

# ============================================================================ #
#                    Solve problem with inference engine(s)                    #
# ============================================================================ #
emcee_solver = EmceeSolver(
    problem,
    show_progress=True,
)

inference_data = emcee_solver.run_mcmc(
    n_walkers=n_walkers,
    n_steps=n_steps,
    n_initial_steps=n_initial_steps,
    vectorize=True
)
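The failure mode can be reproduced in isolation: a domain check written for scalar parameter values breaks once it receives a whole array of walker positions. The sketch below is purely illustrative (these function names are hypothetical, not probeye's actual check_parameter_domains implementation):

```python
import numpy as np

# Illustrative only -- not probeye's actual check_parameter_domains code.
def check_domain_scalar(value, low, high):
    # fine for a float, but raises ValueError ("truth value of an array
    # is ambiguous") when `value` is an array of walker positions
    return low < value < high

def check_domain_vectorized(values, low, high):
    # one boolean per walker, reduced with np.all
    values = np.asarray(values)
    return bool(np.all((values > low) & (values < high)))

print(check_domain_scalar(0.5, 0.0, 0.8))                            # True
print(check_domain_vectorized(np.array([0.5, 0.3, 0.7]), 0.0, 0.8))  # True
```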
aklawonn commented 2 years ago

I looked into this, and I saw that quite a few changes are necessary to make this feature possible. But I also think the code will get more complicated, because there will be many more ifs and for-loops in a lot of methods and functions. So it is possible, but at the cost of additional code complexity. If there is no urgent need, I would not implement this at the moment.