BAMresearch / probeye

A general framework for setting up parameter estimation problems.
MIT License

Surrogate models #78

Open aklawonn opened 2 years ago

aklawonn commented 2 years ago

It would be great if probeye had the possibility to directly define and use surrogate models in order to reduce the computing time when working with computationally expensive forward models.

joergfunger commented 2 years ago

I could imagine the metamodel being a forward model, and then only the metamodel would be passed to the inference problem. I would also suggest having three steps to create a metamodel

my_metamodel = gp_forwardmodel(orig_forward_model, additional_parameters_of_gp)
training_samples_input = problem.sample_from_prior()
training_samples_output = orig_forward_model.response(training_samples_input)
my_metamodel.train(training_samples_input, training_samples_output, training_parameters)

and then use the metamodel as a "normal" forward model

problem.add_forward_model(my_metamodel)
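For concreteness, here is a minimal, self-contained sketch of what such a gp_forwardmodel wrapper could look like (all class and method names are hypothetical and not part of probeye's actual API; scikit-learn's GaussianProcessRegressor is used as the GP backend):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

class GPSurrogateModel:
    """Wraps an expensive forward model with a Gaussian-process surrogate."""

    def __init__(self, orig_forward_model, kernel=None):
        self.orig_forward_model = orig_forward_model  # kept, e.g. for adaptive refinement
        self.gp = GaussianProcessRegressor(kernel=kernel or RBF(), normalize_y=True)
        self.trained = False

    def train(self, training_inputs, training_outputs):
        # training_inputs: (n_samples, n_params), training_outputs: (n_samples,)
        self.gp.fit(np.atleast_2d(training_inputs), np.ravel(training_outputs))
        self.trained = True

    def response(self, theta):
        # evaluate the cheap surrogate instead of the expensive forward model
        return self.gp.predict(np.atleast_2d(theta))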

aklawonn commented 2 years ago

I think this is a good structure. However, I'm not sure if we want to have the training of the surrogate model in the definition part of the problem. In your proposal, one would have to wait (possibly a long time) until the surrogate model is ready to be added, and only then could the definition of the problem continue. Maybe it would be better to do the surrogate training in the solver routine after the problem is fully defined.

joergfunger commented 2 years ago

I think this would not make a difference, because the Python engine would then wait the same amount of time in the solver. In the long term, I would probably even decompose that into two workflow steps (e.g. with pydoit or nextflow), such that the result of the metamodel training (the outputs) can be stored and is only recomputed if needed.
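A hedged sketch of such a two-step workflow with pydoit (the script and file names below are made up for illustration), placed in a dodo.py:

def task_train_surrogate():
    """Train the surrogate and store it; pydoit re-runs this only if the dependencies change."""
    return {
        "actions": ["python train_surrogate.py"],
        "file_dep": ["forward_model.py", "prior_samples.npz"],
        "targets": ["trained_surrogate.pkl"],
    }

def task_run_inference():
    """Run the inference problem using the stored, already trained surrogate."""
    return {
        "actions": ["python run_inference.py"],
        "file_dep": ["trained_surrogate.pkl"],
    }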

aklawonn commented 2 years ago

That's true, it wouldn't make a difference in terms of computation time. And the pydoit approach also makes sense. However, I would find the problem definition structure cleaner if no computations happened in the definition phase of the problem. All of the heavy lifting would happen after the problem is defined (and checks have made sure that the problem definition makes sense).

Maybe the surrogate model could have a flag that indicates whether it has already been trained or not. This would be checked by the solver, and if no training has been done yet (i.e., no respective training files are found), it would run the training before starting the inference step.
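A hedged sketch of how that check could look inside a solver routine (the solver signature, the trained flag and the file name are all hypothetical):

import os

def run_solver(problem, surrogate, training_file="surrogate_training.pkl"):
    # train on demand: only if the surrogate is untrained and no stored training result exists
    if not surrogate.trained and not os.path.exists(training_file):
        inputs = problem.sample_from_prior()          # as proposed above
        outputs = [surrogate.orig_forward_model.response(x) for x in inputs]
        surrogate.train(inputs, outputs)
    # ... then start the actual inference step with the (now trained) surrogate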

joergfunger commented 2 years ago

Yes, that could also be done (so the training is called internally if the model has not been trained before). The only thing that is important to me is that we store the metamodel as a standard forward model in the parameter estimation problem, not both in parallel (though the metamodel should actually have the exact forward model stored, that model should just not be used within the inference problem). This would make the implementation easier and less coupled (because the metamodel is just another forward model and can be developed and tested independently of the parameter estimation problem).

aklawonn commented 2 years ago

Sure thing, I will update the code accordingly.

atulag0711 commented 2 years ago

> I could imagine the metamodel being a forward model, and then only the metamodel would be passed to the inference problem. I would also suggest having three steps to create a metamodel
>
> my_metamodel = gp_forwardmodel(orig_forward_model, additional_parameters_of_gp)
> training_samples_input = problem.sample_from_prior()
> training_samples_output = orig_forward_model.response(training_samples_input)
> my_metamodel.train(training_samples_input, training_samples_output, training_parameters)
>
> and then use the metamodel as a "normal" forward model
>
> problem.add_forward_model(my_metamodel)

I think you talked about adaptive training (adaptively querying the forward model to adhere to a fixed computational budget). In that scenario, this would be difficult.

my_metamodel = gp_forwardmodel(orig_forward_model, additional_parameters_of_gp)
training_samples_input = problem.sample_from_prior()

my_metamodel.train(x_init = training_samples_input)
problem.add_surrogate_model(my_metamodel)

Surrogating just needs a forward model with input and output, some initial input values (maybe samples from the prior) and the bounds of the inputs. The training is performed externally (nothing to do with the probeye-based interface). Once trained, it can be added to the inference problem. If the training of the surrogate is done in probeye, it will complicate things IMO.
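A hedged sketch of such an externally performed, budget-constrained adaptive training loop (all names are hypothetical; the GP is assumed to expose scikit-learn's fit/predict interface):

import numpy as np

def adaptive_train(gp, forward_model, x_init, candidate_pool, budget=20):
    X = list(x_init)
    y = [forward_model(x) for x in X]                 # initial expensive evaluations
    for _ in range(budget):
        gp.fit(np.atleast_2d(X), np.ravel(y))
        # query the expensive model where the GP is most uncertain
        _, std = gp.predict(np.atleast_2d(candidate_pool), return_std=True)
        x_new = candidate_pool[int(np.argmax(std))]
        X.append(x_new)
        y.append(forward_model(x_new))                # one more expensive call
    gp.fit(np.atleast_2d(X), np.ravel(y))             # final fit on all collected data
    return gp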

joergfunger commented 2 years ago

That is right, but all of those are already given in probeye. So creating a metamodel, e.g. using a GP based on an LHS of the prior and a computation of the corresponding forward model responses, would mean a single line of code. Sure, you could create your own metamodel outside and use it here, but that would be much more code to be added. At least for some standard cases, it would be nice to have this directly incorporated here.

joergfunger commented 2 years ago

As for the adaptive case, the metamodel would still be able to call additional forward model function evaluations, thus even an adaptive metamodel would work (since the metamodel has the forward model stored).

JanKoune commented 2 years ago

Adding to @atulag0711's comment regarding adaptive sampling, it may also be convenient to consider separating the metamodel (e.g. GP, NN, etc.) from the sampling approach (LHS, active/adaptive sampling, etc.), since these can be combined arbitrarily depending on the problem at hand. Below is a modified version of the code snippet that can deal with that case (I am not sure what a better term would be for the combined metamodel + sampler):

my_surrogate = Surrogate(orig_forward_model, surrogate_kwargs)
my_metamodel = Sampler(my_surrogate, sampler_kwargs)
problem.add_surrogate_model(my_metamodel)
my_metamodel.train(training_parameters)

Some notes on this approach:

  1. As long as all Metamodels implement the same methods (e.g. Metamodel.fit(), Metamodel.predict(), Metamodel.evaluate_metric()) and have standardized input/output shapes, the Sampler class can be agnostic to the Metamodel type (GP, NN, etc.); see the sketch after this list.
  2. The sampler must have access to the properties of the problem to get the names, prior distributions, types and bounds of the parameters, or any other information that is stored in the InferenceProblem class.
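A hedged sketch of the separation described in point 1 (the method names fit/predict/evaluate_metric follow the comment above; everything else is hypothetical):

from abc import ABC, abstractmethod

class Metamodel(ABC):
    # common interface so that any Sampler can work with any metamodel type
    @abstractmethod
    def fit(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...

    @abstractmethod
    def evaluate_metric(self, X_test, y_test): ...    # e.g. RMSE on hold-out data

class Sampler(ABC):
    """Generates training data for any Metamodel; agnostic to the metamodel type."""

    def __init__(self, metamodel, forward_model):
        self.metamodel = metamodel
        self.forward_model = forward_model

    @abstractmethod
    def sample(self, n_samples): ...                  # e.g. LHS or adaptive design

    def train(self, n_samples):
        X = self.sample(n_samples)
        y = [self.forward_model(x) for x in X]
        self.metamodel.fit(X, y)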

joergfunger commented 2 years ago

Separating the sampler on the script level is a good idea. I think it would probably make sense to actually include at least some basic samplers in the parameter estimation problem, since e.g. the prior distributions etc. are all given (e.g. a method in ParameterEstimation such as sample_LHS_from_prior(num_samples=100)). This function should then return a format (dict) that we could directly use in the forward model (so essentially returning an array[num_samples] of dicts with all the parameters). And as mentioned above, I would not add the surrogate as an additional feature, but rather as a standard forward model. That said, we could still have a MetaModel base class (derived from ForwardModel) that implements the fit function, but predict would IMO be the function already implemented in ForwardModel (evaluate). What would be the metric?
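A hedged sketch of what such a sample_LHS_from_prior helper could return (the priors argument, a mapping from parameter name to scipy distribution, is a hypothetical stand-in for what the problem stores internally):

from scipy.stats import qmc, norm, uniform

def sample_LHS_from_prior(priors, num_samples=100):
    names = list(priors)
    lhs = qmc.LatinHypercube(d=len(names)).random(num_samples)   # uniform samples in [0, 1)
    # map each uniform column through the corresponding prior's inverse CDF
    return [
        {name: priors[name].ppf(row[i]) for i, name in enumerate(names)}
        for row in lhs
    ]

# usage: one dict per sample, directly usable as forward-model input
samples = sample_LHS_from_prior({"E": norm(30, 5), "nu": uniform(0.1, 0.3)}, num_samples=10)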