General Discussion regarding Ensemble

PaulJonasJost commented 3 months ago

There are currently many open issues regarding ensembles (for examples and further context, see #1357, #1349, #1296, #1294, #1291). We should use this issue for a general discussion of the purpose and reasonableness of Ensembles, and to pool ideas for future improvements. Here are aspects I think we should consider (based on my own opinion, open PRs and discussions); this is certainly not a complete list:

General Purpose Questions for the Ensemble class
- What scope should the Ensemble class cover? Currently, an Ensemble is only considered an accumulation of vectors, implicitly assuming that general model structure is always the same. This however does not need to be the case and a more general definition would be a tuple of (a parameter vector, a model).
- What use should an Ensemble class serve? Currently the Ensemble class serves as a loose collection of parameter vectors that can be summarised and used to create an ensemble prediction. However, to create a prediction a "predictor" is currently needed, which in my experience is so individualised that I find it questionable how much work we actually save people.
  - Related to this (and Predictor question below): If we leave the individualisation of the predictor up to each person, I think the ensemble is just a collection of parameter vectors as of now. For this sole purpose, I feel that we do not need our own module and might not even need our own class. Either we extend the summary information with things like "how was this ensemble created?", or we might consider removing it completely.
- What simulators should we support? Currently we call it Ensemble in general, but it implicitly needs an AMICI model, although we do support other simulators as well. It would be completely fine to only support AMICI ensembles, but then we should make that clear. The consideration here is how general we want ensembles to be, or whether we should perhaps rename the class to AmiciEnsemble (which would not necessarily need a module of its own).
General purpose Questions for the "Prediction/Predictor" Class
- I feel like the main task performed with an ensemble is a prediction. If we stay, for now, with the case where an objective function is built upon an underlying biological model, a single prediction for me boils down to "I want to model these entities {Parameters, States, Observables} over this time period with this set of parameters {condition parameters, optimised parameters}". This already covers a wide range of use cases, and more complex ones often boil down to some formula f(parameters, states, observables) instead of just the entities themselves. This functionality, however, is not only used in ensembles, but also, for example, to check model fits or to explore new conditions/"interventions". I think it might make sense to think about a Predictor class, as it would streamline a lot of visualization tasks and facilitate model exploration (there is the AmiciPredictor class, but I excluded it for the moment, as I think a more general discussion might be helpful). Things to consider here include:
- Do we even think something like this would help in general? If not, see "What use should an Ensemble class serve?" above as a follow-up consideration.
- How specific do we want to be? Each simulator has different functionality, different outputs and, most importantly, different ways to retrieve them. It is not plausible to write an "all purpose" predictor. Therefore, if we want one, we should think about the general structure and which simulators to support.
- What should the simulator be able to do? Assuming (and I would be fine with that) there is some kind of underlying SBML model, the main functionality for me would be to output a set of SBML_IDs for a given condition and timepoints. Anything further, like "all states" or "all observables", would be nice but just icing on the cake for me.
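To make the "entities at timepoints for a given condition" idea concrete, here is a minimal sketch of what such an interface could look like. All names (`Predictor`, `ExponentialDecayPredictor`, the `predict` signature) are hypothetical and not part of pyPESTO; the toy model A(t) = A0 * exp(-k * t) merely stands in for a real simulator.

```python
import math
from typing import Dict, List, Mapping, Protocol, Sequence


class Predictor(Protocol):
    """Hypothetical minimal interface: requested entity IDs in, trajectories out."""

    def predict(
        self,
        parameters: Mapping[str, float],
        condition: Mapping[str, float],
        entity_ids: Sequence[str],
        timepoints: Sequence[float],
    ) -> Dict[str, List[float]]:
        ...


class ExponentialDecayPredictor:
    """Toy stand-in for a simulator: A(t) = A0 * exp(-k * t)."""

    def predict(self, parameters, condition, entity_ids, timepoints):
        # Condition parameters override the optimised parameters.
        p = {**parameters, **condition}
        trajectories = {
            "A": [p["A0"] * math.exp(-p["k"] * t) for t in timepoints],
        }
        # Return only what was asked for.
        return {entity_id: trajectories[entity_id] for entity_id in entity_ids}


# An ensemble prediction is then one call per parameter vector.
predictor: Predictor = ExponentialDecayPredictor()
vectors = [{"k": 0.5}, {"k": 0.1}]
condition = {"A0": 2.0}
predictions = [
    predictor.predict(vector, condition, ["A"], [0.0, 1.0]) for vector in vectors
]
```

Any simulator-specific wrapper would only need to provide a `predict` method with this shape; "all states" or "all observables" could be layered on top as convenience arguments.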

Any thoughts on this are very welcome and also any further questions/considerations.

dilpath commented 3 months ago

Is there a downside to coupling it to PEtab? The upcoming PEtab Result format could cover ensemble predictions as simulation experiments. This could reduce and shift this part

to create a prediction a "predictor" is currently needed, which in my experience is so individualised that I find it questionable how much work we actually save people

into a nicer/simpler format.

I agree, ensembles themselves can be completely decoupled from AMICI or predictions, and simply serve as a thin wrapper around a NumPy array of parameter vectors. Something useful for such a wrapper would be a nice way to specify prediction experiments, e.g. how to tell pyPESTO to "create a new ensemble from the current ensemble that predicts a knockout experiment, by setting this parameter to zero". I guess having the ensemble be a pandas.DataFrame could enable this, e.g.

knockout_ensemble = ensemble.copy()
knockout_ensemble["knocked_out_parameter"] = 0
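A self-contained version of the two lines above, as a sketch only: the ensemble is assumed to be a DataFrame with one row per parameter vector and one column per parameter ID, and the parameter values and the second column name `k_deg` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ensemble: rows are parameter vectors, columns are parameter IDs.
ensemble = pd.DataFrame(
    np.array([[1.0, 0.2], [1.5, 0.3], [0.8, 0.25]]),
    columns=["knocked_out_parameter", "k_deg"],
)

# Derive the knockout ensemble without touching the original.
knockout_ensemble = ensemble.copy()
knockout_ensemble["knocked_out_parameter"] = 0.0
```

Because `DataFrame.copy()` makes a deep copy by default, the original ensemble keeps its fitted values and both ensembles can be simulated side by side.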

re: supported simulators, if PEtab is used to specify the ensemble predictions, then we could use the petab.simulate.PetabSimulator [1] as the base class for the simulator, such that any simulator that implements enough of the PetabSimulator interface can be used. This base class might need some work.

[1] https://github.com/PEtab-dev/libpetab-python/blob/90379c41611ea941b9865ba8dd724b406b7a31ef/petab/simulate.py

dweindl commented 3 months ago

Thanks for starting this discussion. To reduce complexity, I would suggest to first tackle the higher level questions. What is generated from a parameter ensemble or model ensemble? Is there a common structure that can/should be represented in pypesto? Is the current EnsemblePrediction, PredictionResult, PredictionConditionResult what we want? What will be done with that? The question of support for different types of models and simulators, and where which functionality should be implemented would come further down the road for me.

Currently, an Ensemble is only considered an accumulation of vectors, implicitly assuming that general model structure is always the same.

I think this covers the main use case in pypesto already, but once there exists some concept of model in pypesto, it shouldn't be hard to support the more general case. In case of a bigger refactoring, I would preventively rename Ensemble to ParameterEnsemble, so a ModelEnsemble can be introduced once required.
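A rough sketch of how the two concepts could relate, assuming a model is identified by a plain string ID. The class layouts below are hypothetical and are not the current pyPESTO classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ParameterEnsemble:
    """Parameter vectors that all belong to one model structure."""

    parameter_ids: List[str]
    vectors: List[List[float]]  # one inner list per ensemble member

    def __post_init__(self):
        for vector in self.vectors:
            if len(vector) != len(self.parameter_ids):
                raise ValueError("vector length must match parameter_ids")


@dataclass
class ModelEnsemble:
    """One ParameterEnsemble per model, keyed by a (hypothetical) model ID."""

    members: Dict[str, ParameterEnsemble] = field(default_factory=dict)

    def pairs(self) -> List[Tuple[List[float], str]]:
        """Flatten into (parameter vector, model ID) tuples."""
        return [
            (vector, model_id)
            for model_id, ensemble in self.members.items()
            for vector in ensemble.vectors
        ]


model_ensemble = ModelEnsemble(
    members={
        "model_1": ParameterEnsemble(["k1"], [[0.1], [0.2]]),
        "model_2": ParameterEnsemble(["k1", "k2"], [[0.1, 3.0]]),
    }
)
all_pairs = model_ensemble.pairs()
```

With this layout, the existing single-model case stays a plain `ParameterEnsemble`, and the more general (parameter vector, model) definition only appears once several models are involved.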

Is there a downside to coupling it to PEtab?

It wouldn't be usable for any non-PEtab applications. Nevertheless, it might be better to have some easy-to-use functionality coupled to PEtab, than having some practically unusable general concept. In any case, it should be made clear that it is (supposed to be) tied to PEtab.

dilpath commented 3 months ago

What is generated from a parameter ensemble or model ensemble?

I'd be happy to hear more about the use cases for a model ensemble first. If it's the calibrated models from model selection, it might make more sense to move some of this to PEtab Select, e.g. s.t. a PEtab Select model ensemble can be represented by a collection of pyPESTO ParameterEnsembles.

I would preventively rename Ensemble to ParameterEnsemble, so a ModelEnsemble can be introduced once required.

:+1:

PaulJonasJost commented 3 months ago

I'd be happy to hear more about the use cases for a model ensemble first.

The PEtab Select case was what I mainly thought about. I think that moving it (or parts of it) to PEtab Select makes sense, but we should then clarify what we understand by Ensemble, as Daniel mentioned

I would preventively rename Ensemble to ParameterEnsemble, so a ModelEnsemble can be introduced once required.

But in PEtab Select we would probably also need some way to create them? 🤔

What is generated from a parameter ensemble or model ensemble? Is there a common structure that can/should be represented in pypesto?

I really think that a very large portion of predictions boils down to "sbml_id" at given timepoints (which might not coincide with the measurement timepoints) under specific conditions. And I do think there can/should be a structure to represent this in pypesto.

Is the current EnsemblePrediction, PredictionResult, PredictionConditionResult what we want?

Looking at it, I feel like the PredictionResult as a light wrapper is perfectly fine as is, one might be able to condense it by just having this as a dict {condition_id: PredictionConditionId}. Regarding the PredictionConditionResult: This is currently heavily tailored to Amici with the sensitivities, so not entirely sure whether we would even want all the things there.

dilpath commented 3 months ago

Looking at it, I feel like the PredictionResult as a light wrapper is perfectly fine as is, one might be able to condense it by just having this as a dict {condition_id: PredictionConditionId}. Regarding the PredictionConditionResult: This is currently heavily tailored to Amici with the sensitivities, so not entirely sure whether we would even want all the things there.

Is a prediction result as a (PEtab measurements table)-like dataframe sufficient? This would make handling the predictions for e.g. plotting easier than the current implementation, at least. Then extra AMICI/PEtab-specific things can be optional columns.

entity_id    value    [optional] time    [optional] condition_id    [optional] *
species_A    5        2                  cond1                      data1

* model/problem-specific things provided by the simulator, e.g. PEtab dataset ID.

Then an ensemble prediction is one big dataframe with an additional vector_id column, or a list of dataframes.
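As an illustration of the proposed layout, the sketch below builds one such dataframe per parameter vector and stacks them into a single ensemble prediction. The helper `prediction_frame` and all values are invented for the example; only the column names follow the table above.

```python
import pandas as pd


def prediction_frame(vector_id, values_by_time):
    """Build a long-format prediction table for one parameter vector (hypothetical helper)."""
    return pd.DataFrame(
        {
            "entity_id": "species_A",
            "value": list(values_by_time.values()),
            "time": list(values_by_time.keys()),
            "condition_id": "cond1",
            "vector_id": vector_id,
        }
    )


# One dataframe per parameter vector ...
per_vector = [
    prediction_frame(0, {0.0: 5.0, 2.0: 3.0}),
    prediction_frame(1, {0.0: 5.5, 2.0: 2.5}),
]

# ... or one big dataframe with the additional vector_id column.
ensemble_prediction = pd.concat(per_vector, ignore_index=True)

# Selecting everything for one entity and one condition is a single query.
subset = ensemble_prediction.query(
    "entity_id == 'species_A' and condition_id == 'cond1'"
)
```

Simulator-specific extras, such as a PEtab dataset ID, would simply be further optional columns.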

This would make handling the predictions for e.g. plotting much easier than the current implementation. Currently, all data given a specific observable and a specific experiment is retrieved like (summary is EnsemblePrediction.prediction_summary): https://github.com/ICB-DCM/pyPESTO/blob/34e89b3bc88d2052ca808da43c11c57da75bec04/pypesto/visualize/sampling.py#L176-L181

dweindl commented 3 months ago

Looking at it, I feel like the PredictionResult as a light wrapper is perfectly fine as is, one might be able to condense it by just having this as a dict {condition_id: PredictionConditionId}. Regarding the PredictionConditionResult: This is currently heavily tailored to Amici with the sensitivities, so not entirely sure whether we would even want all the things there.

I am not sure if there is much added value in any of those. So far, the main thing is: 1) creating a parameter ensemble, 2) running simulations and collecting some outputs, and 3) computing and visualizing some statistics. The last step is probably most easily done directly with pandas/seaborn once everything is in a properly organized dataframe. (This shouldn't exclude the option of extending the PEtab visualization functionality to allow plotting things like confidence bands based on some PEtab visualization file.)
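The three steps could look roughly like this with plain NumPy and pandas. This is a sketch with a made-up one-parameter decay model and randomly drawn parameter values standing in for a real ensemble and simulator.

```python
import numpy as np
import pandas as pd

# 1) Create a parameter ensemble (here: random decay rates as a placeholder).
rng = np.random.default_rng(0)
k_values = rng.uniform(0.1, 1.0, size=50)

# 2) Run "simulations" and collect the outputs in long format.
timepoints = [0.0, 1.0, 2.0]
predictions = pd.DataFrame(
    [
        {"entity_id": "species_A", "time": t, "vector_id": i, "value": np.exp(-k * t)}
        for i, k in enumerate(k_values)
        for t in timepoints
    ]
)

# 3) Compute statistics: 5%, 50% and 95% quantiles per entity and timepoint.
summary = (
    predictions.groupby(["entity_id", "time"])["value"]
    .quantile([0.05, 0.5, 0.95])
    .unstack()
)
```

The resulting `summary` has one row per (entity, timepoint) and one column per quantile, which is the shape needed to draw a median line with a shaded band.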

Is a prediction result as a (PEtab measurements table)-like dataframe sufficient?

I'd say so.

This would make handling the predictions for e.g. plotting much easier than the current implementation.

Yes.

PaulJonasJost commented 2 months ago

Is a prediction result as a (PEtab measurements table)-like dataframe sufficient?

We would obviously need some way to allow entities other than observables in there. Otherwise, I think you are right: this would make handling visualization much easier.

ICB-DCM / pyPESTO

General Discussion regarding Ensemble #1358