New design for `DataSet` and `Estimator`

g-bauer commented 2 years ago

The prediction of a target property for a given model and a set of thermodynamic states is a problem that can easily be parallelized. Our current DataSet trait's predict method is very flexible in that a user can freely decide how to iterate over data and possibly calculate properties that are needed during that iteration. The downside of the current design is that we cannot implement the parallelization in a generic way.

This issue discusses ways to change the DataSet trait so that a generic parallel evaluation of data points can be provided.

In- and ouput of the prediction of a single data point

We could add a method into the trait that returns the prediction for a single data point.

// in trait DataSet
fn predict_datapoint(&self, eos: &Arc<E>, input: SINumber) -> Result<SINumber, EstimatorError>;

The prediction method for all data can then loop over this method. The same method could be reused for a parallel and a sequential prediction method.

However, different properties require different inputs both in terms of the number of inputs as well as the data type(s). E.g.

vapor pressure
- Input: temperature as SINumber
- Output: pressure as SINumber
binary vle using chemical potential
- Input: x, y as f64, t, p as SINumber
- Output: delta in chemical potentials for both species (2 SINumber)

The above predict_datapoint method would work for the vapor pressure but not for the chemical potential.

Option 1: Unify in- and output of the `prediction` method for a data point

We could get rid of the unit in the in- and output and be generic in the length of in- and output data.

Implementation (click)

```rust pub trait DataSet: Send + Sync { /// This function is called prior to calculation of predictions. /// /// Can be used to calculate an internal state needed for the predictions, /// such as critical temperature or Antoine Parameters for vapor pressure /// extrapolation. fn prepare(&mut self, eos: &Arc) -> Result<(), EstimatorError>; /// Prediction of the target given the model and a single data point. fn predict_datapoint(&self, eos: &Arc, input: &[f64]) -> Vec>; /// Return an iterator over input elements. fn input(&self) -> &[Vec]; /// Prediction of the target given the model for all stored data points. fn predict(&mut self, eos: &Arc) -> Result, EstimatorError> { self.prepare(eos)?; self.input() .par_iter() .flat_map(|input| self.predict_datapoint(eos, input)) .collect() } } ```

Advantage:
- relative_difference (and thus the cost function) can be generically implemented since the result of predict is just a Vec<f64>.
- DataSets can be stored in a Vec<Arc<dyn DataSet<E>>> in the Estimator which then can be used as is.
- Return values of predict are always the same which is nice to check/plot how a model's parameters perform.
Disadvantage: We have non-intuitive data types in the interfaces and (possibly) lower performance since lots of Vecs have to be allocated during the iteration (most of the time with a single value)

Option 2: Add associated types to the `DataSet` trait

The interface can be written in a nicer way using associated types and constants. However, the trait is no longer object safe. We'd have to build an enum with all possible implementors of DataSet so that those can be used within an Estimator inside a Vec.

Implementation (click)

```rust pub trait DataSet: Send + Sync { type Input: Send + Sync; // data type of input type Target: Send + Sync; // data type of output fn target(&self) -> &[Self::Target]; /// This function is called prior to calculation of predictions. /// /// Can be used to calculate an internal state needed for the predictions, /// such as critical temperature or Antoine Parameters for vapor pressure /// extrapolation. fn prepare(&mut self, eos: &Arc) -> Result<(), EstimatorError>; /// Prediction of the target given the model and a single data point. fn predict_datapoint( &self, eos: &Arc, input: &Self::Input, ) -> Result; /// Return an iterator over input elements. fn input(&self) -> &[Self::Input]; /// Calculate relative difference fn relative_difference_datapoint( &self, eos: &Arc, input: &Self::Input, target: &Self::Target, ) -> Result<[f64; NTARGET], EstimatorError>; /// Prediction of the target given the model for all stored data points. fn predict(&mut self, eos: &Arc) -> Result, EstimatorError> { self.prepare(eos)?; self.input() .par_iter() .map(|input| self.predict_datapoint(eos, input)) .collect() } fn relative_difference(&mut self, eos: &Arc) -> Result, EstimatorError> { self.prepare(eos)?; // can be done without collecting twice with par_extend? let rel_dif = self .input() .par_iter() .zip(self.target()) .map(|(input, target)| self.relative_difference_datapoint(eos, input, target)) .collect::, EstimatorError>>()?; Ok(rel_dif.par_iter().cloned().flatten().collect()) } } ```

Advantage:
- Implementation is convenient because we can use dimensioned types.
Disadvantage:
- The user has to implement how the relative difference between a prediction and a target can be calculated.
- No longer object safe
- Requires new design for Estimator.

An implementation for the vapor pressure could look like this:

Implementation (click)

pub struct VaporPressure {
    pub target: Vec<SINumber>,
    temperature: Vec<SINumber>,
    max_temperature: SINumber,
    antoine: Option<Antoine>,
    solver_options: SolverOptions,
}

impl VaporPressure {
    /// Create a new data set for vapor pressure.
    ///
    /// If the equation of state fails to compute the vapor pressure
    /// (e.g. when it underestimates the critical point) the vapor
    /// pressure can be estimated.
    /// If `extrapolate` is `true`, the vapor pressure is estimated by
    /// calculating the slope of ln(p) over 1/T.
    /// If `extrapolate` is `false`, it is set to `NAN`.
    pub fn new(
        target: SIArray1,
        temperature: SIArray1,
        extrapolate: bool,
        solver_options: Option<SolverOptions>,
    ) -> Result<Self, EstimatorError> {
        let max_temperature = temperature
            .to_reduced(KELVIN)?
            .into_iter()
            .reduce(|a, b| a.max(b))
            .unwrap()
            * KELVIN;

        let antoine = if extrapolate {
            Some(Antoine::default())
        } else {
            None
        };

        Ok(Self {
            target: target.into_iter().collect(),
            temperature: temperature.into_iter().collect(),
            max_temperature,
            antoine,
            solver_options: solver_options.unwrap_or_default(),
        })
    }
}

impl<E: EquationOfState> DataSet<E, 1> for VaporPressure {
    type Input = SINumber;
    type Target = SINumber;

    fn target(&self) -> &[Self::Target] {
        &self.target
    }

    fn prepare(&mut self, eos: &Arc<E>) -> Result<(), EstimatorError> {
        if self.antoine.is_some() {
            let critical_point = State::critical_point(
                eos,
                None,
                Some(self.max_temperature * KELVIN),
                self.solver_options,
            )?;
            let tc = critical_point.temperature;
            let pc = critical_point.pressure(Contributions::Total);

            let t0 = 0.9 * tc;
            let p0 = PhaseEquilibrium::pure(eos, t0, None, self.solver_options)?
                .vapor()
                .pressure(Contributions::Total);
            self.antoine = Some(Antoine::new(pc, p0, tc, t0)?);
        }
        Ok(())
    }

    fn input(&self) -> &[Self::Input] {
        &self.temperature
    }

    fn predict_datapoint(
        &self,
        eos: &Arc<E>,
        input: &Self::Input,
    ) -> Result<Self::Target, EstimatorError> {
        let vle = PhaseEquilibrium::pure(eos, *input, None, self.solver_options);
        if let Ok(vle) = vle {
            Ok(vle.vapor().pressure(Contributions::Total))
        } else if let Some(antoine) = &self.antoine {
            antoine.pressure(*input)
        } else {
            Ok(f64::NAN * PASCAL)
        }
    }

    fn relative_difference_datapoint(
        &self,
        eos: &Arc<E>,
        input: &Self::Input,
        target: &Self::Target,
    ) -> Result<[f64; 1], EstimatorError> {
        self.predict_datapoint(eos, input)
            .and_then(|prediction| Ok([((target - prediction) / target).into_value()?]))
    }
}

For more complex predictions, e.g. BinaryVleChemicalPotential this could look like so:

click

```rust pub struct BinaryVleChemicalPotential { input: Vec<(f64, f64, SINumber, SINumber)>, target: Vec<(SINumber, SINumber)>, } impl BinaryVleChemicalPotential { pub fn new( temperature: SIArray1, pressure: SIArray1, liquid_molefracs: Array1, vapor_molefracs: Array1, ) -> Self { let target: Vec<(SINumber, SINumber)> = (0..temperature.len()) .map(|_| (500.0 * JOULE / MOL, 500.0 * JOULE / MOL)) .collect(); let mut input = Vec::new(); for (((&xi, &yi), t), p) in liquid_molefracs .iter() .zip(vapor_molefracs.iter()) .zip(temperature.into_iter()) .zip(pressure.into_iter()) { input.push((xi, yi, t, p)); } assert_eq!(input.len(), target.len()); Self { input, target } } } impl DataSet for BinaryVleChemicalPotential { type Input = (f64, f64, SINumber, SINumber); type Target = (SINumber, SINumber); fn target(&self) -> &[Self::Target] { &self.target } fn input(&self) -> &[Self::Input] { &self.input } #[inline] fn prepare(&mut self, eos: &Arc) -> Result<(), EstimatorError> { Ok(()) } fn predict_datapoint( &self, eos: &Arc, input: &Self::Input, ) -> Result { let xi = input.0; let yi = input.1; let t = input.2; let p = input.3; let liquid_moles = arr1(&[xi, 1.0 - xi]) * MOL; let liquid = State::new_npt(eos, t, p, &liquid_moles, DensityInitialization::Liquid)?; let mu_liquid = liquid.chemical_potential(Contributions::Total); let vapor_moles = arr1(&[yi, 1.0 - yi]) * MOL; let vapor = State::new_npt(eos, t, p, &vapor_moles, DensityInitialization::Vapor)?; let mu_vapor = vapor.chemical_potential(Contributions::Total); Ok(( mu_liquid.get(0) - mu_vapor.get(0) + 500.0 * JOULE / MOL, mu_liquid.get(1) - mu_vapor.get(1) + 500.0 * JOULE / MOL, )) } fn relative_difference_datapoint( &self, eos: &Arc, input: &Self::Input, target: &Self::Target, ) -> Result<[f64; 2], EstimatorError> { self.predict_datapoint(eos, input).and_then(|prediction| { Ok([ ((target.0 - prediction.0) / target.0).into_value()?, ((target.1 - prediction.1) / target.1).into_value()?, ]) }) } } ```

Option 3: Add parallel version of `predict` leaving the implementation to the user

Advantage:
- requires the minimal amount of change to the current DataSet trait
- flexible
- Estimator stays almost the same
Disadvantage:
- the user has to implement the parallel iterators
- code has to be written twice

prehner commented 2 years ago

Very nice overview!

Two quick comments without having gone through all the code:

If having a mixture of SINumber and f64 is an issue, we can circumvent that by multiplying, e.g., with MOL/MOL. A bit hacky but could be worth it.
Personally, I wouldn't mind having an enum with DataSets. We "established" the approach with the equations of state and functionals, and in Python it is already basically treated as enum. But having different values for the associated types would break that concept, wouldn't it?

g-bauer commented 2 years ago

But having different values for the associated types would break that concept, wouldn't it?

Mh not sure if I understand correctly but we could do essentially what we currently do with EosVariant:

define an enum, say DataSetVariant, that contains all possible DataSet implementations
~~implement DataSet for that enum~~it's not object safe
possibly use a derive macro to do that?

Or we define the enum without implementing the trait and just write the match statements within the methods of the Estimator that stores a Vec<DataSetVariant>. ~~Both should work.~~

g-bauer commented 2 years ago

If having a mixture of SINumber and f64 is an issue, we can circumvent that by multiplying, e.g., with MOL/MOL. A bit hacky but could be worth it.

That is part of the problem IMO. The other problem is that the number of return values for a single predict call can be different, i.e. one value for vapor pressure but two values for the chemical potentials when binary VLE's are considered.

feos-org / feos