feos-org / feos

FeOs - A Framework for Equations of State and Classical Density Functional Theory

More derivatives at the same time #221

Closed · ianhbell closed 8 months ago

ianhbell commented 8 months ago

Does feos do any optimizations when calculating multiple derivatives from the model at the same time? With the help of @allanleal we have been able to get the complete set of derivatives of order 2 (00, 01, 10, 20, 11, 02, with the indices corresponding to how many derivatives of (1/T) and D, respectively) completed in the equivalent time of 5 calls to the underlying Helmholtz energy function. That's pretty good, I think. But I wonder whether you can do better in feos? And if so, what magic tricks you have applied to do so. This calculation is a pinch point in general flash calculation routines, so any speedup has a direct knock-on effect on user code.

g-bauer commented 8 months ago

We have a simple caching mechanism inside a State object. A State is immutable once created, i.e., N, V, T cannot change.

Every time a property is calculated for a State, we check whether the necessary (partial) derivative is already in the cache. This means that the order in which properties are calculated can have an impact on evaluation speed. For example, when you calculate the second derivative w.r.t. volume, a pressure calculation afterwards is "free". We try to take this into account when we write algorithms or implement new properties.

At the moment, we check the cache for partial derivatives naively - a more efficient lookup could be implemented, but we haven't put any work into that yet.
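For illustration, the mechanism boils down to something like this (a minimal sketch, not FeOs's actual code; the field and method names are made up, and the real implementation can fill several cache entries from a single dual-number evaluation, since the lower-order derivatives come along for free):

use std::cell::RefCell;
use std::collections::HashMap;

// Derivative order of the Helmholtz energy w.r.t. (T, V).
type Order = (usize, usize);

struct State {
    // Interior mutability: the State stays logically immutable
    // while its cache fills up lazily.
    cache: RefCell<HashMap<Order, f64>>,
}

impl State {
    fn helmholtz_derivative(&self, order: Order) -> f64 {
        if let Some(&value) = self.cache.borrow().get(&order) {
            return value; // cache hit: no model evaluation needed
        }
        let value = self.evaluate_model(order); // one dual-number pass
        self.cache.borrow_mut().insert(order, value);
        value
    }

    fn evaluate_model(&self, _order: Order) -> f64 {
        unimplemented!("evaluate the Helmholtz energy with a suitable dual number")
    }
}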

g-bauer commented 8 months ago

we have been able to get the complete set of derivatives of order 2 (00, 01, 10, 20, 11, 02, with the indices corresponding to how many derivatives of (1/T) and D, respectively) completed in the equivalent time of 5 calls to the underlying Helmholtz energy function

In our code, this would entail three calls with different dual numbers, if I understand the above correctly: one call with a HyperDual number to get 11 (and 00, 01, 10 inherently) and two calls with a Dual2 to get 02 and 20, respectively. The cache would not be utilized in these calculations.
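In code, the pattern looks roughly like this (a sketch: the model is written once, generic over num-dual's DualNum trait; the eval_* helpers in the comments are hypothetical placeholders, not the actual num-dual API):

use num_dual::DualNum;

// Stand-in for a real Helmholtz energy, written once for all dual number types.
fn helmholtz<D: DualNum<f64>>(inv_t: D, density: D) -> D {
    inv_t * density.powi(2)
}

// Three evaluations cover all six derivatives up to order 2:
//   1) HyperDual         -> 00, 10, 01 and the mixed 11
//   2) Dual2 in 1/T      -> 00, 10, 20
//   3) Dual2 in density  -> 00, 01, 02
// e.g. (hypothetical helper functions):
//   let (a00, a10, a01, a11) = eval_hyperdual(helmholtz, inv_t, density);
//   let (a00, a10, a20) = eval_dual2_inv_t(helmholtz, inv_t, density);
//   let (a00, a01, a02) = eval_dual2_density(helmholtz, inv_t, density);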

Does that make sense?

prehner commented 8 months ago

If that set of properties is crucial, it should be straightforward to define a struct/class that contains precisely those 6 derivatives. Calculating every derivative in a single pass through the Helmholtz energy should eliminate all redundancies and therefore perform best. However, with larger "dual" numbers, we sometimes encountered sudden performance losses depending on the hardware. At that point, the behavior depends a lot on what optimizations the Rust compiler and LLVM apply to the machine code, something that is pretty much out of our control.
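As a sketch of that idea (illustrative only, not FeOs code; the type and field names are made up): a number type that carries exactly the six derivatives, with the product rule written out once per entry, so a single pass through the Helmholtz energy propagates everything.

use std::ops::Mul;

/// All derivatives up to order 2 of a function of two variables
/// (t for 1/T, d for density), propagated in a single pass.
#[derive(Clone, Copy)]
struct Deriv2 {
    v: f64,   // 00: value
    dt: f64,  // 10
    dd: f64,  // 01
    dtt: f64, // 20
    dtd: f64, // 11
    ddd: f64, // 02
}

impl Mul for Deriv2 {
    type Output = Deriv2;
    // Product rule for each entry; no sub-derivative is evaluated twice.
    fn mul(self, o: Deriv2) -> Deriv2 {
        Deriv2 {
            v: self.v * o.v,
            dt: self.dt * o.v + self.v * o.dt,
            dd: self.dd * o.v + self.v * o.dd,
            dtt: self.dtt * o.v + 2.0 * self.dt * o.dt + self.v * o.dtt,
            dtd: self.dtd * o.v + self.dt * o.dd + self.dd * o.dt + self.v * o.dtd,
            ddd: self.ddd * o.v + 2.0 * self.dd * o.dd + self.v * o.ddd,
        }
    }
}

Addition, exp, ln, etc. get the analogous treatment; the type is essentially a HyperDual and two Dual2s fused together.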

ianhbell commented 8 months ago

It's a similar story in teqp: one call for 11 with autodiff::Dual that gives 00, 01, 10, and 11; another with the autodiff::Real type that gives 00, 01, and 02; and a third with autodiff::Real that gives 00, 10, and 20. So in both our libraries we have similar overlap for the 00, 10, and 01 terms, which are calculated twice (or, in the case of 00, three times). I was just wondering if/how we can do better.

prehner commented 8 months ago

How do you determine a second derivative with autodiff::Real?

ianhbell commented 8 months ago

It's actually an autodiff::Real template. Here is an example: https://github.com/usnistgov/teqp/blob/7a69fc0b78dc74c8acdbeed04a157473d7dabade/include/teqp/derivs.hpp#L147 . Sorry for the C++ cruft, but I suspect it's similar in Rust.

g-bauer commented 8 months ago

Did you check different systems (pure substance, binary mixtures, ternary mixtures, different interactions)? In my experience, they all influence the relative slowdown between different dual number types.

On my desktop CPU, for the system we also use in our paper (CO2 + methane), using the current main branch and running the dual_numbers benchmark results in:

(image: table of benchmark results, where the numbers are evaluation time with a dual number / evaluation time with f64.)

Using the results from our paper, the slowdown is higher: x4.65

ianhbell commented 8 months ago

I haven't been very exhaustive in my analysis, but the relative comparisons seem to be fairly stable from what I looked at, from very simple functions that take nanoseconds to evaluate up to functions that take microseconds (more complex EOS). But it would be a good idea to look into this a bit more. For the fast calculations, how you time these things is a bit of an art, too. I have been quite happy with the microbenchmark facilities in the Catch2 testing framework in C++; I suspect something similarly advanced exists in Rust.
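For reference, the criterion crate is the usual Rust counterpart: it does the same kind of statistical microbenchmarking (samples, means, outlier detection) as Catch2. A minimal sketch, where alphar is just a stand-in function:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn alphar(inv_t: f64, rho: f64) -> f64 {
    // stand-in for a real residual Helmholtz energy evaluation
    (inv_t * rho).exp() - 1.0
}

fn bench(c: &mut Criterion) {
    c.bench_function("alphar", |b| {
        // black_box keeps the optimizer from deleting the call
        b.iter(|| alphar(black_box(1.0 / 300.0), black_box(10.0)))
    });
}

criterion_group!(benches, bench);
criterion_main!(benches);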

ianhbell commented 8 months ago

What EOS are you comparing? I can do the same benchmark on my side.

g-bauer commented 8 months ago

PC-SAFT including the quadrupolar term for CO2.

prehner commented 8 months ago

It's actually an autodiff::Real template. Here is an example: https://github.com/usnistgov/teqp/blob/7a69fc0b78dc74c8acdbeed04a157473d7dabade/include/teqp/derivs.hpp#L147 . Sorry for the C++ cruft, but I suspect it's similar in Rust.

Thanks! If I understand the autodiff source code correctly (I probably don't), I get the feeling that autodiff::Real is closer to what we do (eagerly evaluating every derivative), whereas autodiff::Dual does a lazy evaluation, a bit like reverse-mode AD.

ianhbell commented 8 months ago

I think then the mapping should be:

feos : teqp
dual : autodiff::Real<T>
hyperdual : autodiff::Dual<T>

@allanleal can speak to the details of the library; I am just a user.

g-bauer commented 8 months ago

Just to avoid confusion: when I talked about a Dual2, I meant a dual number with one real and two non-real elements that can be used to efficiently calculate second-order (non-mixed) partial derivatives (source). The corresponding entry in the table of our paper is dual2.

HyperDual is clear, I think. It has 4 entries and is used for mixed second-order partial derivatives.

ianhbell commented 8 months ago

Yes, I think we are all on the same page, but the libraries use slightly different naming.

allanleal commented 8 months ago

Hi everyone (thanks @ianhbell for pinging me). I'll try to quickly summarize some details of autodiff here. The forward-mode number types in autodiff, based on autodiff::Dual<V, G> and autodiff::Real<N, T>, can be assigned any order at compile time, but they are implemented differently and serve different purposes.

If I had more time to work on autodiff, I would:

Here are some convenient type aliases for such numbers:

using real0th = Real<0, double>;
using real1st = Real<1, double>;
using real2nd = Real<2, double>;
using real3rd = Real<3, double>;
using real4th = Real<4, double>;
using dual0th = HigherOrderDual<0, double>;
using dual1st = HigherOrderDual<1, double>;
using dual2nd = HigherOrderDual<2, double>;
using dual3rd = HigherOrderDual<3, double>;
using dual4th = HigherOrderDual<4, double>;

where HigherOrderDual is an auxiliary type trait that generates the necessary recursive dual number (e.g., dual2nd = Dual<Dual<double, double>, Dual<double, double>>).
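For comparison, num-dual's generic types can be nested the same way. A sketch (an assumption on my part, based on the generic Dual<T, F>/DualNum<F> pattern visible in the num-dual snippet quoted later in this thread):

use num_dual::Dual;

// First-order dual number over f64 ...
type Dual1st = Dual<f64, f64>;
// ... nested once to propagate second derivatives, analogous to autodiff's
// dual2nd = Dual<Dual<double, double>, Dual<double, double>>.
type Dual2nd = Dual<Dual1st, f64>;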

autodiff started as an experiment with automatic differentiation algorithms, so that I could decide what would be most appropriate for use in Reaktoro.

It would be interesting to have FeOs and teqp used in Reaktoro (e.g., providing equations of state for fluid phases) when modeling reactive processes with chemical equilibrium and/or chemical kinetics.

Congratulations on the nice work.

ianhbell commented 8 months ago

To add some benchmarking info, I added a test in the code: https://github.com/usnistgov/teqp/commit/efe325966af8ba1f09c305bf6194832c5646ff1e

Results are:

-------------------------------------------------------------------------------
Benchmark CO2 with polar PC-SAFT model
-------------------------------------------------------------------------------
/Users/ihb/Documents/Code/teqp/src/tests/catch_test_SAFTpolar.cxx:248
...............................................................................

benchmark name                       samples       iterations    est run time
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
alphar                                         100            77     1.6401 ms 
                                        213.198 ns     211.78 ns    218.788 ns 
                                        12.0162 ns   0.445025 ns    28.0808 ns 

Ar11                                           100            27      1.647 ms 
                                        624.413 ns    615.893 ns    644.751 ns 
                                         62.967 ns    32.6421 ns    118.971 ns 

Ar02                                           100            36     1.6524 ms 
                                        467.905 ns    457.373 ns    498.483 ns 
                                        81.7244 ns    28.2685 ns    176.604 ns 

Ar20                                           100            30      1.623 ms 
                                        538.251 ns    536.307 ns    545.931 ns 
                                         18.381 ns    1.66004 ns    43.6788 ns 

So while the code is generally a fair bit faster (I'm on a newish Apple M2), the slowdown with PC-SAFT is worse than what I am familiar with from the multi-fluid model (like the one used in GERG-2008). I think the slowdown is similar to what was in the published paper, though that was for normal PC-SAFT:

(image: slowdown table from the paper.)
g-bauer commented 8 months ago

That's interesting. As you can see from our paper, our slowdowns are not as drastic. The current main branch yields even lower slowdowns (dual: 1.1, dual2: 1.3, hyperdual: 1.46, dual3: 1.48).

Do you have an idea why the timings for Ar02 and Ar20 are so different? In my tests, they yield the same timings (within uncertainties). It's not intuitive why different EoS should yield such different slowdowns.

ianhbell commented 8 months ago

Yes, your slowdown is impressive, although the raw speed seems to be a bit lower (but we don't have a fair processor comparison; these M2 chips are very fast). The timing I get with the multi-fluid model for CO2 from Span and Wagner is much more in line with yours:

-------------------------------------------------------------------------------
Benchmark CO2 with Span and Wagner model
-------------------------------------------------------------------------------
/Users/ihb/Documents/Code/teqp/src/tests/catch_test_multifluid.cxx:52
...............................................................................

benchmark name                       samples       iterations    est run time
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
alphar                                         100            49      1.666 ms 
                                        339.753 ns    339.498 ns    340.477 ns 
                                        2.02404 ns   0.848428 ns    4.40505 ns 

Ar11                                           100            22     1.6786 ms 
                                         771.78 ns    762.174 ns    798.313 ns 
                                        73.0253 ns      19.98 ns    149.169 ns 

Ar02                                           100            36     1.6488 ms 
                                        456.528 ns    455.614 ns    460.151 ns 
                                         8.3887 ns    1.26716 ns    19.8124 ns 

Ar20                                           100            38     1.6416 ms 
                                        438.159 ns    433.357 ns    446.533 ns 
                                        31.4698 ns    20.2123 ns    45.1856 ns

so I think a lot of this comes down to implementation details, for instance how many copies of things end up being made rather than elided, and that could well be different for my PC-SAFT and multi-fluid implementations; they are quite different models.

teqp operates with 1/T as the temperature variable, so there are a few reciprocal operations that need to be done and undone with autodiff types, and the timing is slightly different as a result. Perhaps there are some optimizations I could do to improve things there. Now that I think about it, I could calculate the derivatives with T as the independent variable and then do the post-processing to get the right 1/T derivatives. The recipe is in the book by Roland Span. But in the multiparameter case the slowdown of Ar20 is not present, so I dunno.
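For reference, that post-processing is just the chain rule: with $\tau = 1/T$,

$$
\frac{\partial \alpha}{\partial \tau} = -T^2\,\frac{\partial \alpha}{\partial T},
\qquad
\frac{\partial^2 \alpha}{\partial \tau^2}
= T^4\,\frac{\partial^2 \alpha}{\partial T^2} + 2\,T^3\,\frac{\partial \alpha}{\partial T},
$$

so the $T$-derivatives from a single evaluation can be converted to $1/T$-derivatives without additional model evaluations.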

prehner commented 8 months ago

Hi @allanleal, thank you for the detailed explanations. Interesting that you say you started developing autodiff as a tool for Reaktoro; it is a similar story for us with the num-dual package and FeOs.

Combining either teqp or FeOs with Reaktoro sounds like a logical step. Of course, I always like to find applications for our work, and we have discussed adding a C interface on various occasions. But I have to admit that linking the two C++ libraries is probably the more logical step.

In any case, I would be happy to discuss the development of automatic differentiation and thermo/chemistry software. Especially since (if I'm not mistaken) I pretty much work in the building next to your office.

g-bauer commented 8 months ago

The raw speed is difficult to compare - the easiest way is to run both codes on the same machine. But it is interesting to see how differently the same mathematical/numerical methods turn out in practice.

Which timings did you compare your results to? The ones in our paper (from which the above table is copied) are for a binary mixture of methane / CO2 (and those timings don't even hold up to the current main branch anymore).

ianhbell commented 8 months ago

Just to avoid confusion: when I talked about a Dual2, I meant a dual number with one real and two non-real elements that can be used to efficiently calculate second-order (non-mixed) partial derivatives (source). The corresponding entry in the table of our paper is dual2.

Whoa, is this really the entire code in Rust for Dual2? If so, that is very impressive and makes me rethink continuing in C++.

ianhbell commented 8 months ago

And now with the CO2+methane binary mixture with polar PC-SAFT:

Benchmark CO2+methane with polar PC-SAFT model
-------------------------------------------------------------------------------
/Users/ihb/Documents/Code/teqp/src/tests/catch_test_SAFTpolar.cxx:282
...............................................................................

benchmark name                       samples       iterations    est run time
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
alphar                                         100            34      1.632 ms 
                                        480.159 ns    476.972 ns    495.515 ns 
                                        30.5972 ns    1.91783 ns    72.8483 ns 

Ar11                                           100            13      1.638 ms 
                                         1.2582 us    1.25362 us    1.27651 us 
                                        43.1085 ns    4.35259 ns    102.404 ns 

Ar02                                           100            18     1.6668 ms 
                                        934.446 ns    924.701 ns    951.735 ns 
                                        64.4561 ns    41.1593 ns    95.2632 ns 

Ar20                                           100            13     1.6588 ms 
                                         1.2835 us     1.2667 us    1.35932 us 
                                        151.941 ns    4.33096 ns    359.826 ns 

The top-line speed is still good, but the slowdown is quite a lot worse than your results.

I am very impressed with the slowdown that you are able to achieve with the dual types you are using. Do you think there is any way to get that working in C++ with a reasonable amount of effort? Or is there a lot of Rust magic involved?

ianhbell commented 8 months ago

As a postscript, in your benchmark, did you really supply T, p, x_1 as inputs? Or did you provide the density that matches the given pressure?

ianhbell commented 8 months ago

At the end of the day, since the multi-fluid slowdown is generally similar to yours (4.8x for all second derivatives), I suspect that there is additional optimization I can do to the polar PC-SAFT implementation to improve the derivative slowdown. I have to say I am not really sure what that would entail or where to start looking for optimizations, though.

allanleal commented 8 months ago

Especially since (if I'm not mistaken) I pretty much work in the building next to your office.

Hi @prehner, we should definitely meet one of these days. Yes, your building at ETH is just next to mine. Happy to learn more about what you are doing and to tell you a bit about what I do.

Just had a look at num-dual. Questions:

  1. Where are the mathematical functions (e.g., exp, log, pow, sin, cos, etc.) implemented for the dual numbers?
  2. Why are there multiple calls to the clone method below? Is there any performance hit because of this?
  3. How do you avoid temporary HyperHyperDual objects (which could decrease performance)?
impl<'a, 'b, T: DualNum<F>, F: Float> Mul<&'a HyperHyperDual<T, F>> for &'b HyperHyperDual<T, F> {
    type Output = HyperHyperDual<T, F>;
    #[inline]
    fn mul(self, rhs: &HyperHyperDual<T, F>) -> HyperHyperDual<T, F> {
        HyperHyperDual::new(
            self.re.clone() * &rhs.re,
            self.eps1.clone() * &rhs.re + self.re.clone() * &rhs.eps1,
            self.eps2.clone() * &rhs.re + self.re.clone() * &rhs.eps2,
            self.eps3.clone() * &rhs.re + self.re.clone() * &rhs.eps3,
            self.eps1eps2.clone() * &rhs.re
                + self.eps1.clone() * &rhs.eps2
                + self.eps2.clone() * &rhs.eps1
                + self.re.clone() * &rhs.eps1eps2,
            self.eps1eps3.clone() * &rhs.re
                + self.eps1.clone() * &rhs.eps3
                + self.eps3.clone() * &rhs.eps1
                + self.re.clone() * &rhs.eps1eps3,
            self.eps2eps3.clone() * &rhs.re
                + self.eps2.clone() * &rhs.eps3
                + self.eps3.clone() * &rhs.eps2
                + self.re.clone() * &rhs.eps2eps3,
            self.eps1eps2eps3.clone() * &rhs.re
                + self.eps1.clone() * &rhs.eps2eps3
                + self.eps2.clone() * &rhs.eps1eps3
                + self.eps3.clone() * &rhs.eps1eps2
                + self.eps2eps3.clone() * &rhs.eps1
                + self.eps1eps3.clone() * &rhs.eps2
                + self.eps1eps2.clone() * &rhs.eps3
                + self.re.clone() * &rhs.eps1eps2eps3,
        )
    }
}

Code taken from: https://github.com/itt-ustutt/num-dual/blob/63003f25854ee8b7aa3d794ac9aef039f774bbaa/src/hyperhyperdual.rs#L250C1-L278C55

prehner commented 8 months ago

As a postscript, in your benchmark, did you really supply T, p, x_1 as inputs? Or did you provide the density that matches the given pressure?

No, we evaluate the Helmholtz energy and derivatives at a state that corresponds to that pressure. Otherwise, we would really be comparing apples to oranges in this thread (and FeOs would be insanely fast).

@allanleal

  1. The mathematical functions are implemented in https://github.com/itt-ustutt/num-dual/blob/master/src/derivatives.rs. Not the most readable part of the library, to be honest.
  2. This is because we allow for the possibility of using dynamically sized dual numbers (e.g., for derivatives w.r.t. an arbitrary number of variables). Because their size is not known at compile time, Rust does not copy them implicitly. In most cases, self.re, self.eps1, etc., will be simple types (floats), but that is not enforced at this point, so we need the explicit clone. During compilation, all the types are known, and Rust should be smart enough to optimize all of that away.
  3. I'm not sure what you mean by that. We do evaluate the derivatives eagerly, so every operation creates new HyperHyperDual objects. There might be some room for optimization there, but, again, at that point the effect of code changes on performance is difficult to predict due to all the layers of optimization that the Rust compiler and LLVM apply.
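To illustrate point 2 with a stripped-down sketch (not the actual num-dual code): the same generic product rule has to work both for Copy scalars and for heap-backed coefficient types, so the clone must be explicit; for T = f64 it compiles down to a plain copy.

use std::ops::{Add, Mul};

// A first-order dual number, generic over the coefficient type T.
#[derive(Clone)]
struct Dual<T> {
    re: T,
    eps: T,
}

impl<'a, 'b, T> Mul<&'a Dual<T>> for &'b Dual<T>
where
    T: Clone + Add<Output = T> + Mul<Output = T>,
{
    type Output = Dual<T>;
    #[inline]
    fn mul(self, rhs: &'a Dual<T>) -> Dual<T> {
        Dual {
            // For T = f64 these clones are bitwise copies that LLVM removes;
            // for a T that owns heap memory (a dynamically sized gradient),
            // they are real allocations, which is why they must be explicit.
            re: self.re.clone() * rhs.re.clone(),
            eps: self.eps.clone() * rhs.re.clone() + self.re.clone() * rhs.eps.clone(),
        }
    }
}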
ianhbell commented 8 months ago

As a postscript, in your benchmark, did you really supply T, p, x_1 as inputs? Or did you provide the density that matches the given pressure?

No, we evaluate the Helmholtz energy and derivatives at a state that corresponds to that pressure. Otherwise, we would really be comparing apples to oranges in this thread (and FeOs would be insanely fast).

Can you point me to the place in the repo where this benchmark is?

prehner commented 8 months ago

https://github.com/feos-org/feos/blob/main/benches/dual_numbers.rs#L107

g-bauer commented 8 months ago

Here are timings for pure CO2 on an M2 as well:

(image: FeOs benchmark timings on the M2.)
ianhbell commented 8 months ago

Thanks, we are all on the same page; I just wanted to be sure. Is there overhead in Rust from the units, or is that compiled away? Units are one of the things I would like to do better.

ianhbell commented 8 months ago

Ok, so it looks like I have my work cut out for me in optimizing my PC-SAFT implementation. The top-line speed seems to be a hair better in teqp on the same processor, but the derivatives are comparatively worse. I fear lots of copies are being made all over the place in the PC-SAFT code, which penalizes the derivatives more than the base evaluation.

prehner commented 8 months ago

Thanks, we are all on the same page; I just wanted to be sure. Is there overhead in Rust from the units, or is that compiled away? Units are one of the things I would like to do better.

We initially had quantities in the form of structs (classes) that carried the units as additional fields. Now the units are handled by the compiler, i.e., incorrect units are already detected at compile time, and there should be no performance penalty. We use units in algorithms like phase equilibria and of course in the UI, but internally, for the evaluation of models, everything is converted to reduced units. For more complicated models (like PC-SAFT) the evaluation of the Helmholtz energy should dominate the performance, but still, there might be instances where unit conversions could be more efficient if they were hard-coded. In any case, we gladly accept the minuscule performance hit for a robust, error-proof implementation of physical quantities.
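The general idea, as a minimal sketch with phantom-type unit markers (not FeOs's actual quantity implementation):

use std::marker::PhantomData;
use std::ops::Mul;

// Zero-sized unit markers that exist only at compile time.
struct Meter;
struct Second;

// A quantity is a bare f64 plus a phantom unit: same size as an f64,
// so the unit checking costs nothing at runtime.
struct Quantity<U>(f64, PhantomData<U>);

// The unit of a product is tracked in the type.
impl<U1, U2> Mul<Quantity<U2>> for Quantity<U1> {
    type Output = Quantity<(U1, U2)>;
    fn mul(self, rhs: Quantity<U2>) -> Self::Output {
        Quantity(self.0 * rhs.0, PhantomData)
    }
}

fn main() {
    let length = Quantity::<Meter>(2.0, PhantomData);
    let time = Quantity::<Second>(3.0, PhantomData);
    // The result is typed as Meter times Second; assigning it to, say,
    // Quantity<(Meter, Meter)> would fail to compile.
    let _meter_seconds: Quantity<(Meter, Second)> = length * time;
}

Real unit crates encode full dimension vectors instead of tuples, but the zero-overhead principle is the same.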

ianhbell commented 8 months ago

My solution is naive: all units are always (ok, almost always) base SI: m, Pa, kg, K, etc. I think the core library can stay that way, but the outer code should support units for people who want different units. So I am looking for good solutions that are still performant and user-friendly, and don't muck up the code too much.

ianhbell commented 8 months ago

Does Rust handle units itself, or is that an external package?

g-bauer commented 8 months ago

There are user-written crates (i.e., packages) that provide support for working with units (e.g., uom). We use our own implementation because we have some special needs (working with arrays, using reference and reduced properties, interfacing with Python).

g-bauer commented 8 months ago

@ianhbell I think the questions raised in this thread have been answered? If so, I'd close the issue. We can re-open it or create a new one if more questions come up.

ianhbell commented 8 months ago

Yes, fine by me - it is instructive to learn from each other and see where further optimizations are possible.