We have a simple caching mechanism inside a State object. A State is immutable once created, i.e., N, V, T cannot change. Every time a property is calculated for a State, we check whether the necessary (partial) derivative is already in the cache. This means that the order in which properties are calculated can have an impact on evaluation speed. For example, when you calculate the second derivative w.r.t. volume, a pressure calculation afterwards is "free". We try to take this into account when we write algorithms or implement new properties.
At the moment, we naively check partial derivatives in the cache - a more effective lookup may be possible, but we haven't put any work into that yet.
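As a very rough sketch of that idea (hypothetical field and method names, not the actual FeOs State implementation):

use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical sketch: a derivative is identified by how often the Helmholtz
// energy was differentiated w.r.t. T, V and N, and each one is computed at
// most once per (immutable) state.
struct State {
    temperature: f64,
    volume: f64,
    moles: f64,
    cache: RefCell<HashMap<(u8, u8, u8), f64>>,
}

impl State {
    fn get_or_evaluate(&self, key: (u8, u8, u8), eval: impl FnOnce() -> f64) -> f64 {
        // Interior mutability so that property getters can take &self.
        // Note: eval must not itself access the cache while it is mutably borrowed.
        *self.cache.borrow_mut().entry(key).or_insert_with(eval)
    }

    fn pressure(&self) -> f64 {
        // p = -dA/dV: reuses the cached derivative if a previous property
        // calculation already needed it.
        -self.get_or_evaluate((0, 1, 0), || self.evaluate_derivative(0, 1, 0))
    }

    fn evaluate_derivative(&self, _t: u8, _v: u8, _n: u8) -> f64 {
        // Placeholder for the actual dual-number evaluation of the model
        // at self.temperature, self.volume, self.moles.
        unimplemented!()
    }
}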
we have been able to get the complete set of derivatives of order 2 (00, 01, 10, 02, 11, 20, with the indices corresponding to how many derivatives of (1/T) and D respectively) completed in the equivalent time of 5 calls to the underlying Helmholtz energy function
In our code, this would entail three calls with different dual numbers, if I understand the above correctly: once with a HyperDual number to get 11 (and 00, 01, 10 inherently) and two calls with a Dual2 to get 02 and 20, respectively. The cache would not be utilized in these calculations.
Does that make sense?
If that set of properties is crucial, it should be straightforward to define a struct/class that contains precisely those 6 derivatives. Calculating every derivative in a single pass of the Helmholtz energy should eliminate all redundancies and therefore perform best. However, with larger "dual" numbers, we sometimes encountered sudden performance losses depending on the hardware. At that point, the behavior depends a lot on what optimizations the Rust compiler and LLVM do to the machine code, something that is pretty much out of our control.
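As an illustration of what such a "flat" struct could look like (this is not code from either library, just the multivariate product rule written out for two variables):

// All six derivatives up to second order w.r.t. two variables (e.g. 1/T and
// density), carried through every operation in a single pass. Only
// multiplication is shown; +, -, /, exp, ln, ... would follow the same pattern.
#[derive(Clone, Copy, Debug)]
struct Flat2 {
    v00: f64, // f
    v10: f64, // df/dx
    v01: f64, // df/dy
    v20: f64, // d2f/dx2
    v11: f64, // d2f/dxdy
    v02: f64, // d2f/dy2
}

impl Flat2 {
    // Seed the two independent variables.
    fn variables(x: f64, y: f64) -> (Self, Self) {
        (
            Flat2 { v00: x, v10: 1.0, v01: 0.0, v20: 0.0, v11: 0.0, v02: 0.0 },
            Flat2 { v00: y, v10: 0.0, v01: 1.0, v20: 0.0, v11: 0.0, v02: 0.0 },
        )
    }
}

impl std::ops::Mul for Flat2 {
    type Output = Flat2;
    fn mul(self, r: Flat2) -> Flat2 {
        Flat2 {
            v00: self.v00 * r.v00,
            v10: self.v10 * r.v00 + self.v00 * r.v10,
            v01: self.v01 * r.v00 + self.v00 * r.v01,
            v20: self.v20 * r.v00 + 2.0 * self.v10 * r.v10 + self.v00 * r.v20,
            v11: self.v11 * r.v00 + self.v10 * r.v01 + self.v01 * r.v10 + self.v00 * r.v11,
            v02: self.v02 * r.v00 + 2.0 * self.v01 * r.v01 + self.v00 * r.v02,
        }
    }
}

fn main() {
    let (x, y) = Flat2::variables(2.0, 3.0);
    // f = x*y^2: f_x = y^2 = 9, f_y = 2xy = 12, f_xy = 2y = 6, f_yy = 2x = 4
    let f = x * y * y;
    println!("{f:?}");
}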
It's a similar story in teqp. One call for 11 with autodiff::Dual that gives 00, 01, 10, and 11. Another with the autodiff::Real type that gives you 00, 01, and 02, and a third with autodiff::Real that does 00, 10, 20. So in both our libraries we have similar overlap: the 00, 10, and 01 terms are calculated twice (or three times in the case of 00). I was just wondering if/how we can do better.
How do you determine a second derivative with autodiff::Real?
It's actually an autodiff::Real template. Here is an example: https://github.com/usnistgov/teqp/blob/7a69fc0b78dc74c8acdbeed04a157473d7dabade/include/teqp/derivs.hpp#L147 . Sorry for the C++ cruft, but I suspect it looks similar in Rust.
Did you check different systems (pure substance, binary mixtures, ternary mixtures, different interactions)? In my experience, they all influence the relative slowdown between different dual number types.
On my desktop CPU, for the system we also use in our paper (CO2 + methane), using the current main branch and running the dual_numbers benchmark results in:
where the numbers are evaluation time with dual number / evaluation time with f64.
Using the results from our paper, the slowdown is higher: x4.65
I haven't been very exhaustive in my analysis, but the relative comparisons seem to be relatively stable from what I looked at, from very simple functions that take nanoseconds to evaluate up to functions that take microseconds to evaluate (more complex EOS). But it would be a good idea to look into this a bit more. For the fast calculations, how you time these things is a bit of an art too. I have been quite happy with the microbenchmark facilities in the Catch2 testing framework in C++, I suspect something similarly advanced exists in Rust.
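For what it's worth, the criterion crate plays that role in Rust; a minimal sketch, with a made-up helmholtz_energy function standing in for the model:

use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Hypothetical model function; in practice this would be the reduced
// Helmholtz energy evaluated with f64 or with a dual number type.
fn helmholtz_energy(temperature: f64, density: f64) -> f64 {
    -temperature * density.ln()
}

fn bench_alphar(c: &mut Criterion) {
    // criterion handles warm-up, sample counts and outlier statistics itself.
    c.bench_function("alphar_f64", |b| {
        b.iter(|| helmholtz_energy(black_box(300.0), black_box(1e-3)))
    });
}

criterion_group!(benches, bench_alphar);
criterion_main!(benches);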
What EOS are you comparing? I can do the same benchmark on my side.
PC-SAFT including the quadrupolar term for CO2.
It's actually an autodiff::Real template. Here is an example: https://github.com/usnistgov/teqp/blob/7a69fc0b78dc74c8acdbeed04a157473d7dabade/include/teqp/derivs.hpp#L147 . Sorry for the C++ cruft, but I suspect it looks similar in Rust.
Thanks, if I understand the autodiff source code correctly (I probably don't), I get the feeling that autodiff::Real is closer to what we do (eagerly evaluate every derivative), whereas autodiff::dual does a lazy evaluation, a bit like reverse-mode AD.
I think then the mapping should be:
feos : teqp
dual : autodiff::Real<T>
hyperdual : autodiff::Dual<T>
@allanleal can speak to the details of the library, I am just a user
Just to avoid confusion: when I talked about a Dual2, I meant a dual number with one real and two non-real elements that can be used to efficiently calculate second-order (non-mixed) partial derivatives (source). The corresponding entry in the table of our paper is dual2.
HyperDual is clear, I think. It has 4 entries and is used for mixed second-order partial derivatives.
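To make that concrete, here is a minimal, non-generic sketch of such a number; the Dual2 in num-dual follows the same algebra but, like the HyperHyperDual code further down in this thread, is generic over the inner type:

// One real and two non-real elements: value, first and second derivative
// with respect to a single variable. Only multiplication is shown.
#[derive(Clone, Copy, Debug)]
struct Dual2 {
    re: f64, // f(x)
    v1: f64, // f'(x)
    v2: f64, // f''(x)
}

impl Dual2 {
    // Seed an independent variable: first derivative 1, second derivative 0.
    fn variable(x: f64) -> Self {
        Dual2 { re: x, v1: 1.0, v2: 0.0 }
    }
}

impl std::ops::Mul for Dual2 {
    type Output = Dual2;
    fn mul(self, rhs: Dual2) -> Dual2 {
        Dual2 {
            re: self.re * rhs.re,
            v1: self.re * rhs.v1 + self.v1 * rhs.re,
            // product rule applied twice
            v2: self.re * rhs.v2 + 2.0 * self.v1 * rhs.v1 + self.v2 * rhs.re,
        }
    }
}

fn main() {
    let x = Dual2::variable(3.0);
    let y = x * x * x; // x^3 at x = 3: value 27, first derivative 27, second derivative 18
    println!("{} {} {}", y.re, y.v1, y.v2);
}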
Yes, I think we are all on the same page, but the libraries use slightly different naming
Hi everyone (thanks @ianhbell for pinging me). I'll try to quickly summarize some details of autodiff here. The forward-mode number types in autodiff based on autodiff::Dual<V, G> and autodiff::Real<N, T> can be assigned any order at compile time, but they are implemented differently and serve different purposes.
autodiff::Dual<V, G> is used to create recursive dual numbers and can calculate all higher-order derivatives up to the assigned number order, where V and G are the types of the underlying value and gradient of the number. All operations with these dual numbers generate a template expression that is lazily evaluated and optimized at compile time whenever possible.
autodiff::Real<N, T> was implemented specifically for higher-order directional derivatives. Given an order N, operations with these numbers will produce directional derivatives up to that order. There is no recursion here. The implementation relies heavily on template metaprogramming to generate, at compile time, the code needed to calculate derivatives in each arithmetic and mathematical operation.
If I had more time to work on autodiff, I would:
- implement autodiff::Dual in a non-recursive approach, similar to autodiff::Real; and
- introduce lazily evaluated template expressions in autodiff::Real, as in autodiff::Dual, to squeeze out further performance.
Here are some convenient type aliases for such numbers:
using real0th = Real<0, double>;
using real1st = Real<1, double>;
using real2nd = Real<2, double>;
using real3rd = Real<3, double>;
using real4th = Real<4, double>;
using dual0th = HigherOrderDual<0, double>;
using dual1st = HigherOrderDual<1, double>;
using dual2nd = HigherOrderDual<2, double>;
using dual3rd = HigherOrderDual<3, double>;
using dual4th = HigherOrderDual<4, double>;
where HigherOrderDual is an auxiliary type trait that generates the necessary recursive dual number (e.g., dual2nd = Dual<Dual<double, double>, Dual<double, double>>).
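In Rust terms, the same recursive construction corresponds to nesting a dual number that is generic over its components; a minimal illustrative sketch (not the num-dual or autodiff implementation):

use std::ops::{Add, Mul};

// A first-order dual number that is generic over its component type, so it can
// be nested: Dual<f64> gives first derivatives, Dual<Dual<f64>> second derivatives, etc.
#[derive(Clone, Copy, Debug)]
struct Dual<T> {
    re: T,
    eps: T,
}

impl<T: Copy + Add<Output = T>> Add for Dual<T> {
    type Output = Dual<T>;
    fn add(self, rhs: Self) -> Self {
        Dual { re: self.re + rhs.re, eps: self.eps + rhs.eps }
    }
}

impl<T: Copy + Add<Output = T> + Mul<Output = T>> Mul for Dual<T> {
    type Output = Dual<T>;
    fn mul(self, rhs: Self) -> Self {
        Dual {
            re: self.re * rhs.re,
            eps: self.re * rhs.eps + self.eps * rhs.re,
        }
    }
}

fn main() {
    // Seed x = 3 for a second derivative: the inner eps tracks d/dx,
    // the outer eps tracks another d/dx on top of it.
    let x: Dual<Dual<f64>> = Dual {
        re: Dual { re: 3.0, eps: 1.0 },
        eps: Dual { re: 1.0, eps: 0.0 },
    };
    let y = x * x; // x^2 at x = 3: value 9, first derivative 6, second derivative 2
    println!("{} {} {}", y.re.re, y.re.eps, y.eps.eps);
}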
autodiff started as an experiment for automatic differentiation algorithms so that I could decide what would be more appropriate for use in Reaktoro.
It would be interesting to have FeOs and teqp used in Reaktoro (e.g., providing equations of state for fluid phases) when modeling reactive processes with chemical equilibrium and/or chemical kinetics.
Congratulations on the nice work.
To add some benchmarking info, I added a test in the code: https://github.com/usnistgov/teqp/commit/efe325966af8ba1f09c305bf6194832c5646ff1e
Results are:
-------------------------------------------------------------------------------
Benchmark CO2 with polar PC-SAFT model
-------------------------------------------------------------------------------
/Users/ihb/Documents/Code/teqp/src/tests/catch_test_SAFTpolar.cxx:248
...............................................................................
benchmark name samples iterations est run time
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
alphar 100 77 1.6401 ms
213.198 ns 211.78 ns 218.788 ns
12.0162 ns 0.445025 ns 28.0808 ns
Ar11 100 27 1.647 ms
624.413 ns 615.893 ns 644.751 ns
62.967 ns 32.6421 ns 118.971 ns
Ar02 100 36 1.6524 ms
467.905 ns 457.373 ns 498.483 ns
81.7244 ns 28.2685 ns 176.604 ns
Ar20 100 30 1.623 ms
538.251 ns 536.307 ns 545.931 ns
18.381 ns 1.66004 ns 43.6788 ns
So while the code is generally a fair bit faster (I'm on a newish Apple M2), the slowdown with PC-SAFT is worse than what I am familiar with from the multi-fluid model (like what is used in GERG-2008). I think the slowdown is similar to what was in the published paper, though that was for normal PC-SAFT:
That's interesting. As you can see from our paper, our slowdowns are not as drastic. The current main branch yields even lower slowdowns (dual: 1.1, dual2: 1.3, hyperdual: 1.46, dual3: 1.48).
Do you have an idea why the timings for Ar02 and Ar20 are so different? In my tests they yield the same timings (within uncertainties). It's also not intuitive why different EoS (should) yield such different slowdowns.
Yes, your slowdown is impressive, although the raw speed seems to be a bit lower (but we don't have a fair processor comparison, these M2 chips are very fast). The timing I have with the multi-fluid model for CO2 from Span and Wagner is much more in line with your timing:
-------------------------------------------------------------------------------
Benchmark CO2 with Span and Wagner model
-------------------------------------------------------------------------------
/Users/ihb/Documents/Code/teqp/src/tests/catch_test_multifluid.cxx:52
...............................................................................
benchmark name samples iterations est run time
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
alphar 100 49 1.666 ms
339.753 ns 339.498 ns 340.477 ns
2.02404 ns 0.848428 ns 4.40505 ns
Ar11 100 22 1.6786 ms
771.78 ns 762.174 ns 798.313 ns
73.0253 ns 19.98 ns 149.169 ns
Ar02 100 36 1.6488 ms
456.528 ns 455.614 ns 460.151 ns
8.3887 ns 1.26716 ns 19.8124 ns
Ar20 100 38 1.6416 ms
438.159 ns 433.357 ns 446.533 ns
31.4698 ns 20.2123 ns 45.1856 ns
so I think a lot of this comes down to implementation details, for instance how many copies end up getting made rather than elided, and that could well be different between my PC-SAFT and multi-fluid implementations; they are quite different models.
teqp operates with 1/T as the temperature variable, and so there are a few reciprocal operations that need to be done and un-done with autodiff types, and the timing is slightly different as a result. Perhaps there are some optimizations I could do to improve things there. Now that I think about it, I could calculate the derivatives with T as the independent variable and then do the post-processing to get the right 1/T derivatives. The recipe is in the book from Roland Span. But in the multiparameter case the slowdown of Ar20 is not present, so I dunno.
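For reference, the chain rule for $\tau = 1/T$ gives the standard conversion (nothing specific to teqp):
$\partial \alpha / \partial \tau = -T^2 \, \partial \alpha / \partial T$ and $\partial^2 \alpha / \partial \tau^2 = T^4 \, \partial^2 \alpha / \partial T^2 + 2 T^3 \, \partial \alpha / \partial T$,
so the Ar10 and Ar20 derivatives in $1/T$ could be assembled from plain $T$ derivatives after the fact.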
Hi @allanleal, thank you for the detailed explanations. Interesting that you say you started developing autodiff as a tool for Reaktoro; it is a similar story for us with the num-dual package and FeOs.
Combining either teqp or FeOs with Reaktoro sounds like a logical step. Of course, I always like to find applications for our work, and we have discussed adding a C interface on various occasions. But I guess I have to admit that linking the two C++ libraries is probably the more logical step.
In any case, I would be happy to discuss the development of automatic differentiation and thermo/chemistry software. Especially since (if I'm not mistaken) I pretty much work in the building next to your office.
The raw speed is difficult to compare - the easiest way is to run both codes on the same machine. But it is interesting to see how different the same mathematical/numerical methods turn out in practice.
Which timing did you compare your results to? The ones in our paper (from which the above table is copied) are for a binary mixture of methane / CO2 (and those timings don't even hold up to the current main branch).
Just to avoid confusion: when I talked about a Dual2, I meant a dual number with one real and two non-real elements that can be used to efficiently calculate second-order (non-mixed) partial derivatives (source). The corresponding entry in the table of our paper is dual2.
Whoah, is this really the entire code in Rust for Dual2? If so, that is very impressive and makes me rethink continuing in C++
And now with the CO2+methane binary mixture with polar PC-SAFT:
Benchmark CO2+methane with polar PC-SAFT model
-------------------------------------------------------------------------------
/Users/ihb/Documents/Code/teqp/src/tests/catch_test_SAFTpolar.cxx:282
...............................................................................
benchmark name samples iterations est run time
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
alphar 100 34 1.632 ms
480.159 ns 476.972 ns 495.515 ns
30.5972 ns 1.91783 ns 72.8483 ns
Ar11 100 13 1.638 ms
1.2582 us 1.25362 us 1.27651 us
43.1085 ns 4.35259 ns 102.404 ns
Ar02 100 18 1.6668 ms
934.446 ns 924.701 ns 951.735 ns
64.4561 ns 41.1593 ns 95.2632 ns
Ar20 100 13 1.6588 ms
1.2835 us 1.2667 us 1.35932 us
151.941 ns 4.33096 ns 359.826 ns
The top-line speed is still good, but the slowdown is quite a lot worse than your results.
I am very impressed with the slowdown that you are able to achieve with the dual types you are using. Do you think there is any way to get that working in C++ with a reasonable amount of effort? Or is there a lot of Rust magic involved?
As a postscript, in your benchmark, did you really supply T,p,x_1 as inputs? Or did you provide the density that matches that given pressure?
At the end of the day, since the multifluid slowdown is similar in general to yours (4.8x for all second derivatives), I suspect that there is additional optimization I can do to the polar PC-SAFT implementation to make the derivatives slowdown better. I have to say I am not really sure what that would entail or where to start looking for optimizations though.
Especially, since (if I'm not mistaken) I pretty much work in the building next to your office.
Hi @prehner, we should definitely meet one of these days. Yes, your building at ETH is just next to mine. Happy to learn more about what you are doing and to tell you a bit about what I do.
Just had a look at num-dual. Questions: why do you need the explicit clone method below, and is there any performance hit because of this? And is there a cost to constructing all the intermediate HyperHyperDual objects?

impl<'a, 'b, T: DualNum<F>, F: Float> Mul<&'a HyperHyperDual<T, F>> for &'b HyperHyperDual<T, F> {
    type Output = HyperHyperDual<T, F>;
    #[inline]
    fn mul(self, rhs: &HyperHyperDual<T, F>) -> HyperHyperDual<T, F> {
        HyperHyperDual::new(
            self.re.clone() * &rhs.re,
            self.eps1.clone() * &rhs.re + self.re.clone() * &rhs.eps1,
            self.eps2.clone() * &rhs.re + self.re.clone() * &rhs.eps2,
            self.eps3.clone() * &rhs.re + self.re.clone() * &rhs.eps3,
            self.eps1eps2.clone() * &rhs.re
                + self.eps1.clone() * &rhs.eps2
                + self.eps2.clone() * &rhs.eps1
                + self.re.clone() * &rhs.eps1eps2,
            self.eps1eps3.clone() * &rhs.re
                + self.eps1.clone() * &rhs.eps3
                + self.eps3.clone() * &rhs.eps1
                + self.re.clone() * &rhs.eps1eps3,
            self.eps2eps3.clone() * &rhs.re
                + self.eps2.clone() * &rhs.eps3
                + self.eps3.clone() * &rhs.eps2
                + self.re.clone() * &rhs.eps2eps3,
            self.eps1eps2eps3.clone() * &rhs.re
                + self.eps1.clone() * &rhs.eps2eps3
                + self.eps2.clone() * &rhs.eps1eps3
                + self.eps3.clone() * &rhs.eps1eps2
                + self.eps2eps3.clone() * &rhs.eps1
                + self.eps1eps3.clone() * &rhs.eps2
                + self.eps1eps2.clone() * &rhs.eps3
                + self.re.clone() * &rhs.eps1eps2eps3,
        )
    }
}
Code taken from: https://github.com/itt-ustutt/num-dual/blob/63003f25854ee8b7aa3d794ac9aef039f774bbaa/src/hyperhyperdual.rs#L250C1-L278C55
@allanleal
- self.re, self.eps1, etc., will be simple types (floats), but that is not enforced at this point. Therefore, we need the explicit clone. During compilation, all the types are known, and Rust should be smart enough to optimize all of that away.
- Every operation creates new HyperHyperDual objects. There might be some space for optimization there, but, again, at that point the effect of changes to the code on performance is difficult to predict due to all the layers of optimization that the Rust compiler and LLVM do.

As a postscript, in your benchmark, did you really supply T,p,x_1 as inputs? Or did you provide the density that matches that given pressure?
No, we evaluate the Helmholtz energy and derivatives at a state that corresponds to that pressure. Otherwise, we would really be comparing apples to oranges in this thread (and FeOs would be insanely fast).
Can you point me to the place in the repo where this benchmark is?
Here are timings for pure CO2 on an M2 as well:
Thanks, we are all on the same page, I just wanted to be sure. Is there overhead in Rust from the units or is that compiled away? Units are one of the things I would like to handle better.
Ok, so it looks like I have my work cut out for me in optimizing my PC-SAFT implementation. The top-line speed seems to be a hair better in teqp on the same processor, but the derivatives are comparatively worse. I fear I have lots of copies being made all over the place in the PC-SAFT code, which penalizes the derivatives more than the base evaluation.
Thanks, we are all on the same page, I just wanted to be sure. Is there overhead in Rust from the units or is that compiled away? Units are one of the things I would like to handle better.
We initially had quantities in the form of structs (classes) that carried the units as additional fields. Now, the units are handled by the compiler, i.e., incorrect units are already detected at compile-time and there should be no penalty on performance. We use units in algorithms like phase equilibria and of course in the UI, but internally, for the evaluation of models, everything is converted to reduced units. For more complicated models (like PC-SAFT) the evaluation of the Helmholtz energy should dominate the performance, but still, there might be instances where unit conversions could be more efficient if they were hard-coded. In any case, we gladly accept the minuscule hit in performance for a robust, error-proof implementation of physical quantities.
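As a rough sketch of the idea of compile-time units with zero runtime cost (illustrative only, not the actual quantity handling in FeOs):

// SI exponents are encoded as const generics, so a quantity is a plain f64 at
// runtime and a dimension mismatch is a type error at compile time.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Quantity<const METER: i8, const KILOGRAM: i8, const SECOND: i8, const KELVIN: i8>(f64);

type Volume = Quantity<3, 0, 0, 0>;
type Temperature = Quantity<0, 0, 0, 1>;

impl<const M: i8, const KG: i8, const S: i8, const K: i8> std::ops::Add for Quantity<M, KG, S, K> {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        // Only quantities with identical exponents share this impl,
        // so adding a Volume to a Temperature does not compile.
        Quantity(self.0 + rhs.0)
    }
}

fn main() {
    let v = Volume(1e-3) + Volume(2e-3);
    // let nonsense = Volume(1e-3) + Temperature(300.0); // rejected at compile time
    println!("{v:?}");
}

Multiplication and division, which change the exponents, need a bit more machinery (or generated impls), but the principle of zero runtime overhead stays the same.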
My solution is naive: all units are always (ok, almost always) base SI: m, Pa, kg, K, etc. And I think the core library can stay that way, but the outer code should support units for people that want different units. So I am looking for good solutions that can still be performant, user friendly, and not muck up the code too much.
Does Rust do units itself, or with an external package?
@ianhbell I think the questions raised in this thread were answered? If so, I'd close the issue. We can re-open or create a new one if more questions come up.
yes, fine by me - it is instructive to learn from each other and see where further optimizations are possible
Does feos do any optimizations when calculating multiple derivatives from the model at the same time? With the help of @allanleal we have been able to get the complete set of derivatives of order 2 (00, 01, 10, 02, 11, 20, with the indices corresponding to how many derivatives of (1/T) and D respectively) completed in the equivalent time of 5 calls to the underlying Helmholtz energy function. That's pretty good I think. But I wonder whether you can do better in feos? And if so, what magic tricks you have applied to do so. This calculation is a pinch point in general flash calculation routines, so any speedup has a direct knock-on effect on user code.