Open cgraeser opened 4 weeks ago
This is marked as draft, because there are several things that need to be discussed, details that may need to be changed, and features that may need to be added. This includes the following topics:
adolc
, the naming of types uses CamelCase while adolc uses lower case elsewhere.ADValues
are created and initialized and derivatives are read. Currently the whole stored jet is exposed publicly. I'd propose to only expose individual partial derivatives in order to be more flexible for changing the internal data structure later on, e.g. to use a Taylor-based implementation for higher order.USE_BOOST_POOL
macro has to be set manually so far.Notice that the test failure is unrelated, since this pull request does not change anything about the existing code and the failure also shows up in current master.
In general one would expect that it is beneficial to exploit symmetry for second order derivatives. To my surprise this is slower for static dimension. My first guess on why computing only one triangle of the Hessian is slower was, that this is due to the conditional index flip when requesting values. It turns out that this is not the case. Even if one only queries the computed triangle and removes the index flip this is still slower. To be precise, using
for(int i=0; i<dim; ++i)
for(int j=0; j<=dim; ++j)
...
did always produce faster code than
for(int i=0; i<dim; ++i)
for(int j=i; j<=dim; ++j)
...
Surprisingly (in the static dim case) the following is still faster than the latter but not fully on par with the former
for(int i=0; i<dim; ++i)
for(int j=0; j<=dim; ++j)
if (j>=i)
...
I guess that the code for computing the full matrix benefits from auto-vectorization using SIMD operations. In my tests (local Laplace assembler as Hessian of a local Dirichlet energy) the difference was almost a factor two (with static dim), i.e. this aspect is significant.
For dynamic dimension the situation is different: Here exploiting symmetry was measurably faster.
Another performance observation:
One can always compute higher order derivatives using nested ADValues
. Simply using a nested 1st order dim-variate ADValue
to compute the Hessian is slow. But interestingly you can even compute mixed 2nd order derivatives using nested univariate ADValue
s if you initialize the derivatives appropriately. Then you do one f-evaluation per derivative:
// Raw input vector
using Vector = std::array<double,dim>;
// Nested 1st order univariate AD-value to compute mixed second order derivatives
using Nested = ADValue<ADValue<double,1,1>, 1, 1>;
// AD-aware vector
using ADVector = std::array<NestedAD, dim>;
Vector x = ...
ADVector x_ad;
for(int i=0; i<dim; ++i)
for(int j=i; j<=dim; ++j)
{
// Initialize values of AD-aware vector
for(int k=0; k<=dim; ++k)
x_ad[k] = x[k];
// Track (i,j)-th derivative
x_ad[i].partial(0).partial() = 1;
x_ad[j].partial().partial(0) = 1;
auto y = f_raw(x_ad);
hessian[i][j] = y.partial(0).partial(0);
hessian[j][i] = hessian[i][j];
}
Surprisingly this was in my tests on par with the dynamic size 2nd order ADValue<double,2,dynamicDim>
if you the latter also exploits symmetry. For me this put's a little bit in question if one really needs the (internally significantly more complicated) dynamic dimension version.
I guess that the code for computing the full matrix benefits from auto-vectorization using SIMD operations. In my tests (local Laplace assembler as Hessian of a local Dirichlet energy) the difference was almost a factor two (with static dim), i.e. this aspect is significant.
Hmmm interesting. Did your tests include higher dimensions and more complex functions? I'm not sure if I understand what your actual test function looks like. I would expect this to not hold in general. Otherwise, it would be something to keep in mind for the optimization of ADOL-C. I would further expect that this behavior changes for higher derivative (>2) tensors as well.
Since we currently dont have a higher-order tape-less approach, its very cool that your approach can compute higher-order derivatives. At some point this could be replaced by an interface without having to nest all types, as with the tape-based method of ADOL-C.
I only tested 2nd order (more is not supported by ADValue
currently). But since there's more duplicate derivatives for higher order you're probably right.
I basically used two test cases:
ADValue
should support (various combinations of arithmetic operations with constants/non-constant values, default construction, conditionals, interaction with custom matrix types, ...). This currently takes the following form, where Dune::FieldMatrix<T,n,m>
is a simple statically sized n
xm
matrix with scalar entries of type T
:
auto f_raw = [](auto x) {
using std::sin;
using std::cos;
using std::exp;
using std::log;
using std::pow;
using std::fabs;
using F = std::decay_t<decltype(x[0])>;
F c1;
c1 = 13;
F c2(42);
F c3;
c3 = 1;
Dune::FieldMatrix<F, 3, 3> A;
for(auto i : Dune::range(3))
for(auto j : Dune::range(3))
A[i][j] = sin(x[0]*i + cos(x[1])) + exp(cos(j*x[0]*x[1]))/(1+x[0]*x[0]+x[1]*x[1]) + 2/(1+x[0]*x[0]+x[1]*x[1]);
F z = 1;
z *= pow(A.determinant(), 3) + pow(2, sin(x[0]*x[1])) + c1 + c2 + pow(sin(x[0])+2., cos(x[1])+2);
z *= c3;
if ((x[0]>0) and (x[0] <= x[1]))
return F(z);
else
return F(0);
};
Since we currently dont have a higher-order tape-less approach, its very cool that your approach can compute higher-order derivatives. At some point this could be replaced by an interface without having to nest all types, as with the tape-based method of ADOL-C.
This was initially developed within the Dune module dune-fufem (https://gitlab.dune-project.org/fufem/dune-fufem).
The class
adolc::ADValue<T, maxOrder,dim>
implementes a traceless forward mode templated with respect to the underlying typeT
, the maximal tracked derivative ordermaxOrder
and the domain dimensiondim
. Currently onlymaxOrder<=2
is implemented.Using the special value
dynamicDim
allows to use a runtime dimension. In the latter case the default is to usestd::vector
as storage. If support forboost::pool
is activated, this is used instead.Notable differences to
adtl::adouble
are:T=double
e.g. allows for low/high precision or interval arithmetics.boost:pool
s per dimension instead of a single one for a globally fixed dimension. This also avoids the memory leaks ofadtl::adouble
when changing the global dimension. Since the pools arethread_local
instead of global, this again provides thread-safety (X).The last point could probably also be implemented for
adtl::adouble
.(x) "Thread-safety" hear does not mean that concurrent access to a single
ADValue
object is safe. Instead if means that concurrent accesss to different objects in different threads is safe and not prone to race-conditions due to internally used global variables.