coin-or / ADOL-C

A Package for Automatic Differentiation of Algorithms Written in C/C++

[draft] Introduce the ADValue class for templated traceless forward mode #64

Open cgraeser opened 4 weeks ago

cgraeser commented 4 weeks ago

This was initially developed within the Dune module dune-fufem (https://gitlab.dune-project.org/fufem/dune-fufem).

The class adolc::ADValue<T, maxOrder, dim> implements a traceless forward mode, templated with respect to the underlying type T, the maximal tracked derivative order maxOrder, and the domain dimension dim. Currently only maxOrder <= 2 is implemented.

Using the special value dynamicDim allows the dimension to be selected at runtime. In this case the default is to use std::vector as storage; if support for boost::pool is activated, that is used instead.
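
To illustrate the interface, here is a minimal first-order usage sketch based on the accessor pattern shown later in this thread; the header name and the exact initialization pattern are assumptions:

#include <adolc/advalue.h> // header name is an assumption

using AD = adolc::ADValue<double, 1, 2>; // 1st order, two independent variables

void example()
{
  AD x0 = 1.0;
  AD x1 = 2.0;
  x0.partial(0) = 1; // mark x0 as the 0-th independent variable
  x1.partial(1) = 1; // mark x1 as the 1-st independent variable

  AD y = x0*x1 + sin(x0);
  double dy_dx0 = y.partial(0); // d/dx0 = x1 + cos(x0) = 2 + cos(1)
  double dy_dx1 = y.partial(1); // d/dx1 = x0 = 1
}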

Notable differences to adtl::adouble are:

The last point could probably also be implemented for adtl::adouble.

(x) "Thread-safety" hear does not mean that concurrent access to a single ADValue object is safe. Instead if means that concurrent accesss to different objects in different threads is safe and not prone to race-conditions due to internally used global variables.

cgraeser commented 4 weeks ago

This is marked as a draft because there are several things that need to be discussed, details that may need to be changed, and features that may need to be added. This includes the following topics:

cgraeser commented 3 weeks ago

Notice that the test failure is unrelated, since this pull request does not change anything about the existing code and the failure also shows up in current master.

cgraeser commented 2 weeks ago

In general one would expect that it is beneficial to exploit symmetry for second order derivatives. To my surprise, this is slower for static dimension. My first guess was that computing only one triangle of the Hessian is slower due to the conditional index flip when requesting values. It turns out that this is not the case: even if one only queries the computed triangle and removes the index flip, it is still slower. To be precise, using

for(int i=0; i<dim; ++i)
  for(int j=0; j<dim; ++j)
    ...

did always produce faster code than

for(int i=0; i<dim; ++i)
  for(int j=i; j<dim; ++j)
    ...

Surprisingly (in the static dim case), the following is still faster than the latter, but not fully on par with the former:

for(int i=0; i<dim; ++i)
  for(int j=0; j<dim; ++j)
    if (j>=i)
      ...
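
For reference, the conditional index flip mentioned above looks roughly like this (a sketch; the packed lower-triangular storage layout is an assumption):

// Query the (i,j)-th Hessian entry when only the triangle j<=i is computed
double hessianEntry(int i, int j) const
{
  if (j > i)
    std::swap(i, j); // conditional index flip into the stored triangle
  return data_[i*(i+1)/2 + j]; // packed lower-triangular storage (assumption)
}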

I guess that the code for computing the full matrix benefits from auto-vectorization using SIMD operations. In my tests (a local Laplace assembler, i.e. the Hessian of a local Dirichlet energy) the difference was almost a factor of two (with static dim), i.e. this aspect is significant.

For dynamic dimension the situation is different: Here exploiting symmetry was measurably faster.

cgraeser commented 2 weeks ago

Another performance observation:

One can always compute higher order derivatives using nested ADValues. Simply using a nested 1st order dim-variate ADValue to compute the Hessian is slow. But interestingly, you can even compute mixed 2nd order derivatives using nested univariate ADValues if you initialize the derivatives appropriately. Then you do one f-evaluation per Hessian entry:

// Raw input vector
using Vector = std::array<double,dim>;

// Nested 1st order univariate AD-value to compute mixed second order derivatives
using NestedAD = ADValue<ADValue<double,1,1>, 1, 1>;

// AD-aware vector
using ADVector = std::array<NestedAD, dim>;

Vector x = ...;
ADVector x_ad;
// Hessian storage (e.g. a plain 2d array)
double hessian[dim][dim];
for(int i=0; i<dim; ++i)
  for(int j=i; j<dim; ++j)
  {
    // Initialize values of AD-aware vector
    for(int k=0; k<dim; ++k)
      x_ad[k] = x[k];
    // Track the (i,j)-th derivative: direction i in the outer, j in the inner ADValue
    x_ad[i].partial(0).partial() = 1;
    x_ad[j].partial().partial(0) = 1;
    auto y = f_raw(x_ad);
    hessian[i][j] = y.partial(0).partial(0);
    hessian[j][i] = hessian[i][j];
  }

Surprisingly, in my tests this was on par with the dynamic size 2nd order ADValue<double,2,dynamicDim>, if the latter also exploits symmetry. For me this somewhat calls into question whether one really needs the (internally significantly more complicated) dynamic dimension version.
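
For comparison, the direct 2nd order approach referenced here would look roughly like this, reusing dim, x, hessian and f_raw from the example above (a sketch with static dimension; the two-index accessor partial(i,j) is an assumption):

// One f-evaluation for the whole Hessian using a single 2nd order ADValue
using AD2 = ADValue<double, 2, dim>;

std::array<AD2, dim> x_ad;
for(int k=0; k<dim; ++k)
{
  x_ad[k] = x[k];
  x_ad[k].partial(k) = 1; // mark the k-th independent variable
}
auto y = f_raw(x_ad);
for(int i=0; i<dim; ++i)
  for(int j=i; j<dim; ++j)
    hessian[i][j] = hessian[j][i] = y.partial(i, j);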

TimSiebert1 commented 2 weeks ago

> I guess that the code for computing the full matrix benefits from auto-vectorization using SIMD operations. In my tests (a local Laplace assembler, i.e. the Hessian of a local Dirichlet energy) the difference was almost a factor of two (with static dim), i.e. this aspect is significant.

Hmmm, interesting. Did your tests include higher dimensions and more complex functions? I'm not sure I understand what your actual test function looks like. I would expect this not to hold in general; otherwise, it would be something to keep in mind for the optimization of ADOL-C. I would further expect this behavior to change for higher derivative (>2) tensors as well.

TimSiebert1 commented 2 weeks ago

Since we currently don't have a higher-order tapeless approach, it's very cool that your approach can compute higher-order derivatives. At some point this could be wrapped in an interface that avoids nesting all the types by hand, as with the tape-based method of ADOL-C.

cgraeser commented 2 weeks ago

I only tested 2nd order (more is currently not supported by ADValue). But since there are more duplicate derivatives for higher order, you're probably right.
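
For reference, the redundancy grows quickly with the order: a symmetric order-$k$ derivative tensor in dimension $d$ has $d^k$ entries but only $\binom{d+k-1}{k}$ distinct ones, e.g. for $d=3$ that is $9$ vs. $6$ at order $2$, but already $27$ vs. $10$ at order $3$.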

I basically used two test cases:

  1. Compute the Hessian of the local Dirichlet energy $f(v) = \frac{1}{2}\int_e |\nabla \sum_{i=1}^{8} v_i \lambda_i|^2 \,dx$, where $v$ are the coefficients with respect to the first order finite element shape functions $(\lambda_i)$ on a hexahedron $e$.
  2. Compute gradient and Hessian of a nested nonlinear binary function designed to incorporate many "features" that ADValue should support (various combinations of arithmetic operations with constant and non-constant values, default construction, conditionals, interaction with custom matrix types, ...). This currently takes the following form, where Dune::FieldMatrix<T,n,m> is a simple statically sized n×m matrix with scalar entries of type T:
    auto f_raw = [](auto x) {
      using std::sin;
      using std::cos;
      using std::exp;
      using std::log;
      using std::pow;
      using std::fabs;
      using F = std::decay_t<decltype(x[0])>;
      F c1;
      c1 = 13;
      F c2(42);
      F c3;
      c3 = 1;
      Dune::FieldMatrix<F, 3, 3> A;
      for(auto i : Dune::range(3))
        for(auto j : Dune::range(3))
          A[i][j] = sin(x[0]*i + cos(x[1])) + exp(cos(j*x[0]*x[1]))/(1+x[0]*x[0]+x[1]*x[1]) + 2/(1+x[0]*x[0]+x[1]*x[1]);
      F z = 1;
      z *= pow(A.determinant(), 3) + pow(2, sin(x[0]*x[1])) + c1 + c2 + pow(sin(x[0])+2., cos(x[1])+2);
      z *= c3;
      if ((x[0]>0) and (x[0] <= x[1]))
        return F(z);
      else
        return F(0);
    };
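
For completeness, a sketch of how f_raw could be driven in this test; the point (0.25, 0.5) is chosen to hit the nonzero branch, and the zero-argument partial() for the value and the two-index partial(i,j) for second derivatives are assumptions based on the nested example above:

using AD = adolc::ADValue<double, 2, 2>; // 2nd order, two variables

std::array<AD, 2> x_ad;
x_ad[0] = 0.25; x_ad[0].partial(0) = 1;
x_ad[1] = 0.5;  x_ad[1].partial(1) = 1;

auto y = f_raw(x_ad);
double value = y.partial();     // function value (zero-index partial)
double g0    = y.partial(0);    // gradient entries
double g1    = y.partial(1);
double h01   = y.partial(0, 1); // mixed second derivative
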
cgraeser commented 2 weeks ago

> Since we currently don't have a higher-order tapeless approach, it's very cool that your approach can compute higher-order derivatives. At some point this could be wrapped in an interface that avoids nesting all the types by hand, as with the tape-based method of ADOL-C.