LLNL / sundials

Official development repository for SUNDIALS - a SUite of Nonlinear and DIfferential/ALgebraic equation Solvers. Pull requests are welcome for bug fixes and minor changes.
https://computing.llnl.gov/projects/sundials
BSD 3-Clause "New" or "Revised" License

Understand ARKODE vs boost::odeint performance difference #66

Closed yizhang-yiz closed 3 years ago

yizhang-yiz commented 3 years ago

I'm seeking help in figuring out the cause of the performance difference between dopri5 from ARKODE and from odeint when solving the Lotka-Volterra equations. The ODE itself is not essential, as I see similar gaps on other problems. The benchmark code can be found at https://github.com/metrumresearchgroup/math/blob/torsten-arkode/test/unit/math/prim/functor/debug_arkode_test.cpp

To reproduce the very different wall times:

bash-3.2$ git clone --recurse-submodules --single-branch --branch  torsten-arkode https://github.com/metrumresearchgroup/math.git
bash-3.2$ ./runTests.py -j4 test/unit/math/prim/functor/debug_arkode_test.cpp
------------------------------------------------------------
mpirun -np 1 test/unit/math/prim/functor/debug_arkode_test --gtest_output="xml:test/unit/math/prim/functor/debug_arkode_test.xml"
Running main() from lib/benchmark_1.5.1/googletest/googletest/src/gtest_main.cc
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from arkode
[ RUN      ] arkode.lotka
[       OK ] arkode.lotka (375 ms)
[----------] 1 test from arkode (375 ms total)

[----------] 1 test from odeint
[ RUN      ] odeint.lotka
[       OK ] odeint.lotka (83 ms)
[----------] 1 test from odeint (83 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (458 ms total)
[  PASSED  ] 2 tests.

What changes can I make in the code to bring ARKODE's performance closer to odeint's? My main suspects are the different error norms and adaptive step-size controllers used by the two solvers. Could they cause that much of a difference?
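For example, if the adaptive controller is part of the issue, I could presumably experiment with ARKODE's built-in controllers through ERKStepSetAdaptivityMethod. A minimal sketch (the controller choice and parameter values here are purely illustrative, not what the benchmark currently uses):

// Illustrative only: switch ERKStep to a PI step-size controller with its default parameters.
// imethod = 1 selects PI, idefault = 1 keeps the built-in parameter values,
// pq = 0 bases adaptivity on the embedding order, adapt_params = NULL since defaults are used.
int flag = ERKStepSetAdaptivityMethod(arkode_mem, 1, 1, 0, NULL);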

balos1 commented 3 years ago

Please post the number of steps taken by ARKODE and odeint. Are you using the same error tolerances? It is hard to say what could be going on based on execution time alone; knowing the number of time steps would be helpful (you can get this from ERKStep with the function ERKStepGetNumSteps).
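For example (a minimal sketch, where arkode_mem stands for whatever pointer ERKStepCreate returned in your test):

long int nsteps = 0;
ERKStepGetNumSteps(arkode_mem, &nsteps);   // total number of internal steps taken so far
std::cout << "# of steps ( arkode ) :" << nsteps << "\n";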

You might be interested in this conversation about performance differences between Julia DifferentialEquations.jl and ARKODE.

yizhang-yiz commented 3 years ago

Thanks @balos1. I actually followed that thread. You mentioned you hacked in a different weight for the WRMS norm. How did you do that, in particular using the previous two steps' results? I'm assuming there is somewhere I can save the current result so I can use it in ARKEwtFn, but I'm not sure where to do that.

balos1 commented 3 years ago

Implementing the other error weight requires changes in the SUNDIALS source itself, and in the case of the comparison with Julia, it was unnecessary. It would be better to understand the performance difference you are reporting so that we can determine what is causing it.

Can you provide the number of time steps taken by ARKODE and odeint?

yizhang-yiz commented 3 years ago

I see. Here's the # of eval printout:

bash-3.2$ ./runTests.py -j4 test/unit/math/prim/functor/debug_arkode_test.cpp
mpirun -np 1 test/unit/math/prim/functor/debug_arkode_test --gtest_output="xml:test/unit/math/prim/functor/debug_arkode_test.xml"
Running main() from lib/benchmark_1.5.1/googletest/googletest/src/gtest_main.cc
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from arkode
[ RUN      ] arkode.lotka
# of RHS evals ( arkode ) :4593101
[       OK ] arkode.lotka (349 ms)
[----------] 1 test from arkode (349 ms total)

[----------] 1 test from odeint
[ RUN      ] odeint.lotka
# of RHS evals ( odeint ) :4837436
[       OK ] odeint.lotka (79 ms)
[----------] 1 test from odeint (79 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (428 ms total)
[  PASSED  ] 2 tests.

EDIT: sorry, I printed out the # of RHS evals, not the # of steps. Let me add the steps printout ASAP.

drreynolds commented 3 years ago

Hi Yi,

From this output it looks like ARKODE takes around 5% fewer time steps than odeint, so the runtime differences that you are seeing most likely arise from differences in how you have implemented the right-hand side function, since for these explicit methods the dominant cost comes from those evaluations. Apparently, your implementation using odeint takes approximately 1.6e-5 ms per RHS call, whereas your implementation using ARKODE takes around 7.6e-5 ms per RHS call.
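For reference, those per-call figures come directly from the wall times and RHS counts reported above:

349 ms / 4593101 evals ≈ 7.6e-5 ms per RHS call   (ARKODE)
 79 ms / 4837436 evals ≈ 1.6e-5 ms per RHS call   (odeint)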

Can you provide any context about how you have implemented these two problems? Could it be that odeint defaults to using all available cores on your system, whereas your use of ARKODE uses only a single core?


yizhang-yiz commented 3 years ago

@drreynolds The printout above is the number of RHS evaluations, not steps. I'll add the steps printout asap.

The implementation can be found at the link in the first post. In particular, an overloaded functor calculates the RHS for ARKODE & odeint:

  // ARKODE-style RHS: state y and derivative ydot are passed as N_Vectors
  void operator()(double t_in, N_Vector& y, N_Vector& ydot) {
    n_eval++;
    double alpha = theta[0];
    double beta = theta[1];
    double gamma = theta[2];
    double delta = theta[3];

    NV_Ith_S(ydot, 0) = (alpha - beta * NV_Ith_S(y, 1)) * NV_Ith_S(y, 0);
    NV_Ith_S(ydot, 1) = (-gamma + delta * NV_Ith_S(y, 0)) * NV_Ith_S(y, 1);
  }

  // odeint-style RHS: state and derivative are std::vector<double>
  void operator()(const std::vector<double>& y, std::vector<double>& ydot, double t) {
    n_eval++;
    double alpha = theta[0];
    double beta = theta[1];
    double gamma = theta[2];
    double delta = theta[3];

    ydot[0] = (alpha - beta * y[1]) * y[0];
    ydot[1] = (-gamma + delta * y[0]) * y[1];
  }

There's no parallelism involved.

drreynolds commented 3 years ago

@yizhang-yiz, the fundamental metric for explicit methods is the number of RHS evaluations, so although the corresponding number of steps for each method would be interesting, the data you reported above should be the most relevant.

That said, from the code above I do not see why the first routine takes almost 5x the execution time. For larger ODE systems we typically recommend that users avoid the "convenience macros" like NV_Ith_S and instead access the data array pointer and reference the entries directly, e.g.

double *ydata = NV_DATA_S(y);
double *yddata = NV_DATA_S(ydot);
yddata[0] = (alpha - beta * ydata[1]) * ydata[0];
yddata[1] = (-gamma + delta * ydata[0]) * ydata[1];

However, this system is so small that I doubt this is the cause of the runtime difference.

yizhang-yiz commented 3 years ago

@drreynolds Yeah, this baffles me. Since it happens for all the equations in my benchmark (all small ODEs like this one), I'm assuming it has something to do with the way I use arkode.

On the other hand, it's not clear to me why NV_Ith_S is less efficient: isn't it just a macro for NV_DATA_S(v)[i]?

drreynolds commented 3 years ago

@yizhang-yiz: yes, you're correct that it is just a macro. That said, it's been my impression that these macros make it more difficult for the compiler to leverage data locality (if it thinks that the data is not contiguous then it may load and/or store the same page multiple times). For a point of reference, we definitely notice a slowdown when using similar macros for dense and banded matrix accesses.

It might be worth trying an experiment using the direct data access approach that I showed above. If it speeds things up then the problem is solved; if not then hopefully some other folks who are more 'expert' in CS can chime in.

balos1 commented 3 years ago

@yizhang-yiz I don't think the issue is the use of the NV_Ith_S macro.

What exactly is being timed? Is it the entire body of the TEST functions (i.e. this one for arkode and this one for odeint)?

If those entire TEST functions are being timed then one reason for the discrepancy could be the unnecessary call to ERKStepReInit right after ERKStepCreate.

Unlike CVODE (if you are familiar with using it), the ARKODE integrators do not require separate Create and Init calls; Create alone sets everything up. Hence there is no ERKStepInit function, only ERKStepReInit (which has a different use case).

Furthermore, if the entire TEST functions are being timed, it could be that odeint is faster than ARKODE at allocating the memory it needs when the integrator is created. It would be interesting to see how long the actual integration takes, i.e. the loop on line 131 by itself.
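For reference, a minimal ERKStep driver (no ReInit) looks roughly like the sketch below; the f, user_data, tolerance, and output-time names are placeholders rather than the values in your benchmark:

#include <arkode/arkode_erkstep.h>
#include <nvector/nvector_serial.h>

// f must match ARKRhsFn: int f(realtype t, N_Vector y, N_Vector ydot, void* user_data)
void* arkode_mem = ERKStepCreate(f, t0, y0);        // create + initialize in one call
ERKStepSetUserData(arkode_mem, user_data);          // e.g. pointer to the functor
ERKStepSStolerances(arkode_mem, reltol, abstol);    // scalar relative/absolute tolerances

realtype t = t0;
for (int i = 0; i < nout; ++i) {
  ERKStepEvolve(arkode_mem, tout[i], y, &t, ARK_NORMAL);  // advance to the next output time
}

long int nsteps;
ERKStepGetNumSteps(arkode_mem, &nsteps);
ERKStepFree(&arkode_mem);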

yizhang-yiz commented 3 years ago

@balos1 The time is for the entire test body. ReInit is there because the test was scraped from a larger body of production code, and some earlier tests show it's not the cause of the lag. Thanks for pointing out ERKStepInit, but somehow I couldn't find it in the manual. I thought ERKStep only requires ERKStepCreate for setup.

Anyway, I've removed ReInit and added timing for the loop. It looks like the stepping does dominate:

mpirun -np 1 test/unit/math/prim/functor/debug_arkode_test --gtest_output="xml:test/unit/math/prim/functor/debug_arkode_test.xml"
Running main() from lib/benchmark_1.5.1/googletest/googletest/src/gtest_main.cc
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from arkode
[ RUN      ] arkode.lotka
ERKStep elapsed time: 0.339325s
# of RHS evals ( arkode ) :4593101
[       OK ] arkode.lotka (339 ms)
[----------] 1 test from arkode (339 ms total)

[----------] 1 test from odeint
[ RUN      ] odeint.lotka
# of RHS evals ( odeint ) :4837436
[       OK ] odeint.lotka (80 ms)
[----------] 1 test from odeint (80 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (419 ms total)
[  PASSED  ] 2 tests.

balos1 commented 3 years ago

@yizhang-yiz

Thanks for pointing out ERKStepInit, but somehow I couldn't find it in the manual. I thought ERK only requires ERKStepCreate to setup.

There is no ERKStepInit function. You are correct that you only need ERKStepCreate. I was pointing out that if you are used to CVODE, this is a difference to be aware of.

OK, it is good to know that it is indeed the stepping dominating. Next, can you time the odeint RHS and ARKODE RHS functions separately as well?

yizhang-yiz commented 3 years ago

can you time the odeint RHS and ARKODE RHS functions separately as well

I thought about this, but since each RHS eval costs so little, I'm afraid the timing code will obfuscate the total cost. I'm still trying to come up with a way to get it right.

balos1 commented 3 years ago

can you time the odeint RHS and ARKODE RHS functions separately as well

I thought about this, but since each RHS eval costs so little, I'm afraid the timing code will obfuscate the total cost. I'm still trying to come up with a way to get it right.

I think it would be fine if you just timed the RHS by themselves (outside of the integrators) by calling them directly with some random data in a loop.

balos1 commented 3 years ago

@yizhang-yiz Can you also provide details on how you built SUNDIALS? Did you build it with optimizations (either by providing optimization flags or setting CMAKE_BUILD_TYPE=Release)? If you can attach your CMakeCache.txt file that would be great.

yizhang-yiz commented 3 years ago

With

TEST(arkode, rhs_eval) {
  int n = 2;
  std::vector<double> theta{1.5, 1.05, 1.5, 2.05};
  std::vector<double> y0{0.3, 0.8};
  lotka_volterra ode(theta);

  N_Vector y = N_VNew_Serial(n);
  N_Vector ydot = N_VNew_Serial(n);
  NV_Ith_S(y, 0) = y0[0];
  NV_Ith_S(y, 1) = y0[1];

  const long int n_eval = 99999999;

  std::chrono::time_point<std::chrono::system_clock> start, end;
  std::chrono::duration<double, std::milli> elapsed;
  start = std::chrono::system_clock::now();
  for (auto i = 0; i < n_eval; ++i) {
    arkode_rhs<lotka_volterra>::fn(0.1, y, ydot, static_cast<void*>(&ode));
  }
  end = std::chrono::system_clock::now();
  elapsed = (end - start);

  std::cout << "arkode RHS elapsed time: " << elapsed.count() << " ms\n";
}

TEST(odeint, rhs_eval) {
  int n = 2;
  std::vector<double> theta{1.5, 1.05, 1.5, 2.05};
  std::vector<double> y0{0.3, 0.8};
  lotka_volterra ode(theta);

  std::vector<double> ydot(n);

  const long int n_eval = 99999999;

  std::chrono::time_point<std::chrono::system_clock> start, end;
  std::chrono::duration<double, std::milli> elapsed;
  start = std::chrono::system_clock::now();
  for (long int i = 0; i < n_eval; ++i) {
    ode(y0, ydot, 0.1);
  }
  end = std::chrono::system_clock::now();
  elapsed = (end - start);

  std::cout << "odeint RHS elapsed time: " << elapsed.count() << " ms\n";
}

to repeat RHS evaluation, I'm getting

arkode RHS elapsed time: 129.86 ms
[       OK ] arkode.rhs_eval (130 ms)
[----------] 1 test from arkode (130 ms total)

[----------] 1 test from odeint
[ RUN      ] odeint.rhs_eval
odeint RHS elapsed time: 0 ms

So it looks like the std::vector version is much cheaper.

I'm not using CMake but GNU make:

clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode.c -o lib/sundials_5.7.0/src/arkode/arkode.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_adapt.c -o lib/sundials_5.7.0/src/arkode/arkode_adapt.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_arkstep.c -o lib/sundials_5.7.0/src/arkode/arkode_arkstep.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_arkstep_io.c -o lib/sundials_5.7.0/src/arkode/arkode_arkstep_io.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_arkstep_nls.c -o lib/sundials_5.7.0/src/arkode/arkode_arkstep_nls.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_bandpre.c -o lib/sundials_5.7.0/src/arkode/arkode_bandpre.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_bbdpre.c -o lib/sundials_5.7.0/src/arkode/arkode_bbdpre.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_butcher.c -o lib/sundials_5.7.0/src/arkode/arkode_butcher.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_butcher_dirk.c -o lib/sundials_5.7.0/src/arkode/arkode_butcher_dirk.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_butcher_erk.c -o lib/sundials_5.7.0/src/arkode/arkode_butcher_erk.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_erkstep.c -o lib/sundials_5.7.0/src/arkode/arkode_erkstep.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_erkstep_io.c -o lib/sundials_5.7.0/src/arkode/arkode_erkstep_io.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_interp.c -o lib/sundials_5.7.0/src/arkode/arkode_interp.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_io.c -o lib/sundials_5.7.0/src/arkode/arkode_io.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_ls.c -o lib/sundials_5.7.0/src/arkode/arkode_ls.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_mri_tables.c -o lib/sundials_5.7.0/src/arkode/arkode_mri_tables.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_mristep.c -o lib/sundials_5.7.0/src/arkode/arkode_mristep.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_mristep_io.c -o lib/sundials_5.7.0/src/arkode/arkode_mristep_io.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_mristep_nls.c -o lib/sundials_5.7.0/src/arkode/arkode_mristep_nls.o
clang++ -pipe   -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT  -O3 -I lib/sundials_5.7.0/include -DNO_FPRINTF_OUTPUT     -O3  -c -x c -include lib/sundials_5.7.0/include/stan_sundials_printf_override.hpp lib/sundials_5.7.0/src/arkode/arkode_root.c -o lib/sundials_5.7.0/src/arkode/arkode_root.o

What would be the equivalent of CMAKE_BUILD_TYPE=Release?

yizhang-yiz commented 3 years ago

Looks like the compiler is doing something funny with N_Vector that may cause cache misses. If I do

TEST(arkode, rhs_eval) {
  N_Vector y = N_VNew_Serial(1);
  N_Vector ydot = N_VNew_Serial(1);
  NV_Ith_S(y, 0) = 1.0;         // dummy
  NV_Ith_S(ydot, 0) = 1.0;      // dummy

  const long int n_eval = 99999;

  std::chrono::time_point<std::chrono::system_clock> start, end;
  std::chrono::duration<double, std::milli> elapsed;
  start = std::chrono::system_clock::now();
  for (auto i = 0; i < n_eval; ++i) {
    for (auto j = 0; j < n_eval; ++j) {
      NV_Ith_S(ydot, 0) = NV_Ith_S(y, 0) * NV_Ith_S(y, 0);
    }
  }
  end = std::chrono::system_clock::now();
  elapsed = (end - start);

  std::cout << "arkode RHS elapsed time: " << elapsed.count() << " ms\n";
}

I'm getting timing

arkode RHS elapsed time: 241.659 ms

But if I do the same evaluation twice in the loop

  for (auto i = 0; i < n_eval; ++i) {
    for (auto j = 0; j < n_eval; ++j) {
      NV_Ith_S(ydot, 0) = NV_Ith_S(y, 0) * NV_Ith_S(y, 0);
      NV_Ith_S(ydot, 0) = NV_Ith_S(y, 0) * NV_Ith_S(y, 0);
    }
  }

I'm getting

arkode RHS elapsed time: 5527.08 ms

yizhang-yiz commented 3 years ago

Some further tests along this line. With optimization flags O=0 and O=1, the above test of evaluating the N_Vector statement once vs. twice within the loop gives sensible results: one evaluation costs roughly half the time of two evaluations. This is no longer the case with O=2 or O=3, where two evaluations cost ~20x as much as one.

I'm using clang on macOS:

bash-3.2$ clang++ --version
Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
bash-3.2$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.14.6
BuildVersion:   18G9216

Let me get on a Linux machine to see if this happens with gcc.

yizhang-yiz commented 3 years ago

The above issue seems to be gone on Linux: O=3 gives timings consistent with O=0 & O=1. On the other hand, the performance difference is still there regardless of the O flag. O=1:

test/unit/math/prim/functor/debug_arkode_test --gtest_output="xml:test/unit/math/prim/functor/debug_arkode_test.xml"
Running main() from lib/benchmark_1.5.1/googletest/googletest/src/gtest_main.cc
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from arkode
[ RUN      ] arkode.lotka
ERKStep elapsed time: 377.113 ms
# of RHS evals ( arkode ) :4593101
[       OK ] arkode.lotka (377 ms)
[----------] 1 test from arkode (377 ms total)

[----------] 1 test from odeint
[ RUN      ] odeint.lotka
# of RHS evals ( odeint ) :4837436
[       OK ] odeint.lotka (67 ms)
[----------] 1 test from odeint (67 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (444 ms total)
[  PASSED  ] 2 tests.

O=3:

test/unit/math/prim/functor/debug_arkode_test --gtest_output="xml:test/unit/math/prim/functor/debug_arkode_test.xml"
Running main() from lib/benchmark_1.5.1/googletest/googletest/src/gtest_main.cc
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from arkode
[ RUN      ] arkode.lotka
ERKStep elapsed time: 382.897 ms
# of RHS evals ( arkode ) :4593101
[       OK ] arkode.lotka (383 ms)
[----------] 1 test from arkode (383 ms total)

[----------] 1 test from odeint
[ RUN      ] odeint.lotka
# of RHS evals ( odeint ) :4837436
[       OK ] odeint.lotka (70 ms)
[----------] 1 test from odeint (70 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (453 ms total)
[  PASSED  ] 2 tests.

balos1 commented 3 years ago

What would be the equivalent of CMAKE_BUILD_TYPE=Release?

-O3 is essentially the equivalent.
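For reference, with GCC or Clang a CMake Release build of SUNDIALS boils down to compiling with -O3 -DNDEBUG, e.g. (the source path is a placeholder):

cmake -DCMAKE_BUILD_TYPE=Release /path/to/sundials   # implies -O3 -DNDEBUG for GCC/Clang

so the -O3 in your make output should be comparable.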

Can you replace the NV_Ith_S macros in the RHS and re-run the tests on Linux with the different optimization levels? Specifically, replace the macros by getting the array pointer and indexing it directly, like @drreynolds mentioned earlier.

double *ydata = NV_DATA_S(y);
double *yddata = NV_DATA_S(ydot);
yddata[0] = (alpha - beta * ydata[1]) * ydata[0];
yddata[1] = (-gamma + delta * ydata[0]) * ydata[1];

The reason why this could be the issue is that NV_DATA_S essentially does y->array, so when you do NV_Ith_S you have y->array[i], i.e. first you access the array pointer inside the vector structure, then access the element in the array. Perhaps this extra access is not being optimized away by the compiler.
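For reference, the serial vector macros in nvector_serial.h are defined essentially as:

#define NV_CONTENT_S(v)  ( (N_VectorContent_Serial)(v->content) )
#define NV_DATA_S(v)     ( NV_CONTENT_S(v)->data )
#define NV_Ith_S(v,i)    ( NV_DATA_S(v)[i] )

so every NV_Ith_S access chases the content and data pointers, whereas hoisting NV_DATA_S out of the loop does that chase only once.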

yizhang-yiz commented 3 years ago

The reason why this could be the issue is that NV_DATA_S essentially does y->array, so when you do NV_Ith_S you have y->array[i], i.e. first you access the array pointer inside the vector structure, then access the element in the array. Perhaps this extra access is not being optimized away by the compiler.

Thanks. I've tried that on both Linux + gcc & macOS + clang and ended up with the same outcome.

balos1 commented 3 years ago

@yizhang-yiz What about when timing the RHS only? Does replacing NV_Ith_S and using direct array access in the loops make a difference there?

yizhang-yiz commented 3 years ago

double* and NV_Ith_S show the same results, in both the ODE solution tests and the RHS eval tests.

balos1 commented 3 years ago

@yizhang-yiz When you reported the timings from the RHS earlier, it showed that the odeint RHS took 0 ms, which seems off. I suspected the compiler was optimizing the loop away for the odeint case, so I recreated the experiment and saw that it did indeed do so. Thus I added two lines after calling the RHS to make sure the compiler left the loop in place.

See this gist for the code: https://gist.github.com/balos1/fe0f361139d03dad32673b11fc308961.
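The general idea (not necessarily the exact lines used in the gist) is to make later iterations and later code depend on each RHS result, e.g. for the odeint version:

double sink = 0.0;                 // hypothetical accumulator, just to keep the results live
for (long int i = 0; i < n_eval; ++i) {
  ode(y0, ydot, 0.1);
  y0[0] = ydot[1];                 // next iteration depends on this one
  sink += ydot[0];                 // result is consumed after the loop
}
std::cout << sink << "\n";         // prevents the compiler from deleting the loop entirely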

The results (on my macbook pro):

$ ./compare_rhs.03
odeint RHS elapsed time: 359.889 ms
odeint RHS count: 99999999
arkode RHS elapsed time: 359.054 ms
arkode RHS count: 99999999

When I ran w/o the NV_Ith_S macro, using direct array access instead in the RHS:

$ ./compare_rhs.03
odeint RHS elapsed time: 365.849 ms
odeint RHS count: 99999999
arkode RHS elapsed time: 364.643 ms
arkode RHS count: 99999999

Can you try compiling and running the code from the gist I posted and report what you see now?

yizhang-yiz commented 3 years ago

You're right. I should've used the result in the loop so it doesn't get optimized away. Now I'm getting similar RHS elapsed times.

What does this imply for the original issue? Note that the ODE solutions in both cases are actually copied to an observer object, and if I print ob.result I see consistent results.

yizhang-yiz commented 3 years ago

FWIW, with O=0 ARKODE is actually faster, in both the RHS evaluation and the ODE solve.

[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from arkode
[ RUN      ] arkode.lotka
# of RHS evals ( arkode ) :4593101
[       OK ] arkode.lotka (1095 ms)
[ RUN      ] arkode.rhs_eval
arkode RHS elapsed time: 1669.89 ms
[       OK ] arkode.rhs_eval (1670 ms)
[----------] 2 tests from arkode (2765 ms total)

[----------] 2 tests from odeint
[ RUN      ] odeint.lotka
# of RHS evals ( odeint ) :4837436
[       OK ] odeint.lotka (1802 ms)
[ RUN      ] odeint.rhs_eval
odeint RHS elapsed time: 2559.6 ms
[       OK ] odeint.rhs_eval (2560 ms)
[----------] 2 tests from odeint (4362 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 2 test suites ran. (7127 ms total)
[  PASSED  ] 4 tests.

So perhaps this indicates ARKODE is less friendly to compiler optimization? Or should I be using a different set of compiler flags? (I'm totally out of my depth in that domain.)

balos1 commented 3 years ago

If the RHS functions perform similarly, and ARKODE does less work in terms of RHS evals, it is much harder to say why ARKODE is slower.

It could be that, for this size of a problem, odeint is faster because it does not have to deal with the overhead of function pointers like ARKODE does.
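For illustration, ARKODE always invokes the user RHS through an opaque C function pointer of type ARKRhsFn,

typedef int (*ARKRhsFn)(realtype t, N_Vector y, N_Vector ydot, void* user_data);

which generally cannot be inlined across the library boundary; the odeint functor, by contrast, is passed as a template argument, so a call like ode(y0, ydot, t) can be inlined and optimized at the call site.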

Or it could be that, since odeint is a header-only library, the compiler can do optimizations for it that cannot be done for ARKODE. I ran a slimmed-down version of your code and found that the optimization level used had a big impact on odeint.

Most likely, it is some combination of these.

One thing you could do is see how the timings compare as you increase the size of the problem.

yizhang-yiz commented 3 years ago

Actually, if I change the RHS timing test to follow the same logic (the same assignment using the independent & dependent variables)

// arkode
  for (auto i = 0; i < n_eval; ++i) {
    ode(0.1, y, ydot);    
    NV_Ith_S(y, 0) = NV_Ith_S(ydot, 0);
    NV_Ith_S(y, 1) = NV_Ith_S(ydot, 1);
  }
// odeint
  for (long int i = 0; i < n_eval; ++i) {
    ode(y0, ydot, 0.1);
    y0[0] = ydot[0];
    y0[1] = ydot[1];
  }

I'm getting

arkode RHS elapsed time: 508.349 ms
odeint RHS elapsed time: 399.947 ms

so somehow the N_Vector version is slightly less efficient than the std::vector version, and this has nothing to do with arkode or odeint.

You have a good point that this may all be due to overhead and the optimized template implementation. It's just a bit surprising that there's a 5x difference.

Let me see how they perform when the problem size is scaled up.

balos1 commented 3 years ago

I did notice I had a bug in the original version of the gist I posted. With it fixed, I do see a difference in the RHS timings:

$ ./compare_rhs.03
odeint RHS elapsed time: 367.858 ms
odeint RHS count: 99999999
arkode RHS elapsed time: 871.287 ms
arkode RHS count: 99999999

I suppose this difference must be because the arkode RHS requires getting the data pointer out of the N_Vector first (this is the only difference between the two RHS functions). So the odeint RHS is about 2.4x faster, and the complete solve is about 4.4x faster (on my machine).

For a larger problem, the overhead of accessing the array pointer should become negligible.

yizhang-yiz commented 3 years ago

Added some new tests. Test 1 shows the comparison of 4 RHS implementations

yizhang-yiz commented 3 years ago

I'm closing this. Thanks a lot for the discussion, @balos1.