Interesting! @pnorbert may have some insight here. Could you please post some lines from the log file, near the start and near the end?
In the `next` branch there is also an ADIOS2 output option: compiling with `-DBOUT_DOWNLOAD_ADIOS=ON` and setting `output:type=adios` should save an ADIOS2 dataset. See https://bout-dev.readthedocs.io/en/latest/user_docs/adios2.html

It would be good to know if this suffered from the same problem, or if it's a NetCDF thing.
Here are some lines early on in the simulation:
```
2.500e+04   1289   8.91e-01   67.3   7.1   4.4   3.8   17.4
2.510e+04   1130   7.86e-01   66.9   7.0   4.5   4.3   17.4
2.520e+04   1105   7.72e-01   66.8   7.0   4.4   4.4   17.4
2.530e+04   1027   7.27e-01   65.7   7.1   4.9   4.7   17.7
2.540e+04   1021   7.19e-01   66.0   7.0   4.7   4.7   17.6
2.550e+04   1149   7.99e-01   66.9   7.0   4.6   4.3   17.2
```
...and some at the end:

```
3.996e+05   1165   2.78e+00   20.0   2.3   1.8   68.3   7.6
3.997e+05   1015   2.62e+00   18.2   2.0   1.5   71.6   6.7
3.998e+05   1192   2.75e+00   20.2   2.3   1.7   68.9   7.0
3.999e+05   1058   2.63e+00   18.7   2.1   1.5   71.4   6.4
4.000e+05   1157   2.71e+00   20.0   2.2   1.5   69.2   7.2
```
I/O is the second-to-last column.
If you restart the simulation, does the IO time "reset"?
Yes, it does, as far as I remember. So it is no major issue, as restarting resolves it.
Ok, that's a little reassuring, I guess. Next question: does this appear in non-Hermes-3 models? I can probably check this with something like `blob2d` or `conduction`, as they're pretty fast to run.
I just noticed that a Hermes-2 simulation got OOM-killed after 15 time steps. This is probably indirectly related to #2900, as that increases the memory usage, but it also seems to imply that the memory consumption increases (slightly) over time.
This can be seen even in `conduction` over 500 timesteps (figure: absolute I/O timings, in seconds). The problem is present in both `master` and `next`.
I haven't managed to work out whether there's a memory leak, but the increase in I/O time does appear to be coming mostly from two lines in the profile: the top line is for variables without a time dimension, and the bottom one is for those with one. It turns out we are writing a lot of time-independent variables on every timestep, and this both takes a decent fraction of the total I/O time and increases in cost over the course of the run.
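To make that concrete, here is a minimal netCDF-C++ sketch of the distinction (illustrative only, with made-up file and variable names, not code from BOUT++): a variable with no time dimension only needs a single `putVar`, whereas a record variable gets a new slab at each step.

```cpp
// Illustrative only -- not BOUT++ code. A variable with no time dimension
// needs a single putVar, while a record variable gets one slab per step.
// Re-calling putVar on the fixed variable inside the loop is the kind of
// redundant work described above.
#include <cstddef>
#include <netcdf>
#include <vector>

int main() {
  using namespace netCDF;
  constexpr std::size_t nx = 5;

  NcFile file{"sketch.nc", NcFile::replace};
  auto t_dim = file.addDim("t");      // unlimited (record) dimension
  auto x_dim = file.addDim("x", nx);

  NcVar fixed = file.addVar("grid_x", ncDouble, x_dim);             // time-independent
  NcVar evolving = file.addVar("field", ncDouble, {t_dim, x_dim});  // time-dependent

  const std::vector<double> grid(nx, 1.0);
  const std::vector<double> field(nx, 2.0);

  fixed.putVar(grid.data());  // written once, outside the time loop

  for (std::size_t step = 0; step < 10; ++step) {
    evolving.putVar({step, 0}, {1, nx}, field.data());
    // Writing `fixed` again here on every step would also work, but costs
    // I/O time for data that never changes.
  }
  return 0;
}
```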
I'll try switching to ADIOS tomorrow and see if that makes a difference.
The reason we're writing all these time-independent things every time step is that we reuse `PhysicsModel::output_options` for the whole run. We could maybe do something like clearing that variable after the initial write?
Well, here's an easy speed up:
```diff
modified src/physics/physicsmodel.cxx
@@ -253,6 +253,7 @@ int PhysicsModel::PhysicsModelMonitor::call(Solver* solver, BoutReal simtime,
   solver->outputVars(model->output_options, true);
   model->outputVars(model->output_options);
   model->writeOutputFile();
+  model->output_options = Options{};

   // Call user output monitor
   return model->outputMonitor(simtime, iteration, nout);
```
This halves the I/O time for `conduction`, but doesn't get rid of the slowdown over time. It's plausible there's an issue with netCDF here.
> In the `next` branch there is also an ADIOS2 output option: compiling with `-DBOUT_DOWNLOAD_ADIOS=ON` and setting `output:type=adios` should save an ADIOS2 dataset. See https://bout-dev.readthedocs.io/en/latest/user_docs/adios2.html It would be good to know if this suffered from the same problem, or if it's a NetCDF thing.
I've checked this now, and it is just netCDF -- ADIOS2 is about a factor of 10 faster for the `conduction` example, in fact.
All my profiling points towards `putVar`, or at least a very small region around calls to it, but I can't reproduce the slowdown outside of BOUT++, and I can't see how any of our code would be causing this issue.
Ok, I finally have an MVCE in pure netCDF:

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <format>
#include <iostream>
#include <netcdf>
#include <type_traits>
#include <vector>

// Use high_resolution_clock if it is steady, otherwise fall back to steady_clock
using clock_type =
    typename std::conditional<std::chrono::high_resolution_clock::is_steady,
                              std::chrono::high_resolution_clock,
                              std::chrono::steady_clock>::type;
using seconds = std::chrono::duration<double, std::chrono::seconds::period>;

using namespace netCDF;

constexpr auto loops = 1000;
constexpr auto repeats = 10;
constexpr std::size_t nx = 5;
constexpr std::size_t ny = 5;
constexpr auto filename = "test.nc";

int main() {
  NcFile file{filename, NcFile::replace};

  const std::array<int, nx * ny> arr{
      nx, nx, nx, nx, nx,
      nx, nx, nx, nx, nx,
      nx, nx, nx, nx, nx,
      nx, nx, nx, nx, nx,
      nx, nx, nx, nx, nx
  };

  auto time_dim = file.addDim("t");  // unlimited (record) dimension
  auto x_dim = file.addDim("x", nx);
  auto y_dim = file.addDim("y", ny);
  NcVar arr_var = file.addVar("arr_var", ncInt, {time_dim, x_dim, y_dim});

  std::size_t time = 0;
  for (auto loop = 0; loop < loops; loop++) {
    const std::vector<std::size_t> start{time++};
    const auto started = clock_type::now();
    for (auto repeat = 0; repeat < repeats; repeat++) {
      // Write one time slab and flush to disk on every inner iteration
      arr_var.putVar(start, {1, nx, ny}, arr.data());
      file.sync();
    }
    // file.sync();
    const auto finished = clock_type::now();
    const auto elapsed = finished - started;
    std::cout << std::format("{:3}: {:8.6e}", loop, seconds{elapsed}.count()) << "\n";
  }
  return 0;
}
```
(y axis in seconds here, x axis is outer loop iteration)
Moving the `file.sync()` out of the inner loop massively reduces the slowdown:
Much faster overall. The slowdown is still present when run over a longer period, but it does appear to saturate (the figure here is averaged over a 100-step rolling window):

The improved overall performance makes sense: we're hitting the disk less often. I still don't understand the cause of the slowdown, but it's apparent that it's inside netCDF.
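For reference, the faster variant is just the MVCE above with the sync hoisted out of the inner loop; a condensed version, with the timing and printing omitted:

```cpp
// Same setup as the MVCE above, but file.sync() is called once per outer
// iteration instead of after every putVar (timing/printing omitted).
#include <array>
#include <cstddef>
#include <netcdf>
#include <vector>

int main() {
  using namespace netCDF;
  constexpr auto loops = 1000;
  constexpr auto repeats = 10;
  constexpr std::size_t nx = 5;
  constexpr std::size_t ny = 5;

  NcFile file{"test.nc", NcFile::replace};
  auto time_dim = file.addDim("t");
  auto x_dim = file.addDim("x", nx);
  auto y_dim = file.addDim("y", ny);
  NcVar arr_var = file.addVar("arr_var", ncInt, {time_dim, x_dim, y_dim});

  std::array<int, nx * ny> arr{};
  arr.fill(static_cast<int>(nx));

  std::size_t time = 0;
  for (auto loop = 0; loop < loops; loop++) {
    const std::vector<std::size_t> start{time++};
    for (auto repeat = 0; repeat < repeats; repeat++) {
      arr_var.putVar(start, {1, nx, ny}, arr.data());
    }
    file.sync();  // flush once per outer iteration, i.e. once per `repeats` writes
  }
  return 0;
}
```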
The solution for BOUT++, then, might be to reintroduce the periodic-flush argument instead of flushing every single timestep. This increases the risk of data loss, but should improve performance. We could also try periodically closing and re-opening the file, but that brings its own overheads.

Alternatively, switching to ADIOS2 also gives significant performance improvements.
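As a sketch of the periodic-flush idea in plain netCDF-C++ terms (this is not the actual BOUT++ option or API; `write_step` and `flush_interval` are made-up names for illustration):

```cpp
// Hypothetical sketch, not BOUT++ code: flush to disk only every
// `flush_interval` outputs instead of after every single one. Anything
// written since the last sync is at risk if the run crashes.
#include <cstddef>
#include <netcdf>
#include <vector>

void write_step(netCDF::NcFile& file, netCDF::NcVar& var, const int* data,
                std::size_t step, std::size_t nx, std::size_t ny,
                std::size_t flush_interval) {
  var.putVar({step, 0, 0}, {1, nx, ny}, data);
  if (flush_interval > 0 && (step + 1) % flush_interval == 0) {
    file.sync();  // periodic flush
  }
}

int main() {
  using namespace netCDF;
  constexpr std::size_t nx = 5, ny = 5;

  NcFile file{"periodic_flush.nc", NcFile::replace};
  auto t_dim = file.addDim("t");
  auto x_dim = file.addDim("x", nx);
  auto y_dim = file.addDim("y", ny);
  NcVar var = file.addVar("arr_var", ncInt, {t_dim, x_dim, y_dim});

  const std::vector<int> data(nx * ny, 1);
  for (std::size_t step = 0; step < 100; ++step) {
    write_step(file, var, data.data(), step, nx, ny, /*flush_interval=*/10);
  }
  return 0;
}
```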
I've been running the 2D drift-plane turbulence example in Hermes-3 for many timesteps, and by the 3000th output I/O is responsible for about 70% of the computational time per timestep (compared to 1-2% in the initial timesteps, which have fewer RHS evals).