Interesting! @pnorbert may have some insight here. Could you please post some lines from the log file, near the start and near the end?
In the `next` branch there is also an ADIOS2 output option: compiling with `-DBOUT_DOWNLOAD_ADIOS=ON` and setting `output:type=adios` should save an ADIOS2 dataset. See https://bout-dev.readthedocs.io/en/latest/user_docs/adios2.html

It would be good to know if this suffered from the same problem, or if it's a NetCDF thing.
Here are some lines early on in the simulation:
```
2.500e+04   1289   8.91e-01   67.3   7.1   4.4   3.8   17.4
2.510e+04   1130   7.86e-01   66.9   7.0   4.5   4.3   17.4
2.520e+04   1105   7.72e-01   66.8   7.0   4.4   4.4   17.4
2.530e+04   1027   7.27e-01   65.7   7.1   4.9   4.7   17.7
2.540e+04   1021   7.19e-01   66.0   7.0   4.7   4.7   17.6
2.550e+04   1149   7.99e-01   66.9   7.0   4.6   4.3   17.2
```
...and some at the end:

```
3.996e+05   1165   2.78e+00   20.0   2.3   1.8   68.3   7.6
3.997e+05   1015   2.62e+00   18.2   2.0   1.5   71.6   6.7
3.998e+05   1192   2.75e+00   20.2   2.3   1.7   68.9   7.0
3.999e+05   1058   2.63e+00   18.7   2.1   1.5   71.4   6.4
4.000e+05   1157   2.71e+00   20.0   2.2   1.5   69.2   7.2
```
I/O is the second-to-last column.
If you restart the simulation, does the IO time "reset"?
Yes, it does, as far as I remember. So it is no major issue, as restarting resolves it.
Ok, that's a little reassuring, I guess. Next question: does this appear in non-Hermes-3 models? I can probably check this with something like `blob2d` or `conduction`, as they're pretty fast to run.
I just noticed that a Hermes-2 simulation got OOM-killed after 15 time steps. This is probably indirectly related to #2900, as that increases the memory usage, but it also seems to imply that the memory consumption increases (slightly) over time.
This can be seen even in `conduction` over 500 timesteps (figure: absolute I/O timings, in seconds). The problem is present in both `master` and `next`.
I haven't managed to work out whether there's a memory leak, but the increase in I/O time does appear to be coming mostly from two lines in the profile: the top line is for variables without a time dimension, and the bottom one is for those with one. It turns out we are writing a lot of time-independent variables on every timestep, and this both takes a decent fraction of the total I/O time and increases in cost over the course of the run.
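To make that concrete, here is a minimal netCDF-C++ sketch of the distinction (illustrative only, with made-up file and variable names, not code from BOUT++): a variable with no time dimension only needs a single `putVar`, whereas a record variable gets a new slab at each step.

```cpp
// Illustrative only -- not BOUT++ code. A variable with no time dimension
// needs a single putVar, while a record variable gets one slab per step.
// Re-calling putVar on the fixed variable inside the loop is the kind of
// redundant work described above.
#include <cstddef>
#include <netcdf>
#include <vector>

int main() {
  using namespace netCDF;
  constexpr std::size_t nx = 5;

  NcFile file{"sketch.nc", NcFile::replace};
  auto t_dim = file.addDim("t");      // unlimited (record) dimension
  auto x_dim = file.addDim("x", nx);

  NcVar fixed = file.addVar("grid_x", ncDouble, x_dim);             // time-independent
  NcVar evolving = file.addVar("field", ncDouble, {t_dim, x_dim});  // time-dependent

  const std::vector<double> grid(nx, 1.0);
  const std::vector<double> field(nx, 2.0);

  fixed.putVar(grid.data());  // written once, outside the time loop

  for (std::size_t step = 0; step < 10; ++step) {
    evolving.putVar({step, 0}, {1, nx}, field.data());
    // Writing `fixed` again here on every step would also work, but costs
    // I/O time for data that never changes.
  }
  return 0;
}
```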
I'll try switching to ADIOS tomorrow and see if that makes a difference.
The reason we're writing all these time-independent things every time step is that we reuse `PhysicsModel::output_options` for the whole run. We could maybe do something like clearing that variable after the initial write?
Well, here's an easy speed up:
```diff
modified src/physics/physicsmodel.cxx
@@ -253,6 +253,7 @@ int PhysicsModel::PhysicsModelMonitor::call(Solver* solver, BoutReal simtime,
   solver->outputVars(model->output_options, true);
   model->outputVars(model->output_options);
   model->writeOutputFile();
+  model->output_options = Options{};

   // Call user output monitor
   return model->outputMonitor(simtime, iteration, nout);
```
This halves the I/O time for `conduction`, but doesn't get rid of the slowdown over time. It's plausible there's an issue with netCDF here.
> In the `next` branch there is also an ADIOS2 output option: compiling with `-DBOUT_DOWNLOAD_ADIOS=ON` and setting `output:type=adios` should save an ADIOS2 dataset. See https://bout-dev.readthedocs.io/en/latest/user_docs/adios2.html It would be good to know if this suffered from the same problem, or if it's a NetCDF thing.
I've checked this now, and it is just netCDF -- ADIOS2 is about a factor of 10 faster for the `conduction` example, in fact.
All my profiling points towards `putVar`, or at least a very small region around calls to it, but I can't reproduce the slowdown outside of BOUT++, and I can't see how any of our code would be causing this issue.
Ok, I finally have an MVCE in pure netCDF:

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <format>
#include <iostream>
#include <netcdf>
#include <type_traits>
#include <vector>

// Use high_resolution_clock if it is steady, otherwise fall back to steady_clock
using clock_type =
    typename std::conditional<std::chrono::high_resolution_clock::is_steady,
                              std::chrono::high_resolution_clock,
                              std::chrono::steady_clock>::type;
using seconds = std::chrono::duration<double, std::chrono::seconds::period>;

using namespace netCDF;

constexpr auto loops = 1000;
constexpr auto repeats = 10;
constexpr std::size_t nx = 5;
constexpr std::size_t ny = 5;
constexpr auto filename = "test.nc";

int main() {
  NcFile file{filename, NcFile::replace};

  const std::array<int, nx * ny> arr{
      nx, nx, nx, nx, nx,
      nx, nx, nx, nx, nx,
      nx, nx, nx, nx, nx,
      nx, nx, nx, nx, nx,
      nx, nx, nx, nx, nx
  };

  auto time_dim = file.addDim("t");  // unlimited (record) dimension
  auto x_dim = file.addDim("x", nx);
  auto y_dim = file.addDim("y", ny);
  NcVar arr_var = file.addVar("arr_var", ncInt, {time_dim, x_dim, y_dim});

  std::size_t time = 0;
  for (auto loop = 0; loop < loops; loop++) {
    const std::vector<std::size_t> start{time++};
    const auto started = clock_type::now();
    for (auto repeat = 0; repeat < repeats; repeat++) {
      // Write one time slab and flush to disk on every inner iteration
      arr_var.putVar(start, {1, nx, ny}, arr.data());
      file.sync();
    }
    // file.sync();
    const auto finished = clock_type::now();
    const auto elapsed = finished - started;
    std::cout << std::format("{:3}: {:8.6e}", loop, seconds{elapsed}.count()) << "\n";
  }
  return 0;
}
```
(y axis in seconds here, x axis is outer loop iteration)
Moving the `file.sync()` out of the inner loop massively reduces the slowdown:
Much faster overall. The slowdown is still present when run over a longer period, but it does appear to saturate (the figure here is averaged over a 100-step rolling window):

The improved overall performance makes sense: we're hitting the disk less often. I still don't understand the cause of the slowdown, but it's apparent that it's inside netCDF.
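For reference, the faster variant is just the MVCE above with the sync hoisted out of the inner loop; a condensed version, with the timing and printing omitted:

```cpp
// Same setup as the MVCE above, but file.sync() is called once per outer
// iteration instead of after every putVar (timing/printing omitted).
#include <array>
#include <cstddef>
#include <netcdf>
#include <vector>

int main() {
  using namespace netCDF;
  constexpr auto loops = 1000;
  constexpr auto repeats = 10;
  constexpr std::size_t nx = 5;
  constexpr std::size_t ny = 5;

  NcFile file{"test.nc", NcFile::replace};
  auto time_dim = file.addDim("t");
  auto x_dim = file.addDim("x", nx);
  auto y_dim = file.addDim("y", ny);
  NcVar arr_var = file.addVar("arr_var", ncInt, {time_dim, x_dim, y_dim});

  std::array<int, nx * ny> arr{};
  arr.fill(static_cast<int>(nx));

  std::size_t time = 0;
  for (auto loop = 0; loop < loops; loop++) {
    const std::vector<std::size_t> start{time++};
    for (auto repeat = 0; repeat < repeats; repeat++) {
      arr_var.putVar(start, {1, nx, ny}, arr.data());
    }
    file.sync();  // flush once per outer iteration, i.e. once per `repeats` writes
  }
  return 0;
}
```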
The solution for BOUT++, then, might be to reintroduce the periodic-flush argument instead of flushing every single timestep. This increases the risk of data loss, but should improve performance. We could also try periodically closing and re-opening the file, but that brings its own overheads.

Alternatively, switching to ADIOS2 also gives significant performance improvements.
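As a sketch of the periodic-flush idea in plain netCDF-C++ terms (this is not the actual BOUT++ option or API; `write_step` and `flush_interval` are made-up names for illustration):

```cpp
// Hypothetical sketch, not BOUT++ code: flush to disk only every
// `flush_interval` outputs instead of after every single one. Anything
// written since the last sync is at risk if the run crashes.
#include <cstddef>
#include <netcdf>
#include <vector>

void write_step(netCDF::NcFile& file, netCDF::NcVar& var, const int* data,
                std::size_t step, std::size_t nx, std::size_t ny,
                std::size_t flush_interval) {
  var.putVar({step, 0, 0}, {1, nx, ny}, data);
  if (flush_interval > 0 && (step + 1) % flush_interval == 0) {
    file.sync();  // periodic flush
  }
}

int main() {
  using namespace netCDF;
  constexpr std::size_t nx = 5, ny = 5;

  NcFile file{"periodic_flush.nc", NcFile::replace};
  auto t_dim = file.addDim("t");
  auto x_dim = file.addDim("x", nx);
  auto y_dim = file.addDim("y", ny);
  NcVar var = file.addVar("arr_var", ncInt, {t_dim, x_dim, y_dim});

  const std::vector<int> data(nx * ny, 1);
  for (std::size_t step = 0; step < 100; ++step) {
    write_step(file, var, data.data(), step, nx, ny, /*flush_interval=*/10);
  }
  return 0;
}
```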
I've been running the 2D drift-plane turbulence example in Hermes-3 for many timesteps, and by the 3000th output I/O is responsible for about 70% of the computational time per timestep (compared to 1-2% in the initial timesteps, which have fewer RHS evals).