boutproject / BOUT-dev

BOUT++: Plasma fluid finite-difference simulation code in curvilinear coordinate systems
http://boutproject.github.io/

DataFile rewrite design #1254

Closed bendudson closed 11 months ago

bendudson commented 6 years ago

There are various issues related to the current DataFile implementation, e.g. issues #222, #221, #102, #412, #374, #367, #644. The current design is convenient in many cases, but leads to nasty surprises when more complicated things are attempted. As a reminder, in the current system variables are registered with the DataFile (which stores a pointer to them), and all registered variables are written together at each output time.

This makes sense for data which is all output together at fixed time points (e.g. classic NetCDF), but is too restrictive if we want more than one time index (e.g. high, low frequency). It also makes adding diagnostics cumbersome, since temporary variables must be created and added to store the diagnostic values.

I think a new design should have clearer read() and write() semantics, so that reading a value becomes something like:

Field3D var = dump.read("someVar");

and writing a value is something like:

dump.write("someVar", new_value);

There are many different ways of doing this, and several ways we could handle appending time indices. Probably good to look at how other C++11 libraries handle things. For example: https://github.com/BlueBrain/HighFive

bendudson commented 6 years ago

Quickly put together a minimal working version of an interface in branch "new-fileio".

Header file code: https://github.com/boutproject/BOUT-dev/blob/new-fileio/include/bout/fileio.hxx

Example code: https://github.com/boutproject/BOUT-dev/blob/new-fileio/examples/fileio/test-fileio.cxx

I quite like being able to read data using int value = dump["val"];, but am less sure about dump["val"] = 3;. It's concise, but it's assigning an int to a Dataset, so what should happen when a Dataset is assigned to another Dataset? Something more explicit might be

dump["val"].setValue(3);

which is explicit but a bit ugly, or maybe have another class (DatasetValue?) so that we write:

dump["val"].value = 3;
...
int value = dump["val"].value;

Perhaps that might be clearer, since Dataset will also contain attributes, dimensions etc.

Should assigning one Dataset to another be allowed? What should happen here?

Dataset data = dump["val"];
data = dump["another"];

I think a reasonable interpretation is that "val" in the file is modified to have the same value and attributes as "another". Someone could also read it as just changing data to refer to a different variable in the file, not modifying any data.
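To make the question concrete, here's a toy sketch (not the code in the new-fileio branch) where a std::map stands in for the file and Dataset is the proxy returned by operator[]; it implements the "copy the value" interpretation:

#include <iostream>
#include <map>
#include <string>

class Dump {
  std::map<std::string, int> storage; // stand-in for the variables in the file

public:
  class Dataset {
    Dump& dump;
    std::string name;

  public:
    Dataset(Dump& dump_in, std::string name_in) : dump(dump_in), name(std::move(name_in)) {}

    // dump["val"] = 3;  -- write a plain value into the "file"
    Dataset& operator=(int value) {
      dump.storage[name] = value;
      return *this;
    }

    // data = dump["another"];  -- the "copy the value" interpretation:
    // the variable this proxy refers to takes on the other variable's value
    Dataset& operator=(const Dataset& other) {
      dump.storage[name] = other.dump.storage.at(other.name);
      return *this;
    }

    // int value = dump["val"];  -- read the value back out
    operator int() const { return dump.storage.at(name); }
  };

  Dataset operator[](const std::string& name) { return Dataset(*this, name); }
};

int main() {
  Dump dump;
  dump["val"] = 3;
  dump["another"] = 42;

  Dump::Dataset data = dump["val"];
  data = dump["another"]; // with this interpretation, "val" in the file becomes 42

  std::cout << int(dump["val"]) << "\n"; // prints 42
}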

ZedThree commented 3 years ago

I've started looking at this in earnest now, and there are some design decisions we need to make. This post is quite long, but there's a fair bit to consider, and if we're going to break things, it's probably best to make sure we do think about everything!

To summarise some downsides of the current approach:

Given that fixing basically any one of these requires significant changes, let's consider a complete rewrite, including breaking changes to the API and physics models. It would be nice if we could mechanically upgrade physics models to use whatever replacement we decide on.

We have one option already: OptionsNetCDF. I think this evolved out of Ben's proposal above. It basically reads an entire netCDF file into a new Options instance, or takes an Options and writes it to a file immediately. I've started attempting to replace DataFile with it, and it definitely looks like an improvement.

Another option might be an interface that immediately reads/writes values, rather than a whole Options tree.

dump.write("field", Field3D{1.0});
Field3D field = dump.read("field");

We'd probably want to use a variant to store the different types we use, which means we're very likely to want to reuse a fair amount of the code in Options for handling conversions. Now that I think about it, this could probably easily work with OptionsNetCDF:

template <typename T> // but constrained to Options::ValueType members?
void OptionsNetCDF::write(const std::string& name, T value) {
  Options options;
  options[name] = std::move(value); // still involves a copy, just in the function signature
  write(options);
}

template <typename T>
T OptionsNetCDF::read(const std::string& name) {
  return read()[name]; // pretty sure this still has a copy
}

In either case, we have a slight stumbling block when it comes to writing auxiliary time-evolving variables (that is, variables not added to the solver). For time-evolving variables, Solver::outputVars can do the copying/writing without changing user code. PhysicsModel already has a virtual postInit method that can be overridden by the user -- however, to use it effectively the user would have to make sure they call PhysicsModel::postInit() in order to handle the restart file (or copy that code, of course). We might prefer to add a new virtual method that allows the user to output any other variables they like. Such a method would end up looking like:

void MyModel::outputVars() {  // declared 'override' in the class definition
  // OptionsNetCDF API
  Options options;     // could be passed in or a member of PhysicsModel?
  options["field"] = Field3D{1.0};
  options["field"].attributes["time_dimension"] = "t";
  dump.write(options); // or alternatively return options and have this call within the library?

  // Alternative, immediate-write API
  dump.write("field", Field3D{1.0}, DumpFile::time_evolving); // Needs an argument for _which_ time dimension?
}

I think it should be possible to scan a user's physics model source code and move dump.add() calls to such a function and translate them to a new API -- assuming we could work out which calls are intended to be writes, rather than reads!

A variation on the above is to pass an Options into PhysicsModel::init/rhs and have the user add variables to that (possibly returning the Options to be passed to the solver? This would definitely be a bigger refactor). This might be much worse though, as it would involve copying into the Options every internal timestep. The first option of a separate method would only need to be called every output timestep.

Other things to consider:

Phew! I'm sure I've still missed something important. I'm very interested to hear people's thoughts on any of this. I think I'm going to press on trying to replace DataFile with OptionsNetCDF to see if any other issues crop up.

johnomotani commented 3 years ago

Scattered thoughts:

johnomotani commented 3 years ago

We might need to be a bit careful about exactly when some quantities are written. For example, in STORM we normalise the metric components in the PhysicsModel::init() method, so writing of Coordinates quantities ideally should happen after init() is called (rather than, e.g. in the Coordinates constructor, which would seem like a natural place).

ZedThree commented 3 years ago

Thanks John, some really useful points to think about there. Some more scattered thoughts in response!

ZedThree commented 3 years ago

So a pretty severe limitation of netCDF is not being able to get the current extent of individual variables, only the maximum extent of an unlimited dimension (i.e. time). I've opened an issue about it, maybe they will add something -- but even if they do, we would then rely on a bleeding edge version of netCDF.

I've bumped into this problem again due to keeping an OptionsNetCDF instance around for the immediate-write API. Ben's original workaround for the above netcdf issue was to have a map<int, size_t> to cache the size of the time dimension. This works when writing an entire Options to file using a temporary OptionsNetCDF, basically only this idiom:

OptionsNetCDF(filename).write(options);

and the following won't work:

OptionsNetCDF file(filename);
file.write(options);
file.write(options);

I can see two potential ways around this:

  1. Drop down to HDF5. If we additionally rely on HDF5 (and we sort of do so implicitly anyway, as netcdf-cxx4 requires it), we can use H5Sget_simple_extent_dims to get this information -- although that has the limitation that the file first needs to be closed by netCDF (see the sketch after this list).

  2. Keep track of the last index written in an attribute.
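A rough sketch of what option 1 could look like (not code from the branch; the HDF5 calls are the standard C API, and the file would have to be closed by netCDF before calling this):

#include <hdf5.h>
#include <vector>

// Query the current extent of a single variable by dropping down to the HDF5 C API,
// which (unlike the netCDF API) exposes per-variable extents.
std::vector<hsize_t> variable_extent(const char* filename, const char* varname) {
  hid_t file = H5Fopen(filename, H5F_ACC_RDONLY, H5P_DEFAULT);
  hid_t dataset = H5Dopen2(file, varname, H5P_DEFAULT);
  hid_t dataspace = H5Dget_space(dataset);

  const int ndims = H5Sget_simple_extent_ndims(dataspace);
  std::vector<hsize_t> dims(ndims);
  H5Sget_simple_extent_dims(dataspace, dims.data(), nullptr);

  H5Sclose(dataspace);
  H5Dclose(dataset);
  H5Fclose(file);

  // dims[0] is then the number of records already written for this variable
  return dims;
}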

I've opted for the second option, which results in the time-dependent variable writing bit looking like this:

          // NetCDF doesn't keep track of the current time index for each
          // variable (although the underlying HDF5 file does!), so we
          // need to do it ourselves. We'll use an attribute in the
          // file to do so, which means we don't need to keep track of
          // it in the code
          int current_time_index;
          const auto atts_map = var.getAtts();
          const auto it = atts_map.find("current_time_index");
          if (it == atts_map.end()) {
            // Attribute doesn't exist, so let's start at zero. There
            // are various ways this might break, for example, if the
            // variable was added to the file by a different
            // program. But note, that if we use the size of the time
            // dimension here, this will increase every time we add a
            // new variable! So zero is probably the only sensible way
            // to do this
            current_time_index = 0;
          } else {
            it->second.getValues(&current_time_index);
          }

          std::vector<size_t> start_index; ///< Starting index where data will be inserted
          std::vector<size_t> count_index; ///< Size of each dimension

          // Dimensions, including time
          for (const auto& dim : dims) {
            start_index.push_back(0);
            count_index.push_back(dim.getSize());
          }
          // Time dimension
          start_index[0] = current_time_index;
          count_index[0] = 1; // Writing one record

          fmt::print("Writing {} at time {}\n", name, current_time_index);

          // Put the data into the variable
          bout::utils::visit(NcPutVarCountVisitor(var, start_index, count_index),
                             child.value);

          // Make sure to update the time index, including incrementing it!
          var.putAtt("current_time_index", ncInt, ++current_time_index);

This will break if someone adds a time-dependent variable to the file without adding the current_time_index attribute. We'll just start overwriting it from the beginning index.

The first option is likely more robust, but as we'll need to close the file first, I think that might be quite complicated in practice.

ZedThree commented 3 years ago

For context, the current Ncxx4 implementation of DataFormat tracks the latest time index for each variable with a map in the Ncxx4 instance (whereas H5Format can just query this information). Every time that Ncxx4::write_rec is called for a variable, it increments the corresponding index.

After chatting with @johnomotani, we realised that one problem with my proposed method above is that it's possible for different variables to get out of sync, whereas with Ncxx4, this isn't possible because we are forced to write all the variables at once.

John suggested a dump.extend_dimension("t") method that extends the dimension by one, with the write method always writing to the latest index of that dimension. We would then just require that there be exactly one call to extend_dimension per timestep. Missing the call would result in overwriting data; an extra call would result in an "empty" timestep. For the default time dimension, we'd call it in the library, so the user would only need to call it for time dimensions with a different frequency. That sounds reasonable to me, but I'd be interested in hearing other viewpoints.

A separate thought, but one I want to write down somewhere -- we have a time dimension t which is timestep index, and a separate variable t_array which is simulation time -- should we just make t == t_array? I'm not sure what it breaks if we do, but might be interesting to do

johnomotani commented 3 years ago

A separate thought, but one I want to write down somewhere -- we have a time dimension t which is timestep index, and a separate variable t_array which is simulation time -- should we just make t == t_array? I'm not sure what it breaks if we do, but might be interesting to do

I think a netCDF dimension is only an array-size: the 'dimension' has to be a list of consecutive integers starting from 0. https://www.unidata.ucar.edu/software/netcdf/docs/group__dimensions.html#details So it couldn't represent t_array, which is an array of floats.

ZedThree commented 3 years ago

That's the dimension IDs, rather than the dimensions themselves. It turns out if you just define a dimension like we currently do, netCDF basically just makes an empty dataset. If you then also define a variable with the same name, you can then give it values. Here's one of their examples demonstrating this: https://www.unidata.ucar.edu/software/netcdf/docs/pres__temp__4D__wr_8c_source.html

Thinking a little bit more about keeping track of the last time index for each variable -- instead of needing to call dump.extend_dimension("t") at the start of each output timestep, what about having a dump.finish_timestep("t") that checks all variables with the t dimension have the same size? Forgetting to call it wouldn't introduce any new problems, although it wouldn't be able to catch any. For the main dump files, we can probably do this in the library to completely automate it.
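A rough sketch of what that check could look like, assuming we keep the current_time_index attribute from the workaround above (the function name and signature are just placeholders):

#include <netcdf>
#include <string>

// Check that every variable whose first dimension is `time_dim` has been written
// the same number of times, using the current_time_index attribute from above.
bool finish_timestep(netCDF::NcFile& file, const std::string& time_dim = "t") {
  int expected = -1;
  for (const auto& name_var : file.getVars()) {
    const auto& var = name_var.second;
    const auto dims = var.getDims();
    if (dims.empty() || dims[0].getName() != time_dim) {
      continue; // not time-evolving on this dimension
    }
    int index = 0;
    auto atts = var.getAtts();
    const auto it = atts.find("current_time_index");
    if (it != atts.end()) {
      it->second.getValues(&index);
    }
    if (expected == -1) {
      expected = index;
    } else if (index != expected) {
      return false; // variables have got out of sync
    }
  }
  return true;
}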

johnomotani commented 3 years ago

That's the dimension IDs, rather than the dimensions themselves. It turns out if you just define a dimension like we currently do, netCDF basically just makes an empty dataset. If you then also define a variable with the same name, you can then give it values. Here's one of their examples demonstrating this: https://www.unidata.ucar.edu/software/netcdf/docs/pres__temp__4D__wr_8c_source.html

Ah, OK - it's a convention https://www.unidata.ucar.edu/software/netcdf/workshops/2011/datamodels/NcCVars.html (a very logical one!), and a convention that's used by xarray. In xbout we already rename t_array to t when loading for exactly this reason.

johnomotani commented 3 years ago

Thinking a little bit more about keeping track of the last time index for each variable -- instead of needing to call dump.extend_dimension("t") at the start of each output timestep, what about having a dump.finish_timestep("t") that checks all variables with the t dimension have the same size? Forgetting to call it wouldn't introduce any new problems, although it wouldn't be able to catch any. For the main dump files, we can probably do this in the library to completely automate it.

That seems sensible to me as long as the main library does take care of the checking for the standard output t dimension. Maybe dump.finish_timestep("t") could return the current length of the dimension, then the main library could check it against the iteration count for extra safety?

If a user (or e.g. FastOutput) wants to define a different unlimited dimension and append to it, they can be responsible for adding the check themselves. finish_timestep() seems like a good plan - the check can be included for 'production code' but doesn't matter too much if it's skipped for quick debugging, etc. :+1:

BTW I just found out (https://www.unidata.ucar.edu/software/netcdf/docs/unlimited_dims.html) that in netCDF-4 files you're allowed any number of unlimited dimensions, but in netCDF 'classic' you can only have one - another reason to only support netCDF-4 in v5.

ZedThree commented 3 years ago

I've made some decent progress on this, but there are (at least) three points that need a bit of thinking about:

  1. What should the new API look like?
  2. Who owns the output file?
  3. What's the replacement for dump.addRepeat?

For the first question, my current design adds:

  /// Output additional variables other than the evolving variables
  virtual void outputVars(MAYBE_UNUSED(Options& options)) {}
  /// Add additional variables other than the evolving variables to the restart files
  virtual void restartVars(MAYBE_UNUSED(Options& restart)) {}

to PhysicsModel. These get called in PhysicsModel::postInit and PhysicsModel::PhysicsModelMonitor::call. This means auxiliary variables could be calculated only when needed.

I'm hoping this will make it reasonably easy to automate an upgrade. dump.add calls get moved to outputVars and translated to options[name] = var, while restart.add calls get moved to restartVars and translated.
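For illustration, an upgraded model might then look something like this (MyModel and phi are made up; the outputVars signature is the one proposed above):

#include <bout/physicsmodel.hxx>

class MyModel : public PhysicsModel {
  Field3D n, vort; // evolving variables
  Field3D phi;     // auxiliary diagnostic, not handed to the solver

protected:
  int init(bool) override {
    SOLVE_FOR(n, vort);
    return 0;
  }
  int rhs(BoutReal) override {
    // ... calculate phi and the time derivatives ...
    return 0;
  }
  // Replaces dump.addRepeat(phi, "phi") / SAVE_REPEAT(phi)
  void outputVars(Options& options) override {
    options["phi"] = phi;
    options["phi"].attributes["time_dimension"] = "t";
  }
};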

Does the separation into two methods make sense?


For the second point, currently PhysicsModel owns the output file: either PhysicsModel::PhysicsModelMonitor::call calls the outputVars method on other objects to populate the Options to write out, or, for the monitors, there's now Solver::writeToModelOutputFile(const Options& options), which immediately writes options to the output file.

Solver::writeToModelOutputFile is needed because the monitors (e.g. BoutMonitor, which writes timing information) only have access to the solver, and not to the model.

Currently, it is definitely easiest for either the model or the solver to own the output file, as they have pointers to each other, and can easily call mesh and coordinates methods. Ideally, I think there'd be a Simulation class that owned the mesh, solver, model, and output file. Currently we can't (easily) pass in the output file to the physics model because most (all?) user models don't have a constructor. We could add a physics model constructor that took the output file (and anything else we want), if we simply added

using PhysicsModel::PhysicsModel;

to user models. This is definitely possible to automate, and we already have a physics model upgrader tool.

I'm trying to encapsulate things enough that such a refactoring wouldn't be too hard.


dump.addRepeat(field, "field") currently translates to:

options["field"] = field;   // Need to use .force() for ints and BoutReals
options["field"].attributes["time_dimension"] = "t";

which is both a bit unwieldy, and has more surface area for mistakes. I keep coming back to something like:

template <class T>
void Options::assignRepeat(T value, std::string time_dimension = "t", std::string source = "") {
  force(value, source);
  attributes["time_dimension"] = time_dimension;
}

options["field"].assignRepeat(field);

which I'm not overly happy with, but might be the best we can do? There's still the risk of mixing up time_dimension and source -- not sure which way round is best.


I'd love to hear people's thoughts on any of these points.

ZedThree commented 3 years ago

A complication for automatic upgrading: some variables are only conditionally written. Here's an example from test-drift-instability:

    if (evolve_ajpar) {
      solver->add(Ajpar, "Ajpar");
      comms.add(Ajpar);
      output.write("ajpar\n");
    } else {
      initial_profile("Ajpar", Ajpar);
      if (ZeroElMass) {
        SAVE_REPEAT(Ajpar); // output calculated Ajpar
      }
    }

That's basically going to be impossible to correctly translate with regexp. There is a clang Python API that might make it possible, but it's still probably going to be tricky.

I think @bendudson suggested storing pointers in PhysicsModel similar to the way DataFile currently works, as a sort of stop-gap measure -- looks like that might be necessary.

bendudson commented 3 years ago

Thanks @ZedThree I think there might be a way to keep some kind of backward compatibility:

  1. Create a new class with a similar interface to DataFile, and make an instance of it (dump) in the PhysicsModel class
    
    class DataFileFacade {
    ...
    saveRepeat(...);
    saveOnce(...);
    /// Store pointers to variables as DataFile does now
    };

class PhysicsModel { ... DataFileFacade dump; }


2. Use a virtual function with a default implementation which returns the data to be written, taken from the `DataFileFacade` object:

class PhysicsModel {
  ...
  virtual Options outputData() {
    Options options;
    for (const auto& [name, varptr] : dump) {
      options[name] = *varptr;
    }
    return options;
  }
};


So if a user wanted to write different things to output, they could implement this `outputData` method, but the old method would also work: A call to `SAVE_REPEAT(Ajpar)` would store a pointer in `DataFileFacade dump`, which would then be put into the `Options` tree in `outputData`. 

I'm undecided whether passing in an `Options&` to modify would be better than constructing and returning an `Options`, but think this way might be cleaner.

One thing I'm not sure about is how to add multiple monitors. A different virtual function for high frequency output (`outputDataFast`?), or an argument to `outputData` with some kind of tag to indicate which monitor it is?

ZedThree commented 3 years ago

That looks good! I'll have a go at implementing that. One slightly annoying thing is that we basically want the Options::value_type variant, but pointers instead of values. That would simplify the DataFileFacade a lot. It doesn't look like there's a way to transform a variant or iterate over its types, which is a pity, so it'll just have to duplicate the type list.
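For the record, roughly what I have in mind for the facade's storage (sketch only; the names are made up, the type list just mirrors Options::ValueType by hand, and it assumes the usual BOUT++ field/array headers are included):

#include <string>
#include <variant>
#include <vector>

struct DataFileFacade {
  // Pointer analogue of Options::ValueType, duplicated by hand since we can't
  // transform the variant's type list automatically
  using ValuePointer = std::variant<bool*, int*, BoutReal*, std::string*, Field2D*,
                                    Field3D*, FieldPerp*, Array<BoutReal>*,
                                    Matrix<BoutReal>*, Tensor<BoutReal>*>;
  struct StoredValue {
    std::string name;
    ValuePointer value;
    bool time_evolving; // added with SAVE_REPEAT / addRepeat?
  };
  std::vector<StoredValue> data;

  template <class T>
  void addRepeat(T& value, const std::string& name) {
    data.push_back({name, &value, true});
  }
  template <class T>
  void addOnce(T& value, const std::string& name) {
    data.push_back({name, &value, false});
  }
};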

I'm undecided whether passing in an Options& to modify would be better than constructing and returning an Options, but think this way might be cleaner.

We currently don't have a way to "flat-merge" Options (if that makes sense), so I'm currently passing in the Options& to be modified.

bendudson commented 3 years ago

I would also like to see a way to decouple the PhysicsModel and Solver classes, which currently have pointers to each other (bear with me, I think it's sort-of related). I think it should be possible for the PhysicsModel to not have a pointer to Solver, but rather have one method which gets the state (probably as an Options tree :)) and the rhs which takes the state as input. The existing calls to SOLVE_FOR etc. would register pointers in something like DataFileFacade, allowing backward compatibility.

bendudson commented 3 years ago

Ah yes, I guess a flat-merge would be needed at the moment. Eventually it would be nice to have each PhysicsModel output in a tree, but XArray doesn't handle groups? A flat merge should be fairly straightforward to write, but I'm not sure how (in-)efficient it would be.
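Something like this sketch is roughly what I'd imagine for the flat merge (assuming Options nodes can be copy-assigned; if not, the value and attributes members would need copying explicitly):

// Copy every value from `source` into `target` at the same level, flattening any
// sections rather than nesting them (so names could collide between sections)
void flatMerge(Options& target, const Options& source) {
  for (const auto& [name, child] : source.getChildren()) {
    if (child.isValue()) {
      target[name] = child;
    } else {
      flatMerge(target, child);
    }
  }
}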

ZedThree commented 3 years ago

I think the issue is that xarray doesn't read groups out of the box, but if you know what groups to read, it works?

But yes, I would also like more structure in the output file! We were talking about saving the inputs from each restart in some group structure too.

bendudson commented 3 years ago

Had a thought about the different monitors: When making quantities in the output time-dependent, the label of the time coordinate is specified. The outputData function (or whatever it's called) could be passed the name of the time coordinate being extended?

Options outputData(const string& time_dimension);

The default is "t", but if a monitor is created which outputs at a different rate, it would use a different time dimension?

ZedThree commented 3 years ago

Ah good point! We'll need to keep track of time_dimension somehow -- it will either need to be stored in the Monitor base class, or Solver stores a vector of struct MonitorInfo { Monitor* monitor; std::string time_dimension; } instead of just Monitor*. Either one is doable.

Also, I realised that merging Options isn't essential -- as long as the OptionsNetCDF is appending, we can just write the Options result by itself, no need to wait till we've got one big Options to write all at once. There may be performance considerations, but it would work.

johnomotani commented 3 years ago

Ah yes, I guess a flat-merge would be needed at the moment. Eventually it would be nice to have each PhysicsModel output in a tree, but XArray doesn't handle groups?

xarray's open_dataset() and open_mfdataset() functions take a group argument so they can open a single group from a netcdf file. If we have multiple groups, we'd probably want to load all of them and merge them into a single xarray Dataset (e.g. if we have separate groups for 'Coordinates' and 'PhysicsModel variables', we'd still want all the variables to be defined on the same dimensions so we can combine them together in various ways). At the moment that would mean opening the files once for each group that we need to read, which is doable (in xBOUT) but probably not the most efficient.

There's been an on-and-off debate on xarray about how to handle groups - as far as I understand it's not resolved because there's not an obvious solution that works for every use case. We could propose some new functionality for xarray to make things more efficient though, especially if we have a workaround that's acceptable in the meantime. For example, being able to load and merge multiple groups (optionally with some prefix on the variable names or some other way of recording which variables belong to which group) while only opening the file once would go a long way to letting us do (efficiently) whatever we want with xBOUT. That seems generic enough that I think the xarray devs might be open to a PR adding it, but it'd probably take at least a day or two working on it, so I'd like to be sure of what we want/need before starting.

ZedThree commented 3 years ago

The replacement of DataFile with OptionsNetCDF for the main dump file is almost complete. Still need to replace it in GridFile, but I'm hoping that's fairly self-contained and easy. A refactoring of Mesh::get is sort of orthogonal to this.

One thing I've ignored so far is the various options to DataFile:

  OPTION(opt, parallel, false); // By default no parallel formats for now
  OPTION(opt, flush, true);     // Safer. Disable explicitly if required
  OPTION(opt, guards, true);    // Compatible with old behavior
  OPTION(opt, floats, false); // High precision by default
  OPTION(opt, openclose, true); // Open and close every write or read
  OPTION(opt, enabled, true);
  OPTION(opt, init_missing, false); // Initialise missing variables?
  OPTION(opt, shiftoutput, false); // Do we want to write 3D fields in shifted space?
  OPTION(opt, shiftinput, false); // Do we want to read 3D fields in shifted space?
  OPTION(opt, flushfrequency, 1); // How frequently do we flush the file

Of these, I've only kept enabled so far, and not as a general option, just specifically for the output and restart files.

For the others:

Which just leaves shiftinput/shiftoutput. These are probably useful? At least shiftinput could be necessary in order to read older files?

johnomotani commented 3 years ago

shiftinput is only useful (I think) for people moving from v3 to v5, although I guess there might be some people from LLNL or Japan who want to do that. If it's not super easy to implement, it'd probably be straightforward to make a pre-processing tool with xBOUT.

I'm not sure what guards was intended for initially, but it could be useful to have the option to not save the guard cells. It would save quite a lot of disk space, and might also speed up collecting data because we'd be reading contiguous arrays rather than needing to slice off guard cells. If we do implement this, I vote for always including boundary cells, even if we drop guard cells.

ZedThree commented 3 years ago

Yeah, guards does sound useful -- always including boundaries would be essential. Sounds like that could wait for another PR though :)

Ok, next stumbling block: GridFromFile. The bulk of the complexity here is in calculating the offsets to read the appropriate hyperplane from file for a Field2D/3D/Perp on a given processor.

Off the top of my head, there's two routes we could go down:

  1. Read the entire grid file into an Options with OptionsNetCDF, and then take the appropriate hyperplanes from the Matrix/Tensor
  2. Move the hyperplane functionality into OptionsNetCDF

The first option is probably easy to implement, but would mean we necessarily have to read in the entire gridfile on every processor.

The second option is probably nicer in terms of memory use, etc., and is probably more amenable to eventually using parallel netCDF again down the line, but it poses more questions in terms of API.

Maybe just an overload of read that takes a const Mesh&?

Options data = OptionsNetCDF{filename}.read(mesh);

(maybe optional rank too for testing?)
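To give a flavour of option 2, here's a very rough sketch for a Field3D-shaped variable, ignoring the guard/boundary-cell subtleties that make GridFromFile complicated in practice. It assumes Mesh::OffsetX/OffsetY/OffsetZ and LocalNx/LocalNy/LocalNz describe this processor's block, that file is the open netCDF::NcFile inside OptionsNetCDF, and that the usual BOUT++ and netcdf-cxx4 headers are included:

Tensor<BoutReal> readLocalBlock(netCDF::NcFile& file, const std::string& name,
                                const Mesh& mesh) {
  const netCDF::NcVar var = file.getVar(name);

  // Start/count offsets in the same spirit as start_index/count_index on the write side
  const std::vector<size_t> start = {static_cast<size_t>(mesh.OffsetX),
                                     static_cast<size_t>(mesh.OffsetY),
                                     static_cast<size_t>(mesh.OffsetZ)};
  const std::vector<size_t> count = {static_cast<size_t>(mesh.LocalNx),
                                     static_cast<size_t>(mesh.LocalNy),
                                     static_cast<size_t>(mesh.LocalNz)};

  Tensor<BoutReal> local(mesh.LocalNx, mesh.LocalNy, mesh.LocalNz);
  var.getVar(start, count, &local(0, 0, 0));
  return local;
}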

ZedThree commented 3 years ago

Oh, the other difficulty with option 2 is that when we read in the data in OptionsNetCDF, we read 1D data into an Array, 2D data into Matrix, and 3D data into Tensor, so we can't distinguish between Field2D and FieldPerp. That much at least is necessary for choosing the right offsets and extents.

There's also the question of older files needing FFTs.

bendudson commented 11 months ago

This is now done (#2209), and supports both NetCDF and ADIOS2 output (#2766). The next step will probably be to remove the DataFileFacade backward-compatibility layer.