boutproject / BOUT-dev

BOUT++: Plasma fluid finite-difference simulation code in curvilinear coordinate systems
http://boutproject.github.io/

DataFile rewrite design #1254

Closed bendudson closed 11 months ago

bendudson commented 6 years ago

There are various issues related to the current DataFile implementation, e.g. issues #222, #221, #102, #412, #374, #367, #644. The current design is convenient in many cases, but leads to nasty surprises when more complicated things are attempted. As a reminder, in the current system variables are registered with the DataFile (which stores a pointer to them), and all registered variables are written together at each output time.

This makes sense for data which is all output together at fixed time points (e.g. classic NetCDF), but is too restrictive if we want more than one time index (e.g. high, low frequency). It also makes adding diagnostics cumbersome, since temporary variables must be created and added to store the diagnostic values.

I think a new design should have clearer read() and write() semantics, so that reading a value becomes something like:

Field3D var = dump.read("someVar");

and writing a value is something like:

dump.write("someVar", new_value);

There are many different ways of doing this, and several ways we could handle appending time indices. Probably good to look at how other C++11 libraries handle things. For example: https://github.com/BlueBrain/HighFive

bendudson commented 6 years ago

Quickly put together a minimal working version of an interface in branch "new-fileio".

Header file code: https://github.com/boutproject/BOUT-dev/blob/new-fileio/include/bout/fileio.hxx

Example code: https://github.com/boutproject/BOUT-dev/blob/new-fileio/examples/fileio/test-fileio.cxx

I quite like being able to read data using int value = dump["val"];, but am less sure about dump["val"] = 3;. It's concise, but it's assigning an int to a Dataset, so what should happen when a Dataset is assigned to another Dataset? Something more explicit might be

dump["val"].setValue(3);

which is explicit but a bit ugly, or maybe have another class (DatasetValue?) so that we write:

dump["val"].value = 3;
...
int value = dump["val"].value;

Perhaps that might be clearer, since Dataset will also contain attributes, dimensions etc.

Should assigning one Dataset to another be allowed? What should happen here?

Dataset data = dump["val"];
data = dump["another"];

I think a reasonable interpretation is that "val" in the file is modified to have the same value and attributes as "another". Someone could also read it as just changing data to refer to a different variable in the file, not modifying any data.
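To make the question concrete, here's a toy sketch (not the code in the new-fileio branch) where a std::map stands in for the file and Dataset is the proxy returned by operator[]; it implements the "copy the value" interpretation:

#include <iostream>
#include <map>
#include <string>

class Dump {
  std::map<std::string, int> storage; // stand-in for the variables in the file

public:
  class Dataset {
    Dump& dump;
    std::string name;

  public:
    Dataset(Dump& dump_in, std::string name_in) : dump(dump_in), name(std::move(name_in)) {}

    // dump["val"] = 3;  -- write a plain value into the "file"
    Dataset& operator=(int value) {
      dump.storage[name] = value;
      return *this;
    }

    // data = dump["another"];  -- the "copy the value" interpretation:
    // the variable this proxy refers to takes on the other variable's value
    Dataset& operator=(const Dataset& other) {
      dump.storage[name] = other.dump.storage.at(other.name);
      return *this;
    }

    // int value = dump["val"];  -- read the value back out
    operator int() const { return dump.storage.at(name); }
  };

  Dataset operator[](const std::string& name) { return Dataset(*this, name); }
};

int main() {
  Dump dump;
  dump["val"] = 3;
  dump["another"] = 42;

  Dump::Dataset data = dump["val"];
  data = dump["another"]; // with this interpretation, "val" in the file becomes 42

  std::cout << int(dump["val"]) << "\n"; // prints 42
}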

ZedThree commented 3 years ago

I've started looking at this in earnest now, and there are some design decisions we need to make. This post is quite long, but there's a fair bit to consider, and if we're going to break things, it's probably best to make sure we do think about everything!

To summarise some downsides of the current approach:

Given that fixing basically any one of these requires significant changes, let's consider a complete rewrite, including breaking changes to the API and physics models. It would be nice if we could mechanically upgrade physics models to use whatever replacement we decide on.

We have one option already: OptionsNetCDF. I think this evolved out of Ben's proposal above. It basically reads an entire netCDF file into a new Options instance, or takes an Options and writes it to a file immediately. I've started attempting to replace DataFile with it, and it definitely looks like an improvement.

Another option might be an interface that immediately reads/writes values, rather than a whole Options tree.

dump.write("field", Field3D{1.0});
Field3D field = dump.read("field");

We'd probably want to use a variant to store the different types we use, which means we're very likely to want to reuse a fair amount of the code in Options for handling conversions. Now that I think about it, this could probably easily work with OptionsNetCDF:

template <typename T> // but constrained to Options::ValueType members?
void OptionsNetCDF::write(const std::string& name, T value) {
  Options options;
  options[name] = std::move(value); // still involves a copy, just in the function signature
  write(options);
}

template <typename T>
T OptionsNetCDF::read(const std::string& name) {
  return read()[name]; // pretty sure this still has a copy
}

In either case, we have a slight stumbling block when it comes to writing auxiliary time-evolving variables (that is, variables not added to the solver). For time-evolving variables, Solver::outputVars can do the copying/writing without changing user code. PhysicsModel already has a virtual postInit method that can be overridden by the user -- however, to use it effectively the user would have to make sure they call PhysicsModel::postInit() in order to handle the restart file (or copy that code, of course). We might prefer to add a new virtual method that allows the user to output any other variables they like. Such a method would end up looking like:

void MyModel::outputVars() {  // declared 'override' in the class definition
  // OptionsNetCDF API
  Options options;     // could be passed in or a member of PhysicsModel?
  options["field"] = Field3D{1.0};
  options["field"].attributes["time_dimension"] = "t";
  dump.write(options); // or alternatively return options and have this call within the library?

  // Alternative, immediate-write API
  dump.write("field", Field3D{1.0}, DumpFile::time_evolving); // Needs an argument for _which_ time dimension?
}

I think it should be possible to scan a user's physics model source code and move dump.add() calls to such a function and translate them to a new API -- assuming we could work out which calls are intended to be writes, rather than reads!

A variation on the above is to pass an Options into PhysicsModel::init/rhs and have the user add variables to that (possibly returning the Options to be passed to the solver? This would definitely be a bigger refactor). This might be much worse though, as it would involve copying into the Options every internal timestep. The first option of a separate method would only need to be called every output timestep.

Other things to consider:

Phew! I'm sure I've still missed something important. I'm very interested to hear people's thoughts on any of this. I think I'm going to press on trying to replace DataFile with OptionsNetCDF to see if any other issues crop up.

johnomotani commented 3 years ago

Scattered thoughts:

johnomotani commented 3 years ago

We might need to be a bit careful about exactly when some quantities are written. For example, in STORM we normalise the metric components in the PhysicsModel::init() method, so writing of Coordinates quantities ideally should happen after init() is called (rather than, e.g. in the Coordinates constructor, which would seem like a natural place).

ZedThree commented 3 years ago

Thanks John, some really useful points to think about there. Some more scattered thoughts in response!

ZedThree commented 3 years ago

So a pretty severe limitation of netCDF is not being able to get the current extent of individual variables, only the maximum extent of an unlimited dimension (i.e. time). I've opened an issue about it, maybe they will add something -- but even if they do, we would then rely on a bleeding edge version of netCDF.

I've bumped into this problem again due to keeping an OptionsNetCDF instance around for the immediate-write API. Ben's original workaround for the above netcdf issue was to have a map<int, size_t> to cache the size of the time dimension. This works when writing an entire Options to file using a temporary OptionsNetCDF, basically only this idiom:

OptionsNetCDF(filename).write(options);

and the following won't work:

OptionsNetCDF file(filename);
file.write(options);
file.write(options);

I can see two potential ways around this:

  1. Drop down to HDF5. If we additionally rely on HDF5 (and we sort of do so implicitly anyway, as netcdf-cxx4 requires it), we can use H5Sget_simple_extent_dims to get this information -- although that has the limitation that the file first needs to be closed by netCDF (see the sketch after this list).

  2. Keep track of the last index written in an attribute.
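A rough sketch of what option 1 could look like (not code from the branch; the HDF5 calls are the standard C API, and the file would have to be closed by netCDF before calling this):

#include <hdf5.h>
#include <vector>

// Query the current extent of a single variable by dropping down to the HDF5 C API,
// which (unlike the netCDF API) exposes per-variable extents.
std::vector<hsize_t> variable_extent(const char* filename, const char* varname) {
  hid_t file = H5Fopen(filename, H5F_ACC_RDONLY, H5P_DEFAULT);
  hid_t dataset = H5Dopen2(file, varname, H5P_DEFAULT);
  hid_t dataspace = H5Dget_space(dataset);

  const int ndims = H5Sget_simple_extent_ndims(dataspace);
  std::vector<hsize_t> dims(ndims);
  H5Sget_simple_extent_dims(dataspace, dims.data(), nullptr);

  H5Sclose(dataspace);
  H5Dclose(dataset);
  H5Fclose(file);

  // dims[0] is then the number of records already written for this variable
  return dims;
}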

I've opted for the second option, which results in the time-dependent variable writing bit looking like this:

          // NetCDF doesn't keep track of the current time index for each
          // variable (although the underlying HDF5 file does!), so we
          // need to do it ourselves. We'll use an attribute in the
          // file to do so, which means we don't need to keep track of
          // it in the code
          int current_time_index;
          const auto atts_map = var.getAtts();
          const auto it = atts_map.find("current_time_index");
          if (it == atts_map.end()) {
            // Attribute doesn't exist, so let's start at zero. There
            // are various ways this might break, for example, if the
            // variable was added to the file by a different
            // program. But note, that if we use the size of the time
            // dimension here, this will increase every time we add a
            // new variable! So zero is probably the only sensible way
            // to do this
            current_time_index = 0;
          } else {
            it->second.getValues(&current_time_index);
          }

          std::vector<size_t> start_index; ///< Starting index where data will be inserted
          std::vector<size_t> count_index; ///< Size of each dimension

          // Dimensions, including time
          for (const auto& dim : dims) {
            start_index.push_back(0);
            count_index.push_back(dim.getSize());
          }
          // Time dimension
          start_index[0] = current_time_index;
          count_index[0] = 1; // Writing one record

          fmt::print("Writing {} at time {}\n", name, current_time_index);

          // Put the data into the variable
          bout::utils::visit(NcPutVarCountVisitor(var, start_index, count_index),
                             child.value);

          // Make sure to update the time index, including incrementing it!
          var.putAtt("current_time_index", ncInt, ++current_time_index);

This will break if someone adds a time-dependent variable to the file without adding the current_time_index attribute. We'll just start overwriting it from the beginning index.

The first option is likely more robust, but as we'll need to close the file first, I think that might be quite complicated in practice.

ZedThree commented 3 years ago

For context, the current Ncxx4 implementation of DataFormat tracks the latest time index for each variable with a map in the Ncxx4 instance (whereas H5Format can just query this information). Every time that Ncxx4::write_rec is called for a variable, it increments the corresponding index.

After chatting with @johnomotani, we realised that one problem with my proposed method above is that it's possible for different variables to get out of sync, whereas with Ncxx4, this isn't possible because we are forced to write all the variables at once.

John suggested a dump.extend_dimension("t") method that extends the dimension by one, with the write method always writing to the latest index of that dimension. We would then just require that there be exactly one call to extend_dimension per timestep. Missing the call would result in overwriting data; an extra call would result in an "empty" timestep. For the default time dimension, we'd call it in the library, so the user would only need to call it for time dimensions with a different frequency. That sounds reasonable to me, but I'd be interested in hearing other viewpoints.

A separate thought, but one I want to write down somewhere -- we have a time dimension t which is timestep index, and a separate variable t_array which is simulation time -- should we just make t == t_array? I'm not sure what it breaks if we do, but might be interesting to do

johnomotani commented 3 years ago

A separate thought, but one I want to write down somewhere -- we have a time dimension t which is timestep index, and a separate variable t_array which is simulation time -- should we just make t == t_array? I'm not sure what it breaks if we do, but might be interesting to do

I think a netCDF dimension is only an array-size: the 'dimension' has to be a list of consecutive integers starting from 0. https://www.unidata.ucar.edu/software/netcdf/docs/group__dimensions.html#details So it couldn't represent t_array, which is an array of floats.

ZedThree commented 3 years ago

That's the dimension IDs, rather than the dimensions themselves. It turns out if you just define a dimension like we currently do, netCDF basically just makes an empty dataset. If you then also define a variable with the same name, you can then give it values. Here's one of their examples demonstrating this: https://www.unidata.ucar.edu/software/netcdf/docs/pres__temp__4D__wr_8c_source.html

Thinking a little bit more about keeping track of the last time index for each variable -- instead of needing to call dump.extend_dimension("t") at the start of each output timestep, what about having a dump.finish_timestep("t") that checks all variables with the t dimension have the same size? Forgetting to call it wouldn't introduce any new problems, although it wouldn't be able to catch any. For the main dump files, we can probably do this in the library to completely automate it.
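A rough sketch of what that check could look like, assuming we keep the current_time_index attribute from the workaround above (the function name and signature are just placeholders):

#include <netcdf>
#include <string>

// Check that every variable whose first dimension is `time_dim` has been written
// the same number of times, using the current_time_index attribute from above.
bool finish_timestep(netCDF::NcFile& file, const std::string& time_dim = "t") {
  int expected = -1;
  for (const auto& name_var : file.getVars()) {
    const auto& var = name_var.second;
    const auto dims = var.getDims();
    if (dims.empty() || dims[0].getName() != time_dim) {
      continue; // not time-evolving on this dimension
    }
    int index = 0;
    auto atts = var.getAtts();
    const auto it = atts.find("current_time_index");
    if (it != atts.end()) {
      it->second.getValues(&index);
    }
    if (expected == -1) {
      expected = index;
    } else if (index != expected) {
      return false; // variables have got out of sync
    }
  }
  return true;
}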

johnomotani commented 3 years ago

That's the dimension IDs, rather than the dimensions themselves. It turns out if you just define a dimension like we currently do, netCDF basically just makes an empty dataset. If you then also define a variable with the same name, you can then give it values. Here's one of their examples demonstrating this: https://www.unidata.ucar.edu/software/netcdf/docs/pres__temp__4D__wr_8c_source.html

Ah, OK - it's a convention https://www.unidata.ucar.edu/software/netcdf/workshops/2011/datamodels/NcCVars.html (a very logical one!), and a convention that's used by xarray. In xbout we already rename t_array to t when loading for exactly this reason.

johnomotani commented 3 years ago

Thinking a little bit more about keeping track of the last time index for each variable -- instead of needing to call dump.extend_dimension("t") at the start of each output timestep, what about having a dump.finish_timestep("t") that checks all variables with the t dimension have the same size? Forgetting to call it wouldn't introduce any new problems, although it wouldn't be able to catch any. For the main dump files, we can probably do this in the library to completely automate it.

That seems sensible to me as long as the main library does take care of the checking for the standard output t dimension. Maybe dump.finish_timestep("t") could return the current length of the dimension, then the main library could check it against the iteration count for extra safety?

If a user (or e.g. FastOutput) wants to define a different unlimited dimension and append to it, they can be responsible for adding the check themselves. finish_timestep() seems like a good plan - the check can be included for 'production code' but doesn't matter too much if it's skipped for quick debugging, etc. :+1:

BTW I just found out (https://www.unidata.ucar.edu/software/netcdf/docs/unlimited_dims.html) that in netCDF-4 files you're allowed any number of unlimited dimensions, but in netCDF 'classic' you can only have one - another reason to only support netCDF-4 in v5.

ZedThree commented 3 years ago

I've made some decent progress on this, but there are (at least) three points that need a bit of thinking about:

  1. What should the new API look like?
  2. Who owns the output file?
  3. What's the replacement for dump.addRepeat?

For the first question, my current design adds:

  /// Output additional variables other than the evolving variables
  virtual void outputVars(MAYBE_UNUSED(Options& options)) {}
  /// Add additional variables other than the evolving variables to the restart files
  virtual void restartVars(MAYBE_UNUSED(Options& restart)) {}

to PhysicsModel. These get called in PhysicsModel::postInit and PhysicsModel::PhysicsModelMonitor::call. This means auxiliary variables could be calculated only when needed.

I'm hoping this will make it reasonably easy to automate an upgrade. dump.add calls get moved to outputVars and translated to options[name] = var, while restart.add calls get moved to restartVars and translated.
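For illustration, an upgraded model might then look something like this (MyModel and phi are made up; the outputVars signature is the one proposed above):

#include <bout/physicsmodel.hxx>

class MyModel : public PhysicsModel {
  Field3D n, vort; // evolving variables
  Field3D phi;     // auxiliary diagnostic, not handed to the solver

protected:
  int init(bool) override {
    SOLVE_FOR(n, vort);
    return 0;
  }
  int rhs(BoutReal) override {
    // ... calculate phi and the time derivatives ...
    return 0;
  }
  // Replaces dump.addRepeat(phi, "phi") / SAVE_REPEAT(phi)
  void outputVars(Options& options) override {
    options["phi"] = phi;
    options["phi"].attributes["time_dimension"] = "t";
  }
};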

Does the separation into two methods make sense?


For the second point, currently PhysicsModel owns the output file: either PhysicsModel::PhysicsModelMonitor::call calls the outputVars method on other objects to populate the Options to write out, or, for the monitors, there's now Solver::writeToModelOutputFile(const Options& options), which immediately writes options to the output file.

Solver::writeToModelOutputFile is needed because the monitors (e.g. BoutMonitor, which writes timing information) only have access to the solver, and not to the model.

Currently, it is definitely easiest for either the model or the solver to own the output file, as they have pointers to each other, and can easily call mesh and coordinates methods. Ideally, I think there'd be a Simulation class that owned the mesh, solver, model, and output file. Currently we can't (easily) pass in the output file to the physics model because most (all?) user models don't have a constructor. We could add a physics model constructor that took the output file (and anything else we want), if we simply added

using PhysicsModel::PhysicsModel;

to user models. This is definitely possible to automate, and we already have a physics model upgrader tool.

I'm trying to encapsulate things enough that such a refactoring wouldn't be too hard.


dump.addRepeat(field, "field") currently translates to:

options["field"] = field;   // Need to use .force() for ints and BoutReals
options["field"].attributes["time_dimension"] = "t";

which is both a bit unwieldy, and has more surface area for mistakes. I keep coming back to something like:

template <class T>
void Options::assignRepeat(T value, std::string time_dimension = "t", std::string source = "") {
  force(value, source);
  attributes["time_dimension"] = time_dimension;
}

options["field"].assignRepeat(field);

which I'm not overly happy with, but might be the best we can do? There's still the risk of mixing up time_dimension and source -- not sure which way round is best.


I'd love to hear people's thoughts on any of these points.

ZedThree commented 3 years ago

A complication for automatic upgrading: some variables are only conditionally written. Here's an example from test-drift-instability:

    if (evolve_ajpar) {
      solver->add(Ajpar, "Ajpar");
      comms.add(Ajpar);
      output.write("ajpar\n");
    } else {
      initial_profile("Ajpar", Ajpar);
      if (ZeroElMass) {
        SAVE_REPEAT(Ajpar); // output calculated Ajpar
      }
    }

That's basically going to be impossible to correctly translate with regexp. There is a clang Python API that might make it possible, but it's still probably going to be tricky.

I think @bendudson suggested storing pointers in PhysicsModel similar to the way DataFile currently works, as a sort of stop-gap measure -- looks like that might be necessary.

bendudson commented 3 years ago

Thanks @ZedThree I think there might be a way to keep some kind of backward compatibility:

  1. Create a new class with a similar interface to DataFile, and make an instance of it (dump) in the PhysicsModel class
    
    class DataFileFacade {
    ...
    saveRepeat(...);
    saveOnce(...);
    /// Store pointers to variables as DataFile does now
    };

class PhysicsModel { ... DataFileFacade dump; }


2. Use a virtual function with a default implementation which returns the data to be written, taken from the `DataFileFacade` object:

class PhysicsModel {
  ...
  virtual Options outputData() {
    Options options;
    for (const auto& [name, varptr] : dump) {
      options[name] = *varptr;
    }
    return options;
  }
};


So if a user wanted to write different things to output, they could implement this `outputData` method, but the old method would also work: A call to `SAVE_REPEAT(Ajpar)` would store a pointer in `DataFileFacade dump`, which would then be put into the `Options` tree in `outputData`. 

I'm undecided whether passing in an `Options&` to modify would be better than constructing and returning an `Options`, but think this way might be cleaner.

One thing I'm not sure about is how to add multiple monitors. A different virtual function for high frequency output (`outputDataFast`?), or an argument to `outputData` with some kind of tag to indicate which monitor it is?

ZedThree commented 3 years ago

That looks good! I'll have a go at implementing that. One slightly annoying thing is that we basically want the Options::value_type variant, but pointers instead of values. That would simplify the DataFileFacade a lot. It doesn't look like there's a way to transform a variant or iterate over its types, which is a pity, so it'll just have to duplicate the type list.
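For the record, roughly what I have in mind for the facade's storage (sketch only; the names are made up, the type list just mirrors Options::ValueType by hand, and it assumes the usual BOUT++ field/array headers are included):

#include <string>
#include <variant>
#include <vector>

struct DataFileFacade {
  // Pointer analogue of Options::ValueType, duplicated by hand since we can't
  // transform the variant's type list automatically
  using ValuePointer = std::variant<bool*, int*, BoutReal*, std::string*, Field2D*,
                                    Field3D*, FieldPerp*, Array<BoutReal>*,
                                    Matrix<BoutReal>*, Tensor<BoutReal>*>;
  struct StoredValue {
    std::string name;
    ValuePointer value;
    bool time_evolving; // added with SAVE_REPEAT / addRepeat?
  };
  std::vector<StoredValue> data;

  template <class T>
  void addRepeat(T& value, const std::string& name) {
    data.push_back({name, &value, true});
  }
  template <class T>
  void addOnce(T& value, const std::string& name) {
    data.push_back({name, &value, false});
  }
};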

I'm undecided whether passing in an Options& to modify would be better than constructing and returning an Options, but think this way might be cleaner.

We currently don't have a way to "flat-merge" Options (if that makes sense), so I'm currently passing in the Options& to be modified.

bendudson commented 3 years ago

I would also like to see a way to decouple the PhysicsModel and Solver classes, which currently have pointers to each other (bear with me, I think it's sort-of related). I think it should be possible for the PhysicsModel to not have a pointer to Solver, but rather have one method which gets the state (probably as an Options tree :)) and the rhs which takes the state as input. The existing calls to SOLVE_FOR etc. would register pointers in something like DataFileFacade, allowing backward compatibility.

bendudson commented 3 years ago

Ah yes, I guess a flat-merge would be needed at the moment. Eventually it would be nice to have each PhysicsModel output in a tree, but XArray doesn't handle groups? A flat merge should be fairly straightforward to write, but I'm not sure how (in-)efficient it would be.
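Something like this sketch is roughly what I'd imagine for the flat merge (assuming Options nodes can be copy-assigned; if not, the value and attributes members would need copying explicitly):

// Copy every value from `source` into `target` at the same level, flattening any
// sections rather than nesting them (so names could collide between sections)
void flatMerge(Options& target, const Options& source) {
  for (const auto& [name, child] : source.getChildren()) {
    if (child.isValue()) {
      target[name] = child;
    } else {
      flatMerge(target, child);
    }
  }
}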

ZedThree commented 3 years ago

I think the issue is that xarray doesn't read groups out of the box, but if you know what groups to read, it works?

But yes, I would also like more structure in the output file! We were talking about saving the inputs from each restart in some group structure too.

bendudson commented 3 years ago

Had a thought about the different monitors: When making quantities in the output time-dependent, the label of the time coordinate is specified. The outputData function (or whatever it's called) could be passed the name of the time coordinate being extended?

Options outputData(const string& time_dimension);

The default is "t", but if a monitor is created which outputs at a different rate, it would use a different time dimension?

ZedThree commented 3 years ago

Ah good point! We'll need to keep track of time_dimension somehow -- it will either need to be stored in the Monitor base class, or Solver stores a vector of struct MonitorInfo { Monitor* monitor; std::string time_dimension; } instead of just Monitor*. Either one is doable.

Also, I realised that merging Options isn't essential -- as long as the OptionsNetCDF is appending, we can just write the Options result by itself, no need to wait till we've got one big Options to write all at once. There may be performance considerations, but it would work.

johnomotani commented 3 years ago

Ah yes, I guess a flat-merge would be needed at the moment. Eventually it would be nice to have each PhysicsModel output in a tree, but XArray doesn't handle groups?

xarray's open_dataset() and open_mfdataset() functions take a group argument so they can open a single group from a netcdf file. If we have multiple groups, we'd probably want to load all of them and merge them into a single xarray Dataset (e.g. if we have separate groups for 'Coordinates' and 'PhysicsModel variables', we'd still want all the variables to be defined on the same dimensions so we can combine them together in various ways). At the moment that would mean opening the files once for each group that we need to read, which is doable (in xBOUT) but probably not the most efficient.

There's been an on-and-off debate on xarray about how to handle groups - as far as I understand it's not resolved because there's not an obvious solution that works for every use case. We could propose some new functionality for xarray to make things more efficient though, especially if we have a workaround that's acceptable in the meantime. For example, being able to load and merge multiple groups (optionally with some prefix on the variable names or some other way of recording which variables belong to which group) while only opening the file once would go a long way to letting us do (efficiently) whatever we want with xBOUT. That seems generic enough that I think the xarray devs might be open to a PR adding it, but it'd probably take at least a day or two working on it, so I'd like to be sure of what we want/need before starting.

ZedThree commented 3 years ago

The replacement of DataFile with OptionsNetCDF for the main dump file is almost complete. Still need to replace it in GridFile, but I'm hoping that's fairly self-contained and easy. A refactoring of Mesh::get is sort of orthogonal to this.

One thing I've ignored so far is the various options to DataFile:

  OPTION(opt, parallel, false); // By default no parallel formats for now
  OPTION(opt, flush, true);     // Safer. Disable explicitly if required
  OPTION(opt, guards, true);    // Compatible with old behavior
  OPTION(opt, floats, false); // High precision by default
  OPTION(opt, openclose, true); // Open and close every write or read
  OPTION(opt, enabled, true);
  OPTION(opt, init_missing, false); // Initialise missing variables?
  OPTION(opt, shiftoutput, false); // Do we want to write 3D fields in shifted space?
  OPTION(opt, shiftinput, false); // Do we want to read 3D fields in shifted space?
  OPTION(opt, flushfrequency, 1); // How frequently do we flush the file

Of these, I've only kept enabled so far, and not as a general option, just specifically for the output and restart files.

For the others:

Which just leaves shiftinput/shiftoutput. These are probably useful? At least shiftinput could be necessary in order to read older files?

johnomotani commented 3 years ago

shiftinput is only useful (I think) for people moving from v3 to v5, although I guess there might be some people from LLNL or Japan who want to do that. If it's not super easy to implement, it'd probably be straightforward to make a pre-processing tool with xBOUT.

I'm not sure what guards was intended for initially, but it could be useful to have the option to not save the guard cells. It would save quite a lot of disk space, and might also speed up collecting data because we'd be reading contiguous arrays rather than needing to slice off guard cells. If we do implement this, I vote for always including boundary cells, even if we drop guard cells.

ZedThree commented 3 years ago

Yeah, guards does sound useful -- always including boundaries would be essential. Sounds like that could wait for another PR though :)

Ok, next stumbling block: GridFromFile. The bulk of the complexity here is in calculating the offsets to read the appropriate hyperplane from file for a Field2D/3D/Perp on a given processor.

Off the top of my head, there's two routes we could go down:

  1. Read the entire grid file into an Options with OptionsNetCDF, and then take the appropriate hyperplanes from the Matrix/Tensor
  2. Move the hyperplane functionality into OptionsNetCDF

The first option is probably easy to implement, but would mean we necessarily have to read in the entire gridfile on every processor.

The second option is probably nicer in terms of memory use, etc., and is probably more amenable to eventually using parallel netCDF again down the line, but it poses more questions in terms of API.

Maybe just an overload of read that takes a const Mesh&?

Options data = OptionsNetCDF{filename}.read(mesh);

(maybe optional rank too for testing?)
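To give a flavour of option 2, here's a very rough sketch for a Field3D-shaped variable, ignoring the guard/boundary-cell subtleties that make GridFromFile complicated in practice. It assumes Mesh::OffsetX/OffsetY/OffsetZ and LocalNx/LocalNy/LocalNz describe this processor's block, that file is the open netCDF::NcFile inside OptionsNetCDF, and that the usual BOUT++ and netcdf-cxx4 headers are included:

Tensor<BoutReal> readLocalBlock(netCDF::NcFile& file, const std::string& name,
                                const Mesh& mesh) {
  const netCDF::NcVar var = file.getVar(name);

  // Start/count offsets in the same spirit as start_index/count_index on the write side
  const std::vector<size_t> start = {static_cast<size_t>(mesh.OffsetX),
                                     static_cast<size_t>(mesh.OffsetY),
                                     static_cast<size_t>(mesh.OffsetZ)};
  const std::vector<size_t> count = {static_cast<size_t>(mesh.LocalNx),
                                     static_cast<size_t>(mesh.LocalNy),
                                     static_cast<size_t>(mesh.LocalNz)};

  Tensor<BoutReal> local(mesh.LocalNx, mesh.LocalNy, mesh.LocalNz);
  var.getVar(start, count, &local(0, 0, 0));
  return local;
}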

ZedThree commented 3 years ago

Oh, the other difficulty with option 2 is that when we read in the data in OptionsNetCDF, we read 1D data into an Array, 2D data into Matrix, and 3D data into Tensor, so we can't distinguish between Field2D and FieldPerp. That much at least is necessary for choosing the right offsets and extents.

There's also the question of older files needing FFTs.

bendudson commented 11 months ago

This is now done (#2209), and supports both NetCDF and ADIOS2 output (#2766). The next step will probably be to remove the DataFileFacade backward-compatibility layer.