Closed bendudson closed 11 months ago
Quickly put together a minimal working version of an interface in branch "new-fileio".
Header file code: https://github.com/boutproject/BOUT-dev/blob/new-fileio/include/bout/fileio.hxx
Example code: https://github.com/boutproject/BOUT-dev/blob/new-fileio/examples/fileio/test-fileio.cxx
I quite like being able to read data using `int value = dump["val"];`, but am less sure about `dump["val"] = 3;`. It's concise, but it's assigning an `int` to a `Dataset`, so what should happen when a `Dataset` is assigned to another `Dataset`? Something more explicit might be `dump["val"].setValue(3);`, which is explicit but a bit ugly, or maybe have another class (`DatasetValue`?) so that we write:

```cpp
dump["val"].value = 3;
...
int value = dump["val"].value;
```
Perhaps that might be clearer, since `Dataset` will also contain attributes, dimensions etc.

Should assigning one `Dataset` to another be allowed? What should happen here?

```cpp
Dataset data = dump["val"];
data = dump["another"];
```
I think a reasonable interpretation is that "val" in the file is modified to have the same value and attributes as "another". Someone could also read it as just changing `data` to refer to a different variable in the file, not modifying any data.
I've started looking at this in earnest now, and there are some design decisions we need to make. This post is quite long, but there's a fair bit to consider, and if we're going to break things, it's probably best to make sure we do think about everything!
To summarise some downsides of the current approach:

- `DataFile` is quite painful to touch: there's a lot of copy-pasting required, and things need to be repeated across `DataFile`, `DataFormat` and the specific implementations.
- Some of this could be reduced with a `variant` -- a refactoring, but keeping the same API.
- `dump.addOnce(field, "field")` may end up either reading and/or writing `field` to/from `dump` depending on whether `dump.read()` or `dump.write()` gets called next. This is a bit confusing to read, and plausibly a bug waiting to happen (?)
- In the following,

  ```cpp
  for (Field3D& field : vector_of_fields) {
    dump.addOnce(field, field.name);
    dump.write();
  }
  ```

  all the fields in the file get overwritten by the last field added. This is because we store a pointer to the object to be written, and implicitly require its lifetime to be longer than the last call to `write`, as we write out everything each time.
- Do we actually need to write out everything on every call to `write`?
- We can't write a `const` value without a `const_cast`: gross :nauseated_face:
Given that fixing basically any one of these requires significant changes, let's consider a complete rewrite, including breaking changes to the API and physics models. It would be nice if we could mechanically upgrade physics models to use whatever replacement we decide on.
We have one option already OptionsNetCDF
: I think this evolved out of Ben's proposal above. This basically reads an entire netCDF file into a new Options
instance, or takes an Options
and writes it to a file immediately. I've started attempting to replace DataFile
with it, and it definitely looks like an improvement.
Interface is very simple:

```cpp
// Write a Field3D to file
Options write_options;
write_options["field"] = Field3D{1.0};
OptionsNetCDF(unknown).write(write_options);

// Read the Field3D back
Options read_options = OptionsNetCDF(unknown).read();
auto field = read_options["field"].as<Field3D>();
```
Downsides: values get copied into the `Options` instance, so there's more copying than with `DataFile`, but maybe not noticeable compared to the IO cost? Marking a variable as time-evolving is also a little clunky:

```cpp
options["field"].attributes["time_dimension"] = "t";
```

This could be solved with some helper method/function:

```cpp
options["field"].time_evolving(true);
```

Although the first way is more generic if we want to have multiple different time dimensions (useful for calculating diagnostics on a different time scale?)
Another option might be an interface that immediately reads/writes values, rather than a whole `Options` tree:

```cpp
dump.write("field", Field3D{1.0});
Field3D field = dump.read("field");
```

We'd probably want to use a `variant` to store the different types we use, which means we're very likely to want to reuse a fair bit of the code in `Options` for handling conversions. Now that I think about it, this could probably easily work with `OptionsNetCDF`:
```cpp
template <typename T> // but constrained to Options::ValueType members?
void OptionsNetCDF::write(const std::string& name, T value) {
  Options options;
  options[name] = std::move(value); // still involves a copy, just in the function signature
  write(options);
}

template <typename T>
T OptionsNetCDF::read(const std::string& name) {
  return read()[name].as<T>(); // pretty sure this still has a copy
}
```
In either case, we have a slight stumbling block when it comes to writing auxiliary time-evolving variables (that is, variables not added to the solver). For time-evolving variables, `Solver::outputVars` can do the copying/writing without changing user code. `PhysicsModel` already has a virtual `postInit` method that can be overridden by the user -- however, to use it effectively would require the user to make sure they call `PhysicsModel::postInit()` in order to handle the restart file (or copy that code, of course). We might prefer to add a new virtual method that allows the user to output any other variables they like.
Such a method would end up looking like:
```cpp
void MyModel::outputVars() override {
  // OptionsNetCDF API
  Options options; // could be passed in or a member of PhysicsModel?
  options["field"] = Field3D{1.0};
  options["field"].attributes["time_dimension"] = "t";
  dump.write(options); // or alternatively return options and have this call within the library?

  // Alternative, immediate-write API
  dump.write("field", Field3D{1.0}, DumpFile::time_evolving); // Needs an argument for _which_ time dimension?
}
```
I think it should be possible to scan a user's physics model source code, move `dump.add()` calls to such a function, and translate them to a new API -- assuming we could work out which calls are intended to be writes, rather than reads!
A variation on the above is to pass an `Options` into `PhysicsModel::init/rhs` and have the user add variables to that (possibly returning the `Options` to be passed to the solver? This would definitely be a bigger refactor). This might be much worse though, as it would involve copying into the `Options` every internal timestep. The first option of a separate method would only need to be called every output timestep.
Other things to consider:

- A parallel I/O option would need a new implementation, of `OptionsNetCDF` at least.
- Using `OptionsNetCDF` might mean `PhysicsModel` needs both an `Options` instance, as well as an `OptionsNetCDF`. It might be easier then if `OptionsNetCDF` has an `Options` member that can be used directly? Not sure about this.

Phew! I'm sure I've still missed something important. I'm very interested to hear people's thoughts on any of this. I think I'm going to press on trying to replace `DataFile` with `OptionsNetCDF` to see if any other issues crop up.
Scattered thoughts:

- Should there be both `addTimeDependent()` and `addTimeDependentCopy()`?
- We might need to be a bit careful about exactly when some quantities are written. For example, in STORM we normalise the metric components in the `PhysicsModel::init()` method, so writing of `Coordinates` quantities ideally should happen after `init()` is called (rather than, e.g., in the `Coordinates` constructor, which would seem like a natural place).
Thanks John, some really useful points to think about there. Some more scattered thoughts in response!
For a parallel I/O option: now we've got generic factories, converting to an interface+factory shouldn't be too difficult, although it might break some uses if we do it later. Pretty sure we can safely defer any decision here though. We can always hide it in another layer, similar to the current approach of `DataFile`/`DataFormat`.
If we write a variable immediately, then it would be nice to avoid the copy completely. That's not possible with the existing `OptionsNetCDF`: it relies on `Options` to handle converting to the correct type. If we wrote something new, I'd probably want to pull out those bits into some other helper class/functions -- it's basically the `Options::as<>` specialisations, about 350 lines. At first blush I thought this might be quite easy, but thinking again, to avoid a copy we'd want to move the data out of the variant, so this might be a little subtle. I think I can see a way to do this though.
Ah, I did forget one important point about copying! Our expensive types are actually all copy-on-write, so one might think that the excess copying isn't so bad:

```cpp
Field3D f{1.0};
Options options;
options["field3d"] = f; // copy-on-write, so cheaper than copying the full data block
f = DDX(g); // swap data block with result variable, also cheap
```
I was worried about `options["field3d"]` keeping an extra reference to the original data block hanging round, causing `Field3D::allocate/Array::ensureUnique` to trigger an expensive copy, but thinking through this, that worry is probably unfounded. `Matrix`/`Tensor` have copy-on-write disabled, so this would be expensive for them, but I rather suspect people are not writing those out as time-dependent variables.
If we always write variables immediately, using a different method for time-dependent variables is basically just a change of spelling:
```cpp
dump.write("non-evolving variable", Field3D{1.0});

// could spell time-evolving variable on default time dimension with an argument
// (here using an enum to name it, rather than bare bool)
dump.write("evolving variable", Field3D{1.0}, TimeDependent::on, "t");

// or with different method
dump.writeTimeDependent("evolving variable", Field3D{1.0}, "t");
```
I prefer the first spelling, but not enough to veto the second :)
If we go with the `OptionsNetCDF` approach, we could add another argument to `Options::assign` to add the `"time_dimension"` attribute more easily:

```cpp
options["n"].assign(n, TimeDependent::on); // again, could be spelled as a different method
```
I'm really not keen on keeping pointers/references to time-dependent variables in `Options`. I don't think they're necessary in either of the two proposed approaches:

- With an immediate `dump.write` for both time-independent and time-dependent variables, variables sort of "obviously" have to have the correct lifetime: either the call to `dump.write()` is in the same scope as the variable, or it's a compile-time error.
- With the `OptionsNetCDF` approach, the copying actually becomes a benefit, because you can declare (and define) a variable in the smallest scope:
```cpp
int rhs(bool) {
  ddt(n) = ...
  auto n_diagnostic = diagnostic(n);
  output_file_options["n_diagnostic"] = n_diagnostic; // copy means n_diagnostic can go out of scope
}
```
Here I've assumed that `output_file_options` is a member of `PhysicsModel` (could also be passed in!), and we've got a call to `OptionsNetCDF::write(output_file_options)` somewhere else. And this is true also if we go with a separate `PhysicsModel::outputVars` method (name subject to discussion!), or if you calculate something in a `Monitor`.

Again, either a variable is sort of "obviously" in correct scope because `output_file_options` is visible, or it's a compile-time error.
A potential issue with `OptionsNetCDF` that I've just thought of is that we would need to be careful to pass the `Options` around by reference, and not accidentally take copies of it.

`OptionsNetCDF` being able to write a whole `Options` tree at once is really nice, but the immediate-write approach is also very appealing. My personal preference at the minute is to combine them, and add immediate-read/write methods to `OptionsNetCDF` that avoid copying. This would give us the best of both worlds. Given the current use of variants though, it may not be possible to avoid copying without some code duplication.
So a pretty severe limitation of netCDF is not being able to get the current extent of individual variables, only the maximum extent of an unlimited dimension (i.e. time). I've opened an issue about it, maybe they will add something -- but even if they do, we would then rely on a bleeding edge version of netCDF.
I've bumped into this problem again due to keeping an `OptionsNetCDF` instance around for the immediate-write API. Ben's original workaround for the above netCDF issue was to have a `map<int, size_t>` to cache the size of the time dimension. This works when writing an entire `Options` to file using a temporary `OptionsNetCDF`, basically only this idiom:

```cpp
OptionsNetCDF(filename).write(options);
```

and the following won't work:

```cpp
OptionsNetCDF file(filename);
file.write(options);
file.write(options);
```
I can see two potential ways around this:

1. Drop down to HDF5. If we additionally rely on HDF5 (and we sort of do implicitly, as netcdf-cxx4 requires it), we can use `H5Sget_simple_extent_dims` to get this information. Although that has the limitation of needing to close the file with netCDF first.
2. Keep track of the last index written in an attribute.
I've opted for the second option, which results in the time-dependent variable writing bit looking like this:
```cpp
// NetCDF doesn't keep track of the current time index for each
// variable (although the underlying HDF5 file does!), so we
// need to do it ourselves. We'll use an attribute in the
// file to do so, which means we don't need to keep track of
// it in the code
int current_time_index;
const auto atts_map = var.getAtts();
const auto it = atts_map.find("current_time_index");
if (it == atts_map.end()) {
  // Attribute doesn't exist, so let's start at zero. There
  // are various ways this might break, for example, if the
  // variable was added to the file by a different
  // program. But note that if we use the size of the time
  // dimension here, this will increase every time we add a
  // new variable! So zero is probably the only sensible way
  // to do this
  current_time_index = 0;
} else {
  it->second.getValues(&current_time_index);
}

std::vector<size_t> start_index; ///< Starting index where data will be inserted
std::vector<size_t> count_index; ///< Size of each dimension

// Dimensions, including time
for (const auto& dim : dims) {
  start_index.push_back(0);
  count_index.push_back(dim.getSize());
}
// Time dimension
start_index[0] = current_time_index;
count_index[0] = 1; // Writing one record

fmt::print("Writing {} at time {}\n", name, current_time_index);

// Put the data into the variable
bout::utils::visit(NcPutVarCountVisitor(var, start_index, count_index),
                   child.value);

// Make sure to update the time index, including incrementing it!
var.putAtt("current_time_index", ncInt, ++current_time_index);
```
This will break if someone adds a time-dependent variable to the file without adding the `current_time_index` attribute: we'll just start overwriting it from the beginning.
The first option is likely more robust, but as we'll need to close the file first, I think that might be quite complicated in practice.
For context, the current `Ncxx4` implementation of `DataFormat` tracks the latest time index for each variable with a `map` in the `Ncxx4` instance (whereas `H5Format` can just query this information). Every time that `Ncxx4::write_rec` is called for a variable, it increments the corresponding index.
After chatting with @johnomotani, we realised that one problem with my proposed method above is that it's possible for different variables to get out of sync, whereas with `Ncxx4`, this isn't possible because we are forced to write all the variables at once.
John suggested a `dump.extend_dimension("t")` method that extends the dimension by one, after which the `write` method always writes to the latest index of that dimension. We would then just require that there be exactly one call to `extend_dimension` per timestep. Missing the call would result in overwriting the data; an extra call would result in an "empty" timestep. For the default time dimension, we'd call it in the library, so the user would only need to call it for time dimensions with different frequencies. That sounds reasonable to me, but I'd be interested in hearing other viewpoints.
A separate thought, but one I want to write down somewhere -- we have a time dimension `t` which is the timestep index, and a separate variable `t_array` which is the simulation time -- should we just make `t` == `t_array`? I'm not sure what it breaks if we do, but might be interesting to do.
> A separate thought, but one I want to write down somewhere -- we have a time dimension t which is timestep index, and a separate variable t_array which is simulation time -- should we just make t == t_array? I'm not sure what it breaks if we do, but might be interesting to do
I think a netCDF dimension is only an array-size: the 'dimension' has to be a list of consecutive integers starting from 0.
https://www.unidata.ucar.edu/software/netcdf/docs/group__dimensions.html#details
So it couldn't represent `t_array`, which is an array of floats.
That's the dimension IDs, rather than the dimensions themselves. It turns out if you just define a dimension like we currently do, netCDF basically just makes an empty dataset. If you then also define a variable with the same name, you can then give it values. Here's one of their examples demonstrating this: https://www.unidata.ucar.edu/software/netcdf/docs/pres__temp__4D__wr_8c_source.html
Thinking a little bit more about keeping track of the last time index for each variable -- instead of needing to call `dump.extend_dimension("t")` at the start of each output timestep, what about having a `dump.finish_timestep("t")` that checks all variables with the `t` dimension have the same size? Forgetting to call it wouldn't introduce any new problems, although it wouldn't be able to catch any either. For the main dump files, we can probably do this in the library to completely automate it.
> That's the dimension IDs, rather than the dimensions themselves. It turns out if you just define a dimension like we currently do, netCDF basically just makes an empty dataset. If you then also define a variable with the same name, you can then give it values. Here's one of their examples demonstrating this: https://www.unidata.ucar.edu/software/netcdf/docs/pres__temp__4D__wr_8c_source.html
Ah, OK - it's a convention https://www.unidata.ucar.edu/software/netcdf/workshops/2011/datamodels/NcCVars.html (a very logical one!), and a convention that's used by `xarray`. In `xbout` we already rename `t_array` to `t` when loading for exactly this reason.
> Thinking a little bit more about keeping track of the last time index for each variable -- instead of needing to call dump.extend_dimension("t") at the start of each output timestep, what about having a dump.finish_timestep("t") that checks all variables with the t dimension have the same size? Forgetting to call it wouldn't introduce any new problems, although it wouldn't be able to catch any. For the main dump files, we can probably do this in the library to completely automate it.
That seems sensible to me, as long as the main library does take care of the checking for the standard output `t` dimension. Maybe `dump.finish_timestep("t")` could return the current length of the dimension, then the main library could check it against the iteration count for extra safety?
If a user (or e.g. FastOutput) wants to define a different unlimited dimension and append to it, they can be responsible for adding the check themselves. `finish_timestep()` seems like a good plan - the check can be included for 'production code' but doesn't matter too much if it's skipped for quick debugging, etc. :+1:
BTW I just found out (https://www.unidata.ucar.edu/software/netcdf/docs/unlimited_dims.html) that in netCDF-4 files you're allowed any number of unlimited dimensions, but in netCDF 'classic' you can only have one - another reason to only support netCDF-4 in v5.
I've made some decent progress on this, but there are (at least) three points that need a bit of thinking about:

1. How should users add auxiliary output and restart variables?
2. Which class should own the output file?
3. What should replace `dump.addRepeat`?

For the first question, my current design adds:
```cpp
/// Output additional variables other than the evolving variables
virtual void outputVars(MAYBE_UNUSED(Options& options)) {}
/// Add additional variables other than the evolving variables to the restart files
virtual void restartVars(MAYBE_UNUSED(Options& restart)) {}
```
to `PhysicsModel`. These get called in `PhysicsModel::postInit` and `PhysicsModel::PhysicsModelMonitor::call`. This means auxiliary variables could be calculated only when needed.
I'm hoping this will make it reasonably easy to automate an upgrade. `dump.add` calls get moved to `outputVars` and translated to `options[name] = var`, while `restart.add` calls get moved to `restartVars` and translated.
Does the separation into two methods make sense?
For the second point, currently `PhysicsModel` owns the output file, and either `PhysicsModel::PhysicsModelMonitor::call` calls the `outputVars` method on other objects to populate the `Options` to write out; or, for the monitors, there's now `Solver::writeToModelOutputFile(const Options& options)` which immediately writes `options` to the output file.
`Solver::writeToModelOutputFile` is needed because the monitors (e.g. `BoutMonitor`, which writes timing information) only have access to the solver, and not to the model.
Currently, it is definitely easiest for either the model or the solver to own the output file, as they have pointers to each other, and can easily call mesh and coordinates methods. Ideally, I think there'd be a `Simulation` class that owned the mesh, solver, model, and output file. Currently we can't (easily) pass the output file in to the physics model, because most (all?) user models don't have a constructor. We could add a physics model constructor that took the output file (and anything else we want), if we simply added `using PhysicsModel::PhysicsModel;` to user models. This is definitely possible to automate, and we already have a physics model upgrader tool.

I'm trying to encapsulate things enough that such a refactoring wouldn't be too hard.
`dump.addRepeat(field, "field")` currently translates to:

```cpp
options["field"] = field; // Need to use .force() for ints and BoutReals
options["field"].attributes["time_dimension"] = "t";
```

which is both a bit unwieldy, and has more surface area for mistakes. I keep coming back to something like:
```cpp
template <class T>
void Options::assignRepeat(T value, std::string time_dimension = "t", std::string source = "") {
  force(value, source);
  attributes["time_dimension"] = time_dimension;
}

options["field"].assignRepeat(field);
```
which I'm not overly happy with, but might be the best we can do? There's still the risk of mixing up `time_dimension` and `source` -- not sure which way round is best.

I'd love to hear people's thoughts on any of these points.
A complication for automatic upgrading: some variables are only conditionally written. Here's an example from `test-drift-instability`:
```cpp
if (evolve_ajpar) {
  solver->add(Ajpar, "Ajpar");
  comms.add(Ajpar);
  output.write("ajpar\n");
} else {
  initial_profile("Ajpar", Ajpar);
  if (ZeroElMass) {
    SAVE_REPEAT(Ajpar); // output calculated Ajpar
  }
}
```
That's basically going to be impossible to translate correctly with a regexp. There is a clang Python API that might make it possible, but it's still probably going to be tricky.
I think @bendudson suggested storing pointers in `PhysicsModel`, similar to the way `DataFile` currently works, as a sort of stop-gap measure -- looks like that might be necessary.
Thanks @ZedThree I think there might be a way to keep some kind of backward compatibility:

1. Add a `DataFileFacade` object to the `PhysicsModel` class:

```cpp
class DataFileFacade {
  ...
  saveRepeat(...);
  saveOnce(...);
  /// Store pointers to variables as DataFile does now
};

class PhysicsModel { ... DataFileFacade dump; };
```

2. Use a virtual function with a default implementation which returns the data to be written, taken from the `DataFileFacade` object:

```cpp
class PhysicsModel {
  ...
  virtual Options outputData() {
    Options options;
    for (const auto& [name, varptr] : dump) {
      options[name] = *varptr;
    }
    return options;
  }
};
```

So if a user wanted to write different things to output, they could implement this `outputData` method, but the old method would also work: a call to `SAVE_REPEAT(Ajpar)` would store a pointer in `DataFileFacade dump`, which would then be put into the `Options` tree in `outputData`.
I'm undecided whether passing in an `Options&` to modify would be better than constructing and returning an `Options`, but think this way might be cleaner.
One thing I'm not sure about is how to add multiple monitors. A different virtual function for high frequency output (`outputDataFast`?), or an argument to `outputData` with some kind of tag to indicate which monitor it is?
That looks good! I'll have a go at implementing that. One slightly annoying thing is that we basically want the `Options::value_type` variant, but with pointers instead of values; that would simplify the `DataFileFacade` a lot. It doesn't look like there's a built-in way to transform a variant's type list or iterate over its types, which is a pity, so it'll just have to duplicate the type list.
> I'm undecided whether passing in an `Options&` to modify would be better than constructing and returning an `Options`, but think this way might be cleaner.
We currently don't have a way to "flat-merge" `Options` (if that makes sense), so I'm currently passing in the `Options&` to be modified.
I would also like to see a way to decouple the `PhysicsModel` and `Solver` classes, which currently have pointers to each other (bear with me, I think it's sort-of related). I think it should be possible for the `PhysicsModel` to not have a pointer to `Solver`, but rather have one method which gets the state (probably as an `Options` tree :)) and the `rhs` which takes the state as input. The existing calls to `SOLVE_FOR` etc. would register pointers in something like `DataFileFacade`, allowing backward compatibility.
Ah yes, I guess a flat-merge would be needed at the moment. Eventually it would be nice to have each `PhysicsModel` output in a tree, but XArray doesn't handle groups?
A flat merge should be fairly straightforward to write, but I'm not sure how (in-)efficient it would be.
I think the issue is that xarray doesn't read groups out of the box, but if you know what groups to read, it works?
But yes, I would also like more structure in the output file! We were talking about saving the inputs from each restart in some group structure too.
Had a thought about the different monitors: when making quantities in the output time-dependent, the label of the time coordinate is specified. The `outputData` function (or whatever it's called) could be passed the name of the time coordinate being extended?

```cpp
Options outputData(const string& time_dimension);
```

The default is "t", but if a monitor is created which outputs at a different rate, it would use a different time dimension?
Ah good point! We'll need to keep track of `time_dimension` somehow -- it will either need to be stored in the `Monitor` base class, or `Solver` stores a vector of `struct MonitorInfo { Monitor* monitor; std::string time_dimension; }` instead of just `Monitor*`. Either one is doable.
Also, I realised that merging `Options` isn't essential -- as long as the `OptionsNetCDF` is appending, we can just write each `Options` result by itself, no need to wait until we've got one big `Options` to write all at once. There may be performance considerations, but it is possible.
> Ah yes, I guess a flat-merge would be needed at the moment. Eventually it would be nice to have each PhysicsModel output in a tree, but XArray doesn't handle groups?
`xarray`'s `open_dataset()` and `open_mfdataset()` functions take a `group` argument so they can open a single group from a netCDF file. If we have multiple groups, we'd probably want to load all of them and merge them into a single xarray Dataset (e.g. if we have separate groups for 'Coordinates' and 'PhysicsModel variables', we'd still want all the variables to be defined on the same dimensions so we can combine them together in various ways). At the moment that would mean opening the files once for each group that we need to read, which is doable (in `xBOUT`) but probably not the most efficient.
There's been an on-and-off debate on `xarray` about how to handle groups - as far as I understand, it's not resolved because there's not an obvious solution that works for every use case. We could propose some new functionality for xarray to make things more efficient though, especially if we have a workaround that's acceptable in the meantime. For example, being able to load and merge multiple groups (optionally with some prefix on the variable names or some other way of recording which variables belong to which group) while only opening the file once would go a long way to letting us do (efficiently) whatever we want with `xBOUT`. That seems generic enough that I think the xarray devs might be open to a PR adding it, but it'd probably take at least a day or two working on it, so I'd like to be sure of what we want/need before starting.
The replacement of `DataFile` with `OptionsNetCDF` for the main dump file is almost complete. Still need to replace it in `GridFile`, but I'm hoping that's fairly self-contained and easy. A refactoring of `Mesh::get` is sort of orthogonal to this.
One thing I've ignored so far is the various options to `DataFile`:

```cpp
OPTION(opt, parallel, false);     // By default no parallel formats for now
OPTION(opt, flush, true);         // Safer. Disable explicitly if required
OPTION(opt, guards, true);        // Compatible with old behavior
OPTION(opt, floats, false);       // High precision by default
OPTION(opt, openclose, true);     // Open and close every write or read
OPTION(opt, enabled, true);
OPTION(opt, init_missing, false); // Initialise missing variables?
OPTION(opt, shiftoutput, false);  // Do we want to write 3D fields in shifted space?
OPTION(opt, shiftinput, false);   // Do we want to read 3D fields in shifted space?
OPTION(opt, flushfrequency, 1);   // How frequently do we flush the file
```
Of these, I've only kept `enabled` so far, and not as a general option, just specifically for the output and restart files.

For the others:

- `flush`, `flushfrequency`, `openclose`, and `parallel` are all definitely gone: we now always open and close the file on each write, and there's no parallel implementation.
- `init_missing` can be replaced with `Options::withDefault`.
- `floats` can probably just go? I've had to tweak the tolerance on some of the tests that have benchmark files that use floats. This could probably be implemented, if really necessary.
- I'm not sure `guards` does anything.

Which just leaves `shiftinput`/`shiftoutput`. These are probably useful? At least `shiftinput` could be necessary in order to read older files?
`shiftinput` is only useful (I think) for people moving from v3 to v5, although I guess there might be some people from LLNL or Japan who want to do that. If it's not super easy to implement, it'd probably be straightforward to make a pre-processing tool with xBOUT.
I'm not sure what `guards` was intended for initially, but it could be useful to have the option to not save the guard cells. It would save quite a lot of disk space, and might also speed up collecting data because we'd be reading contiguous arrays rather than needing to slice off guard cells. If we do implement this, I vote for always including boundary cells, even if we drop boundary guard cells.
Yeah, `guards` does sound useful -- always including boundaries would be essential. Sounds like that could wait for another PR though :)
Ok, next stumbling block: `GridFromFile`. The bulk of the complexity here is in calculating the offsets to read the appropriate hyperplane from file for a `Field2D/3D/Perp` on a given processor.
Off the top of my head, there are two routes we could go down:

1. Read the whole file into an `Options` with `OptionsNetCDF`, and then take the appropriate hyperplanes from the `Matrix`/`Tensor`.
2. Do the offset calculation inside `OptionsNetCDF`, so each processor reads only its own hyperplanes.

The first option is probably easy to implement, but would mean we necessarily have to read in the entire gridfile on every processor.

The second option is probably nicer in terms of memory use, etc., and is probably more amenable to eventually using parallel netCDF down the line again perhaps? But poses more questions in terms of API.
Maybe just an overload of `read` that takes a `const Mesh&`?

```cpp
Options data = OptionsNetCDF(unknown).read(mesh);
```

(maybe an optional rank too, for testing?)
Oh, the other difficulty with option 2 is that when we read in the data in `OptionsNetCDF`, we read 1D data into an `Array`, 2D data into a `Matrix`, and 3D data into a `Tensor`, so we can't distinguish between `Field2D` and `FieldPerp`. That much at least is necessary for choosing the right offsets and extents.
There's also the question of older files needing FFTs.
This is now done (#2209), and supports both NetCDF and ADIOS2 output (#2766).
The next step will probably be to remove the `DataFileFacade` backward-compatibility layer.
There are various issues related to the current DataFile implementation, e.g. issues #222, #221, #102, #412, #374, #367, #644. The current design is convenient in many cases, but leads to nasty surprises when more complicated things are attempted. As a reminder, in the current system:

- `DataFile` provides a consistent interface to various `DataFormat` classes, which each implement interfaces to the NetCDF, HDF libraries etc.
- `DataFile` stores pointers to these objects, which therefore can't be destroyed (e.g. go out of scope):
  - `dump.read()` loads all data into the fields
  - `dump.write()` saves data from all the fields.

This makes sense for data which is all output together at fixed time points (e.g. classic NetCDF), but is too restrictive if we want more than one time index (e.g. high, low frequency). It also makes adding diagnostics cumbersome, since temporary variables must be created and added to store the diagnostic values.
I think a new design should have clearer `read()` and `write()` semantics, so that reading a value becomes something like `int value = dump["val"];`, and writing a value something like `dump["val"] = 3;`.

There are many different ways of doing this, and several ways we could handle appending time indices. Probably good to look at how other C++11 libraries handle things. For example: https://github.com/BlueBrain/HighFive