include input parameters in hdf5 output?

PrincetonUniversity / athena

Athena++ radiation GRMHD code and adaptive mesh refinement (AMR) framework

https://www.athena-astro.app

BSD 3-Clause "New" or "Revised" License

239 stars 129 forks source link

include input parameters in hdf5 output? #143

Closed alwinm closed 6 years ago

alwinm commented 6 years ago

Just wondering about an idea that might take more effort than it's worth to implement. Hoping it might be useful to at least start a discussion about this idea?

Part 1

Has there already been discussion for attaching input parameters or config parameters in output files? I don't know if this is easy/hard to do for some formats but it seems like it would work fine with HDF5.

I am imagining that output files can have all the athena config data, maybe in the form of a single string or in the form of a dictionary. Examples (from wiki):

python configure.py --prob linear_wave --flux hllc
...
dict['prob'] = 'linear_wave'
dict['flux'] = 'hllc'

But also a dictionary with input data, in a format similar to what athena_read.py would return: dict['block']['parameter'] = value.

It seems like maybe this could improve portability and keeping track of exactly what went into a simulation without having to search the rest of the directory for an athinput file, but would also require (a small amount of) extra disk space. Naively, this seems like it would be easy to implement.

Part 2

In the long run, maybe a version of Athena++ could be run from an HDF5 file, making it easy to restart from an output without a separate restart file. Or even editing an HDF5 file by hand to plug it into Athena without writing a whole new problem generator. I imagine this would make it easy to do "zoom-in" simulations or restarts with drastically different changes.

Here is a workflow I imagine:

python athena_from_hdf5.py Output.athdf

The python script uses the .athdf file to get the right config and configures Athena++ appropriately. It also gets the input arguments and runs Athena++. These parts are already easy from utils.athena.py, which can configure and run Athena++. The hard part is an additional flag or some way for Athena++ to know it has to read the current state from the HDF5 file. And also a generic HDF5 reader to actually do the reading. I don't know how much extra data and bookkeeping would be needed to do this for AMR, weird boundary conditions, weird meshes, etc. It seems like people are working on readers for Athena++ to load other data/tables anyway, so this might not be too hard to do.

Part 3

I'm under the impression that all of this information already exists in restart files. I suggested a few vague ideas for moving outputs closer to (editable, flexible) restarts, but maybe it would be easier to go from the other direction and enable a form of data analysis with parsing restart files. This is very unclear to me, and it's also not clear how easy it would be to edit a restart file or excise just a part of it.

felker commented 6 years ago

Thanks for prompting this discussion, Alwin.

I don't think anything that you have suggested is infeasible or costly from a technical point of view, but it is more of a philosophical question of "portability" that I imagine went into deciding the current setup. We can easily add more file-level attributes to the HDF5 outputs and/or processing with athena_read.py, as I requested for reading the time value in #33.

However, once you start expanding the set of output parameters bundled with each HDF5 file beyond the self-describing physical/mesh quantities (resolution, timeslice, coordinates), I am not sure where you draw the line: all of the values in the associated athinput. file? The ./configure.py arguments? The target hardware/compiler/library versions? For example, I think all of these parameters would require that you also keep track of the code version, such as the Git SHA1, in order to be truly meaningful. The set of such options for the solver and the problem are always evolving.

The current minimalist approach might require some Athena++ users to devote a good deal of time and effort developing their own workflows for organizing and annotating their outputs. But, I don't think that can really be avoided.

If it is a matter of convenience, I think a Python script for interacting more transparently with restart files could be a good idea (I don't use restart files very often). Also, not all parameters are written to restart files, see: https://github.com/PrincetonUniversity/athena/blob/f35362df4b369e077d11d84f6da32926185c2207/src/outputs/restart.cpp#L101-L108 EDIT: I was mistaken about this; the entire set of input parameters and configure options are written in the header, as pointed out below. https://github.com/PrincetonUniversity/athena/blob/f35362df4b369e077d11d84f6da32926185c2207/src/outputs/restart.cpp#L96-L99

I would be interested to hear what @jmstone 's thoughts are regarding this design choice. Does anyone know how other codes approach this bookkeeping issue?

msbc commented 6 years ago

There is code to read athinput files. See #59 and athena_read.py. The headers of restart files are essentially athinput files, so athena_read.py could be modified to read this. If this is of interest to you let me know.

I think I talked to @c-white about using restart files for post processing. If I recall correctly, the issue here is that restart files are architecture dependent and thus are not really portable.

c-white commented 6 years ago

Including more attributes with HDF5 would be easy, but I agree with Kyle that I don't know where to draw the line. There's also a very slight worry about performance, depending on how this is implemented. There's no telling what the H5Awrite() library routine is doing under the hood, so having a hundred such sequential calls might have an adverse effect when running with a few hundred thousand MPI ranks. (Of course there would be no performance worry if/when we get around to writing HDF5 files without the HDF5 library...)

For HDF5 to function as a restart format, it would have to be augmented with:

The input file (as found in the restart files in plaintext at the beginning).
A small amount of metadata (the binary stuff just after the input file in restarts, linked by Kyle above).
Physical ghost zone data (in case of do-nothing boundary conditions).
Magnetic fields on faces rather than at cell centers.
All necessary variables (primitive, not just conserved, variables in GR; the full radiation field, not just a few moments; all particle positions; etc.).
All user-allocated mesh and block data arrays.
Everything in Real (default double) as opposed to H5T_NATIVE_REAL (default single) precision.

For these reasons restart files are generally at least 2 times the size of HDF5 data dumps, and sometimes much more. So even if we had a single unified file format, one would still want to output restartable files less frequently than other snapshots.

On the topic of restart portability, the files are architecture/compiler-dependent in the following ways:

All numbers are dumped as-is in memory, including native endianness, with no indication in the file of what the format is. (HDF5 describes the length and endianness of all data.)
That little metadata section contains a RegionSize struct written as-is, which in practice may contain 4 padding bytes for alignment purposes.

It wouldn't be too hard to change these, but doing so will probably break backward compatibility.

It's worth noting that restart files contain less description of the simulation than one might expect. For instance, they say nothing about what variables are actually stored. The idea is that the restart constructors use the ParameterInput values and InitUser...() pgen functions just as the non-restart constructors do, creating just the right number of blocks and allocating all the appropriate arrays. They then read in the data from the restart file in the same order it's written to the file. So for instance if you have one executable that allocates some ruser_meshblock_data arrays and another that doesn't, they cannot read each other's restart files.

alwinm commented 6 years ago

Another question is how much sloppiness can be tolerated.

If I recall correctly, the issue here is that restart files are architecture dependent and thus are not really portable.

This would be another good reason why an hdf5-based restart option could be useful. Collaborators could send each other (or themselves, on different machines) a standard file format and use it with a variety of architectures, compilers, even try out different codes or versions of Athena starting from the same snapshot.

But this introduces some sloppiness or lack of self-consistency. I am not an expert but I can already imagine questions like "what does it mean that 1 code ran this part of the simulation and another code ran the next part."

For example, I think all of these parameters would require that you also keep track of the code version, such as the Git SHA1, in order to be truly meaningful. The set of such options for the solver and the problem are always evolving.

I initially imagined just the athinput and ./configure.py arguments, just the bare minimum needed to evolve a state forward in time and to do post-processing/analysis, but indeed it would be more self-consistent to also include the code version and use the exact same solver, problem, hardware/compiler/library versions.

Although introducing sloppiness means the feature can be mis-used, it also enables an easier way of writing initial conditions. Imagine setting initial conditons by editing an .athdf file using special functions in Python in fewer lines of code than a pgen, but also having an easier time debugging those initial conditions. (Re)starting a simulation from an .athdf file instead of a .rst file is perhaps a separate issue but related to the question of what else can/should be included in an .athdf file.

c-white commented 6 years ago

One idea that would be simple to implement and might address some of the requested features is writing a pgen (or at least a template pgen) that takes an HDF5 file (not even necessarily a fully working .athdf file) and uses the data in the file as initial conditions.

First we should make a standard HDF5 reader interface - some utility function that is given a filename and a dataset name (and maybe some slicing indices) and populates an AthenaArray with the values in the file. I know I've written such a function at least once, and I'm sure others have as well.

This would not account for all of the athinput and pgen data, so those files would still need to be handled separately, but it would allow one to easily set initial conditions from messy data rather than simple functional forms.

jmstone commented 6 years ago

I think it is wise to maintain a difference between HDF files and restart files. The latter must contain a lot of extra information. They can be dumped infrequently. They have to be dumped in double precision, while it is useful to sometimes dump HDF in single precision.

It is easy to write a script/program than can extract data from restart files. Similarly it is easy to write a problem generator that sets up initial conditions from an HDF file (I know Munan did this as part of her thesis). We should probably commit a general version of Munan’s problem generator for use by others. The next person to start runs from an HDF file can update Munan’s file and commit it.

Note that restart files should already contain the configure information in that they contain a dump of the input file. So as long as you write restart files you can always figure out everything about a run. At least that was the intention.

If there is some particular need for a feature for something somebody is doing now we can always add it. However, unless it is truly needed, my philosophy is to keep things as simple as possible.

On Jul 5, 2018, at 5:42 PM, Kyle Gerard Felker notifications@github.com wrote:

Thanks for prompting this discussion, Alwin.

I don't think anything that you have suggested is infeasible or costly from a technical point of view, but it is more of a philosophical question of "portability" that I imagine went into deciding the current setup. We can easily add more file-level attributes to the HDF5 outputs and/or processing with athena_read.py, as I requested for reading the time value in #33 https://github.com/PrincetonUniversity/athena/issues/33.

However, once you start expanding the set of output parameters bundled with each HDF5 file beyond the self-describing physical/mesh quantities (resolution, timeslice, coordinates), I am not sure where you draw the line: all of the values in the associated athinput. file? The ./configure.py arguments? The target hardware/compiler/library versions? For example, I think all of these parameters would require that you also keep track of the code version, such as the Git SHA1, in order to be truly meaningful. The set of such options for the solver and the problem are always evolving.

The current minimalist approach might require some Athena++ users to devote a good deal of time and effort developing their own workflows for organizing and annotating their outputs. But, I don't think that can really be avoided.

If it is a matter of convenience, I think a Python script for interacting more transparently with restart files could be a good idea (I don't use restart files very often). Also, not all parameters are written to restart files, see: https://github.com/PrincetonUniversity/athena/blob/f35362df4b369e077d11d84f6da32926185c2207/src/outputs/restart.cpp#L101-L108 https://github.com/PrincetonUniversity/athena/blob/f35362df4b369e077d11d84f6da32926185c2207/src/outputs/restart.cpp#L101-L108 I would be interested to hear what @jmstone https://github.com/jmstone 's thoughts are regarding this design choice. Does anyone know how other codes approach this bookkeeping issue?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/PrincetonUniversity/athena/issues/143#issuecomment-402861215, or mute the thread https://github.com/notifications/unsubscribe-auth/AEO7zinUNPn2knvUh7ZvEFjg-i1MVUDYks5uDohLgaJpZM4VEgNB.

tomidakn commented 6 years ago

I think the restart output and the HDF output should be separete. A restart file contains most information of a simulation run but not all, as many data that can be calculated (e.g. primitive variables, coordinates, etc.) are not stored (and note that it does not know anything about the code itself nor the problem generator). I believe the HDF output should be as compact as possible, but at the same time it must be more flexible than the restart file to accommodate user-defined outputs. In short, they are different, although in some situations they look similar.

To be honest, I hate HDF5. It is slow and not scalable. The current data format is tuned for performance but as a result we cannot fully utilize its hierarchical structure and the format is ugly. In the long run, I believe we should abandon HDF5 and move to our own format, providing analysis code package including a reader plug-in for VisIt etc.. Then this format can be similar to the restart file, and we may be able to make them somewhat compatible. However I do not think we should invest too much on HDF5 now.

jmstone commented 6 years ago

To be honest, I hate HDF5. It is slow and not scalable. The current data format is tuned for performance but as a result we cannot fully utilize its hierarchical structure and the format is ugly. In the long run, I believe we should abandon HDF5 and move to our own format, providing analysis code package including a reader plug-in for VisIt etc.. Then this format can be similar to the restart file, and we may be able to make them somewhat compatible. However I do not think we should invest too much on HDF5 now.

Yes, I agree completely with this. I very much would like to abandon HDF5 and replace it with our own files written by mpi-io. I have tried to get the viz staff at PICSciE help us write a plug-in for VisIt so that we could go this route but so far with no luck.

felker commented 6 years ago

So, if I am interpreting the last few posts correctly, the actionable steps are:

Add an HDF5 reader interface. Include a version of @munan 's problem generator for starting simulations from this data (instead of from a functional form) + athinput. file.
Perhaps write a Python script for parsing Restart files?

And in the long term, move away from HDF5 completely.

@msbc , what input formats are you planning on supporting for the table reader in #56?

munan commented 6 years ago

What I wrote in my thesis is actually a reader of the VTK output of athena4.2. I'm happy to merge it to the code, but I'm not sure whether it's that useful for other people. If people want a reader for the VTK file in Athena++, it would be easy to modify from what I have, and I can do it if needed.

felker commented 6 years ago

After discussing this with @munan in person, I think it would be good to merge her VTK-generated initial condition pgen/ file, then either duplicate it for HDF5 or extend it to accept either input file format.

c-white commented 6 years ago

As an update, I made the hdf5_reader branch with a partially working general purpose reader. There's still some bugs to be worked out, mostly to do with hyperslabbing. The branch has a new pgen that initializes values from an HDF5 file, as a way of testing the reader. Ultimately of course we'll want to merge the VTK and HDF5 capabilities.

felker commented 6 years ago

Closing this since #146 was merged and address the core of this issue. Still, we could consider adding:

[ ] VTK reader counterpart
[ ] Python script for parsing .rst files