enzo-project / enzo-e

A version of Enzo designed for exascale and built on charm++.

Plan to write extra metadata for ``yt``: Any requests/suggestions? #352

Open · mabruzzo opened this issue 1 year ago

mabruzzo commented 1 year ago

Overview

I would like to improve the yt-frontend for Enzo-E. For example, it would be really nice to automatically:

  1. detect which fields are passively advected scalars
  2. detect/define species and thermodynamic fields

Basically, I'm thinking that we could introduce an HDF5 group called "description" (or "metadata" or something else?) in all output files (even those written as checkpoints) and then store extra metadata as attributes within that group. The idea is that this info would be write-only (it would not be used when restarting a simulation).

Within this group, we might record the field-groups; that would help a lot with items 1 and 2 above.

We could also write some physics-dependent metadata.
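
To make this concrete, here is a minimal sketch (in Python with h5py) of what writing such a group could look like. The group name "description", the attribute names, and the field names are all placeholders rather than a proposed schema:

```python
import h5py
import numpy as np

# Hypothetical sketch: attach a "description" group with metadata attributes
# to an existing Enzo-E output file. All names below are placeholders.
with h5py.File("output.h5", "a") as f:
    grp = f.require_group("description")

    # which fields belong to which field-groups
    # (e.g. which fields are passively advected scalars)
    grp.attrs["group_passive_scalars"] = np.array(["metal_density"], dtype="S")
    grp.attrs["group_species"] = np.array(["HI_density", "HII_density"], dtype="S")

    # some physics-dependent metadata
    grp.attrs["gamma"] = 5.0 / 3.0
```

On the yt side, the frontend could then read these attributes once per dataset instead of inferring field roles from field names.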

Purpose of this issue

Basically I'm opening this issue for 2 reasons:

  1. to solicit feedback from other people (especially people familiar with yt) about what other information we should record within this group.

  2. to discuss the optimal approach for implementing this in the codebase in a manner that's portable across output approaches (since I'm not very familiar with the I/O infrastructure). I was thinking about maybe writing an IoMetaData class that we could subclass/customize in the enzo-layer... @jobordner, do you think this seems viable? (Maybe I could do something simpler and just register a callback somewhere.)

matthewturk commented 1 year ago

Hi @mabruzzo, there are a few things that I think would improve the QOL for enzo-e in yt. We have a frontend, whose development @BolunThompson has led and which is still being integrated, that treats the data as a block-structured index rather than a patch-based one. In general, I think more metadata is good, and I'm eager to work with you on enumerating that list.

The other item, which I think is much more invasive but would also likely improve the performance of the yt frontend considerably, is to change the way the patches are stored. Two specific changes would be very helpful:

  1. Add a system that allows us to more easily identify the positions of the patches (preferably without string parsing).
  2. Store them as a single large dataset within each output file, so that rather than N datasets of size (P, Q, R) each, they are stored as one dataset for each field of shape either (N, P, Q, R) or (P, Q, R, N). (N here is the number of patches within that individual output file.)
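
To illustrate what point 2 might look like on disk (a sketch only; the file name, field names, and block count are made up), assuming the (N, P, Q, R) ordering:

```python
import h5py

N, P, Q, R = 8, 16, 16, 16  # e.g. 8 blocks of 16^3 cells in this output file

# Hypothetical consolidated layout: one (N, P, Q, R) dataset per field,
# instead of N separate (P, Q, R) datasets (one per block).
with h5py.File("consolidated_output.h5", "w") as f:
    for field in ("density", "velocity_x"):
        f.create_dataset(field, shape=(N, P, Q, R), dtype="f8")

    # yt could then read every block's data for a field in one call ...
    all_density = f["density"][...]   # shape (N, P, Q, R)
    # ... or slice out a single block without touching the others.
    one_block = f["density"][3]       # shape (P, Q, R)
```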

I recognize that point 2 is likely intractable, but I wanted to put it out there anyway. For point 1, it would also be very helpful to have a binary index to which we could apply a uniform (or even tightly-looped) operation that translates it into, say, a Z-order index (with the bits pre-swizzled).
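
For reference, a small sketch of the kind of "pre-swizzled" Z-order key described here: the bits of the three per-axis block indices are interleaved into a single integer (assuming, for the moment, the same number of bits along every axis):

```python
def morton3d(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of (ix, iy, iz) into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)        # x bit -> position 3b
        key |= ((iy >> b) & 1) << (3 * b + 1)    # y bit -> position 3b + 1
        key |= ((iz >> b) & 1) << (3 * b + 2)    # z bit -> position 3b + 2
    return key

# morton3d(1, 0, 0) == 1, morton3d(0, 1, 0) == 2, morton3d(1, 1, 1) == 7
```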

mabruzzo commented 1 year ago

Thanks for responding @matthewturk. Any help would definitely be useful! Let me just quickly respond to your suggestions:

  1. Add a system that allows us to more easily identify the positions of the patches (preferably without string parsing).

This is definitely doable! Internally, the location of each block is tracked with a 96-bit index: it tracks the block's level, the position of the current block (or its ancestor) on the root grid, and the location relative to the parent within the root block (I'm a little fuzzy on how the internal representation maps to root levels). The string name that we assign to each block is just a translation of this index.

We could easily store the position of each block in a more accessible manner. There's even a method we use for load-balancing that orders blocks in terms of their Morton index, so we can probably just extract the logic from there.
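
As a purely hypothetical example of what "a more accessible manner" could mean, the per-block positions could be written as a small integer dataset alongside the field data (the dataset name and column layout here are made up):

```python
import h5py
import numpy as np

# Hypothetical "block_index" dataset: one row per block in this output file,
# storing (level, ix, iy, iz), where (ix, iy, iz) is the block's integer
# position within its refinement level.
block_index = np.array(
    [
        [0, 0, 0, 0],   # a root-level block at (0, 0, 0)
        [1, 3, 1, 0],   # a level-1 block at (3, 1, 0)
    ],
    dtype=np.int32,
)

with h5py.File("output.h5", "a") as f:
    f.create_dataset("block_index", data=block_index)
```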

  2. Store them as a single large dataset within each output file, so that rather than N datasets of size (P, Q, R) each, they are stored as one dataset for each field of shape either (N, P, Q, R) or (P, Q, R, N). (N here is the number of patches within that individual output file.)

I think this is something we can definitely work towards (but it would involve some larger refactoring).

mabruzzo commented 1 year ago

@matthewturk and @BolunThompson - After thinking about this a little more, there is a small wrinkle that I would appreciate some clarification on.

All discussions I've seen of swizzled Z-order indices assume that the domain is divided into the same (power-of-two) number of sub-cells along each axis. However, non-cosmological Enzo-E simulations commonly don't satisfy this assumption.

For example, we could have a simulation with 16 root-blocks[^1] along the x-axis, 4 root-blocks along the y-axis, and 2 root-blocks along the z-axis. In this scenario, a root-block (or its descendants) always requires a different number of bits to encode its position along each axis: the position along the x-axis requires 2 more bits than the position along the y-axis and 3 more bits than the position along the z-axis.
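
To spell out the bookkeeping behind those numbers (nothing here is Enzo-E specific):

```python
# Bits needed to encode a root-block position along each axis of a
# 16 x 4 x 2 root grid (each count is assumed to be a power of two).
root_blocks = (16, 4, 2)
root_bits = [n.bit_length() - 1 for n in root_blocks]   # [4, 2, 1]

# Each refinement level adds one bit per axis, so the x-axis always needs
# 2 more bits than the y-axis and 3 more than the z-axis.
level = 3
bits_at_level = [b + level for b in root_bits]          # [7, 5, 4]
```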

From the perspective of the new frontend, what would be the most useful thing for us to save in such a scenario?

[^1]: As you're probably aware, Enzo-E divides the domain into an array-of-octrees. In other words, we have a root grid where the number of blocks along each dimension is a power of two, and each of these root blocks can be refined as an octree.

matthewturk commented 1 year ago

That's an interesting point, and one I had not thought of. I suppose I would reframe my suggestion as follows: either apply the swizzling within the individual octrees (rather than across the array-of-octrees), or just include the individual 32-bit axial indices as 32-bit numbers rather than the string representation. In the former case, two keys would be required: the index into the forest, and then the Z-order key within that octree.
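
A rough sketch of that two-key option (all names and the worked example are illustrative, not drawn from Enzo-E or the yt frontend):

```python
def octree_two_key(root_pos, local_pos, level, root_blocks=(16, 4, 2)):
    """Return (forest_key, zorder_key) for a block.

    root_pos  : (ix, iy, iz) of the block's root ancestor on the root grid
    local_pos : (ix, iy, iz) of the block within that octree at `level`
    level     : refinement level (0 = the root block itself)
    """
    nx, ny, nz = root_blocks
    # Key 1: flat index of the root block within the array-of-octrees.
    forest_key = root_pos[0] + nx * (root_pos[1] + ny * root_pos[2])

    # Key 2: Z-order key within the octree; all axes use `level` bits.
    zorder_key = 0
    for b in range(level):
        zorder_key |= ((local_pos[0] >> b) & 1) << (3 * b)
        zorder_key |= ((local_pos[1] >> b) & 1) << (3 * b + 1)
        zorder_key |= ((local_pos[2] >> b) & 1) << (3 * b + 2)
    return forest_key, zorder_key

# e.g. a level-2 block in the octree rooted at (5, 1, 0):
# octree_two_key((5, 1, 0), (2, 3, 1), 2) -> (21, 30)
```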

However, I will also just note that this is not necessarily the problematic area, and I didn't mean to derail the discussion about fields etc., which are likely much easier to modify to improve QOL.