gafusion / omas

Ordered Multidimensional Array Structure
http://gafusion.github.io/omas
MIT License
32 stars 15 forks source link

Conversion between tabular/tensor-like data and nested tree data #78

Closed Karel-van-de-Plassche closed 4 years ago

Karel-van-de-Plassche commented 4 years ago

Posting an issue as I couldn't find it clearly in the documentation. Are there axis-aware tools already in OMAS that can convert between tabular data and nested data? I know IMAS likes nestedness, but I tend to prefer pandas DataFrames and/or xarray DataSets. A column in a DataFrame would be equivalent to a leaf node in nested tree, which is not the same as the current omas to_hdf5 functions if I understand correctly, where every leaf is still a leaf, and not matched to its respective time/radial basis.

Karel-van-de-Plassche commented 4 years ago

As an example use-case, I copy here the post I made on a topic where we try to collect and unify many of the JETTO pythontools floating around.

From https://git.ccfe.ac.uk/jintrac/jetto-pythontools/issues/2:

I'm planning to create a few Pythontools for JETTO. It would be nice to have some canonical in-memory representation of JETTO data (e.g. jetto.jsp, jetto.in and family) where all tools would hook into, for maximum portability and modularity. Kinda on the level of the 'OMFIT tree' that is used to communicate between OMFIT modules. I'm not sure about existing workflows and/or ideas of how to do this. I have a few options:

Nested 2D numpy arrays

This is more-or-less how IMAS and OMAS/OMFIT would do it, but I found it convenient to merge the grids/timebases into 2D arrays, where OMAS/IMAS would have every value in a leaf (or maybe 1D arrays of either time or space)

{'INFO': {
    'SHOT': {
        'FORM': 'int',
        'LABEL': 'Shot',
        'SECNUM': 2},
    'XVEC1': {
        'DESC': 'RHO normalised',
        'FORM': 'float',
        'LABEL': 'XVEC1',
        'SCALE': 1.0,
        'SCSTR': '1.0000000E+00',
        'SECNUM': 3,
        'UNITS': None}
    ....
    }
 },
 'R': array([[3.00859894, 3.05335419, 3.09792725, ..., 3.80570496, 3.82918396, 3.85201569]
              ...,
              [3.00859894, 3.05335419, 3.09792725, ..., 3.80570496, 3.82918396, 3.85201569]])
 , ...

1D style (I guess? There is no convention here..):

{R': [
    array([3.00859894, 3.05335419, 3.09792725, ..., 3.80570496, 3.82918396, 3.85201569]),
    array([3.00859894, 3.05335419, 3.09792725, ..., 3.80570496, 3.82918396, 3.85201569])
    ]}

Index to get a specific time slice example:

from EX2GK.jetto_pythontools.jetto_tools.binary import read_binary_file
import matplotlib.pyplot as plt
tind = -1
jsp = read_binary_file('/common/cmg/jcitrin/jetto/runs/run92436_RAPTORbenchmark_newEQDSK_QLK/jetto.jsp')
print('t={:.2f}'.format(jsp['TIME'][tind,0,0])) # Somehow time is 3D, not sure why Aaron?
te = jsp['TE'][tind, :] # First index is time by convention
# jsp_slice = jsp['TE'][tind] # If it would be 1D/IMAS/OMFIT style I guess, have to discus with Orso
rho = jsp['XVEC1'].T # By inspecting shapes/necessity, TE is on XVEC1, and XVEC1 is by convention a 1xrho 2D array
plt.plot(rho, te)

Pros:

Cons:

pandas DataFrames

Consolidate data on the same base into a single pandas dataframe

                           A           ANGF      AREA       B0B       BB0        BP      BPOL           CUBS  CUEC          CUNB  ...  VPOZ        VTOR  WALD          WNBD          WRFD        XA      XPSI      XPSQ      XRHO    ZEFF
time      space                                                                                                                   ...                                                                                                    
50.000000 0.021277  0.112152  114296.953125  0.002058  1.000143  1.000030  0.039838  0.026193     194.466461   0.0    451.146149  ...   0.0  343873.680   0.0  23844.879688  24402.785938  0.025944  0.001567  0.039585  0.021277  1.0001
          0.063830  0.112219  113998.664062  0.018531  1.001265  1.000247  0.119564  0.078610    1562.255615   0.0   2509.284912  ...   0.0  348078.280   0.0  23983.689063  24845.514063  0.077632  0.007314  0.085520  0.063830  1.0001
          0.106383  0.112362  113387.328125  0.051515  1.003346  1.000524  0.199497  0.131072    4449.758789   0.0   5763.462891  ...   0.0  351265.680   0.0  23820.018750  26400.412500  0.129109  0.019858  0.140918  0.106383  1.0001
          0.148936  0.112607  112442.445312  0.101081  1.005896  1.000390  0.279314  0.183436    9205.837891   0.0   9780.824219  ...   0.0  353310.200   0.0  23174.348438  26210.526563  0.180173  0.038675  0.196659  0.148936  1.0001
          0.191489  0.112965  111137.023438  0.167298  1.008457  0.999424  0.357976  0.235292   16094.679688   0.0  14301.555664  ...   0.0  354064.720   0.0  22039.073438  22129.159375  0.230638  0.063733  0.252454  0.191489  1.0001
          0.234043  0.113435  109439.210938  0.250231  1.010667  0.997325  0.434648  0.285871   25211.470703   0.0  19135.205078  ...   0.0  353366.400   0.0  20642.854688  15805.965625  0.280349  0.094928  0.308104  0.234043  1.0001
...
          0.914894  0.140057   16513.597656  3.782796  1.028083  0.883343  0.616263  0.346772  278464.968750   0.0  73925.960938  ...   0.0   62845.885   0.0   1256.954004    126.861011  0.946516  0.926041  0.962310  0.914894  1.0001
          0.957447  0.143909    8390.011719  4.106555  1.008494  0.862467  0.576478  0.317228  326755.500000   0.0  74584.828125  ...   0.0   32126.895   0.0    372.336401     36.150946  0.973632  0.965265  0.982479  0.957447  1.0001
          1.000000  0.148064    4115.367676  4.393200  0.983238  0.841028  0.527873  0.269787  370583.218750   0.0  74461.070312  ...   0.0   15852.460   0.0      0.000000      0.000000  1.000000  1.000000  1.000000  1.000000  1.0001

[1200 rows x 114 columns]

To get a specific slice:

from EX2GK.jetto_pythontools.jetto_tools.binary import read_binary_file
from plotruns import jsp_to_pandas #Not in jetto tools yet
jsp = read_binary_file('/common/cmg/jcitrin/jetto/runs/run92436_RAPTORbenchmark_newEQDSK_QLK/jetto.jsp')
xvec1, xvec2, meta = jsp_to_pandas(jsp)
xvec1_timeslice = xvec1.loc[xvec1.index.get_level_values('time')[-1]] # This is the awkward syntax I spoke off..
xvec1_timeslice['TE'].plot() # Use pandas index-aware plotting

Pros:

Cons:

Combo nested pandas

Kinda like Nested 1D numpy arrays, but with pd.Series instead of np.ndarray (does anyone use pandas?)

Multiruns?

same question here, you can combine multiple runs into one big table too, or keep it nested (I actually prefer nested here)

orso82 commented 4 years ago

@Karel-van-de-Plassche I am not familiar with JETTO. If I understand correctly, you already have some tools to parse the outputs (e.g. read_binary_file) but want to organize your data in some convenient way.

Why not using xarray.Dataset to represent your data? xarrays can slice data very elegantly and OMFIT can handle xarrays nicely ;)

Karel-van-de-Plassche commented 4 years ago

I am not familiar with JETTO. If I understand correctly, you already have some tools to parse the outputs (e.g. read_binary_file) but want to organize your data in some convenient way.

Yes indeed. JETTO outputs many ASCII files, and there are many different scripts floating around to convert them in something useful, and work with them. Most of these ASCII's can be represented in python as a collection of values on different grids (usually 2 spatial grids and 1 temporal one for simulated data). I'm collecting these scripts and combining them to something useful, to avoid duplicate work in the future.

Why not using xarray.Dataset to represent your data? xarrays can slice data very elegantly and OMFIT can handle xarrays nicely ;)

Yes! I love xarray myself, but a lot of the data processing we do does not work nicely in xarray (it's not nicely gridded), so most of our users are more familiar with pandas, which is why I thought about pandas first. Also, OMFIT doesn't seem to handle pandas MultiIndex correctly, and honestly I don't find MultiIndices so user friendly (see aforementioned annoying syntax).

Basically I was more asking 'what kind of tools exist in OMFIT to handle random data', so that I can choose my representation accordingly, in case we ever want to build OMFIT modules. Personally that is unfortunately very low on the priority list; I tend to work standalone, scriptbased, minimalistic, and open source (even not behind password/registration, sorry!) myself. ODS fits nicely in this workflow though, so OMFIT compatibility should not be a problem.

To show a bit better what I mean, find attached three of the options, represented as a ODS (from an OMFIT tree, if you have a better way to share let me know) brine.zip

Converting between nested_2D style <> pandas <> xarray is pretty trivial. Going to a more nested style that maybe would match ODS/IMAS/OMFIT better (e.g. leafs of 1D radial, or leafs of 0D data) would be a bit more work with handling the indices correctly. If there are tools of this type, please let me know, I'd love to test/use them.

from OMFITlib_jsp import *
import xarray as xr
jsp_root = '/home/karel/working/Karel-QLKNN-paper/plot_reproduction/plot_raptor_benchmark'
jsp = read_binary_file(os.path.join(jsp_root, 'run92436_RAPTORbenchmark_newEQDSK_QLK.jsp'))
ods = ODS(consistency_check=False)

# Nested 2D dicts style
ods.from_structure(jsp)
root['OUTPUTS']['nested_2d'] = ods

# Pandas style
ods_pandas = ODS(consistency_check=False)
xvec1, xvec2, meta =
 jsp_to_pandas(jsp)
ods_pandas['xvec1'] = xvec1
ods_pandas['xvec2'] = xvec2
ods_pandas['meta'] = meta
root['OUTPUTS']['pandas'] = ods_pandas

# Xarray style
ds1 = xvec1.to_xarray()
ds1 = ds1.rename({'space': 'space1'})
ds2 = xvec2.to_xarray()
ds2 = ds2.rename({'space': 'space2'})
dss = xr.merge([ds1, ds2])
dss.attrs = meta
ods_xarray = ODS(consistency_check=False)
ods_xarray['data'] = dss
root['OUTPUTS']['xarray'] = ods_xarray

Nested 2D dicts style is tangentially related to https://github.com/gafusion/omas/issues/74

Karel-van-de-Plassche commented 4 years ago

PS. extra reason I'm asking here is that you have a lot of experience with user-friendliness, while I tend to build stuff to complicated for the average user (whoops) ;) Although almost all of our users are also developers.

PPS. Looks like OMFIT has OMFITdataset which subclasses a monkey-patched xarray.core.dataset.Dataset. If I would create a JETTOdataset, that would probably be instantly compatible right? (assuming I use composition over inheritance )

orso82 commented 4 years ago

@Karel-van-de-Plassche perhaps it OMAS would be useful if you decided to use the IMAS naming conventions, but it seems to me that right now OMAS is not buying you anything here besides unnecessary complexity.

Looks like you have already done the hard work of converting everything to xarrays ;) How about defining your own JETTO class, which subclasses from xarray.Dataset. Looks pretty elegant to me and easy to use 👍You can then add specific methods to this class, and it would all play out very nicely within OMFIT! finally, I'd use more user-friendly names for the spatial coordinates, something like rho_nodes and rho_segments.

class JETTO(Dataset):

    def __init__(self, filename):
        from EX2GK.jetto_pythontools.jetto_tools.binary import read_binary_file
        jsp = read_binary_file(filename)
        xvec1, xvec2, meta = jsp_to_pandas(jsp)
        Dataset.__init__(self)
        ds1 = xvec1.to_xarray()
        ds1 = ds1.rename({'space': 'rho_nodes'})
        ds2 = xvec2.to_xarray()
        ds2 = ds2.rename({'space': 'rho_segments'})
        self.update(xarray.merge([ds1, ds2]))
        self.attrs.update(meta)

jsp_root = '/home/karel/working/Karel-QLKNN-paper/plot_reproduction/plot_raptor_benchmark'
jsp_filename = os.path.join(jsp_root, 'run92436_RAPTORbenchmark_newEQDSK_QLK.jsp')
jetto = JETTO(jsp_filename)

image

Karel-van-de-Plassche commented 4 years ago

Do I have to subclass necessarily for OMFIT, or is composition also fine? xarray devs recommend against subclassing, but I'd rather take the dev complexity hit and be OMFIT compatible.

If I don't use IMAS naming conventions, OMAS does give me the nice slicing of nested and structured still, right? However, there are many non-OMAS deep slicing alternatives {pythonQL (Not sure if this is related to JSONQL), YAQL, ObjectPath, JMESPath}

orso82 commented 4 years ago

@Karel-van-de-Plassche I find subclassing easier for users to understand: they just have to know that it works like an xarray.Dataset. I just tried to take a slice and I got back a JETTO object, as expected.

jetto_slice=jetto.isel(time=2)

image

If you avoid using internal xarray APIs when subclassing, then you'd avoid that your code breaks when xarray upgrades. That said, you can use composition if you prefer, but I'd like to see an example where subclassing breaks down.

I am not sure what you'd use OMAS for (or any of the other packages), since xarray.Datasets are flat and they natively support slicing of the multidimensional data...

If I were you, I'd develop the JETTO class independently of OMFIT, and then have an OMFITjetto class that subclasses JETTO while adding some of the functionalities that are OMFIT specific. The code below should work with the latest OMFIT unstable. Let me know if it gives you trouble.

# standalone JETTO class, independent of OMFIT
class JETTO(Dataset):

    def __init__(self, *args, **kw):

        filename = None
        args = list(args)
        if len(args) and isinstance(args[0], basestring):
            filename = args.pop(0)
        if 'filename' in kw:
            filename = kw.pop('filename')
        # allow initializing this class from a jsp filename
        if filename is not None:
            jsp = read_binary_file(filename)
            xvec1, xvec2, meta = jsp_to_pandas(jsp, *args, **kw)
            Dataset.__init__(self)
            ds1 = xvec1.to_xarray()
            ds1 = ds1.rename({'space': 'rho_nodes'})
            ds2 = xvec2.to_xarray()
            ds2 = ds2.rename({'space': 'rho_segments'})
            self.update(xarray.merge([ds1, ds2]))
            self.attrs.update(meta)
        # allow initializing this class from as a bare Dataset object
        else:
            Dataset.__init__(self, *args, **kw)

    def test_function(self, what):
        print(self[what])

jetto = JETTO(None)
jetto.test_function('CHEW')

jetto_slice = jetto.isel(time=2)
jetto_slice.test_function('CHEW')

# JETTO class within OMFIT
# carries with it the original jsp file when saving a project
class OMFITjetto(OMFITdataset, JETTO):
    @dynaLoad
    def load(self):
        JETTO.__init__(self, self.filename)

oj = OMFITjetto('test')
oj.test_function('CHEW')

oj_slice = oj.isel(time=2)
oj_slice.test_function('CHEW')
Karel-van-de-Plassche commented 4 years ago

I am not sure what you'd use OMAS for (or any of the other packages), since xarray.Datasets are flat and they natively support slicing of the multidimensional data...

Not sure yet, but JETTO has many files and not 100% sure if they all fit nicely in a Dataset, but for sure this specific file I wouldn't need OMAS/nestedness. Thanks a lot for your help and suggestions, I think I now have a clear idea what to do.