Additional Data Formats?

ax3l commented 1 year ago

Thank you for the JOSS submission in https://github.com/openjournals/joss-reviews/issues/5375 .

I really like the support of the IAEA data loaders.

Based on the extended abstract and the motivating discussion linked in it, I am curious whether, for phase space data, the openPMD standard [1] [2] (disclaimer: I lead this effort) could be helpful as an additional input loader source. By now, a relatively large selection of accelerator codes supports openPMD as an output format, and we are also trying to use it more in experimental laser-plasma accelerator work.

The paper summarizes so far:

[...] extensible library enabling import/analysis/export of PhaseSpace data of arbitrary format.

If one were to implement another loader, how much work would be needed? I am looking at https://bwheelz36.github.io/ParticlePhaseSpace/new_data_loader.html

and am further curious about data sizes: #158

Update: I found https://bwheelz36.github.io/ParticlePhaseSpace/code_docs.html#ParticlePhaseSpace.DataLoaders.Load_PandasData which might be pretty easy to couple to openPMD with https://github.com/openPMD/openPMD-api/blob/0.15.1/examples/11_particle_dataframe.py (example data sets here). (Our Pandas reader supports chunked processing - let's continue the discussion on lazy loading/streaming/out-of-core processing in #158)

Note that the linked reference Kuschel, S. (2022). Postpic. https://github.com/skuschel/postpic implemented openPMD early on. Minor correction: I think it should read (2014) as of the first release for this reference.

[1] https://github.com/openPMD [2] https://www.openPMD.org

bwheelz36 commented 1 year ago

Hey @ax3l - I have to admit that I was embarrassingly not actually aware of openPMD! It looks great.

It is a fairly minimal amount of work to add new Loaders/Exporters (depending of course on how complex the data source is). I would be happy to take a look at loading openPMD data. I don't suppose you already have some files handy I could test on? Also, I notice that openPMD supports multiple data formats. It might be quite some work to write a DataLoader that handles several formats, but as a proof of principle would it be acceptable to just demonstrate on one format?

ax3l commented 1 year ago

Hi @bwheelz36 , sorry for the edit in my original message.

I added a few example files and what is probably a four-liner to load data via an edit :)

import openpmd_api as io

s = io.Series("../samples/git-sample/data%T.h5", io.Access.read_only)
electrons = s.iterations[400].particles["electrons"]  # 400 or another "step" in the data series

df = electrons.to_df()  # careful: all SI at this point
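
Since the data frame holds SI values at this point, a loader that expects other units has to convert first. Here is a minimal sketch of that arithmetic for positions and momenta, assuming the target units are mm and MeV/c (an assumption about the consuming package, not something stated in this thread). The function names are hypothetical and not part of openPMD-api or ParticlePhaseSpace.

```python
# Exact SI defining constants
C_LIGHT = 299792458.0        # speed of light in m/s
E_CHARGE = 1.602176634e-19   # elementary charge in C (equivalently, joules per eV)


def position_m_to_mm(x_m: float) -> float:
    """Convert a position from metres to millimetres."""
    return x_m * 1.0e3


def momentum_si_to_mev_c(p_si: float) -> float:
    """Convert a momentum from kg*m/s to MeV/c.

    p * c is an energy in joules; dividing by (1e6 * e) expresses it in MeV.
    """
    return p_si * C_LIGHT / (1.0e6 * E_CHARGE)


if __name__ == "__main__":
    one_mev_c_in_si = 1.0e6 * E_CHARGE / C_LIGHT  # roughly 5.34e-22 kg*m/s
    print(position_m_to_mm(0.002))
    print(momentum_si_to_mev_c(one_mev_c_in_si))
```

The same two functions can be applied column-wise to the position and momentum columns of the data frame before handing it to a loader.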

After finishing the docs, I would also be excited to attempt an exporter :star_struck:

(Please do not feel that my implementation questions are required for the JOSS review to pass. I am just truly curious, and the other comments on the manuscript are more important to address :) )

bwheelz36 commented 1 year ago

Hi @ax3l

That's all good - given there is a defined open dataset format, it absolutely makes sense that this package should support it.

Having said that - I'm a bit confused tbh. I'm trying to run the first read example from the openpmd-api site with the following code:

import openpmd_api as io
series = io.Series( "data%T.h5", io.Access.read_only)

I pointed this code to each of the three examples, 'example-2d', 'example-3d', and 'example-thetaMode' (it is actually not that clear from the example that this is what you are supposed to do?). In each case the data loads, but there is no information in the 'iterations' attribute?

ax3l commented 1 year ago

Hi @bwheelz36,

Thanks for trying the example datasets! The iterations concept is explained here: https://openpmd-api.readthedocs.io/en/latest/usage/concepts.html

there is no information in the 'iterations' attribute?

Please let me know if you have more questions on this in case I missed the point of the question :)

Once you open a data Series, you can loop over the available iterations in it, read the data in each iteration, etc.:

- fields: https://openpmd-api.readthedocs.io/en/latest/usage/firstread.html
- particles (our case here): https://github.com/openPMD/openPMD-api/blob/0.15.1/examples/2_read_serial.cpp#L48-L67
- or as a data frame (preferred here): https://github.com/openPMD/openPMD-api/blob/0.15.1/examples/11_particle_dataframe.py#L30-L34

bwheelz36 commented 1 year ago

Hi @ax3l

Ok, here's an end to end example of what I tried. Maybe I'm doing something extremely stupid...

in a terminal:

# inside a fresh virtual environment
git clone https://github.com/openPMD/openPMD-example-datasets.git
cd openPMD-example-datasets
tar -zxvf example-2d.tar.gz
tar -zxvf example-3d.tar.gz
tar -zxvf example-thetaMode.tar.gz

pip install openpmd-api
python  # enter python session

inside python:

import openpmd_api as io

data_loc = "example-2d/hdf5/data%T.h5"
s = io.Series(data_loc, io.Access.read_only)

Here's the explorer view of s; it appears to simply have nothing in it?

ax3l commented 1 year ago

Oh that is wild, thanks for reporting! We check against most of those files in CI, but maybe something slipped in that we did not cover :-o

I will double check this after my conferences and summer break.

franzpoeschel commented 1 year ago

For this, see my comment here:

The string representations of many classes are counterintuitive and have led to confusion, e.g. series.iterations printed will look as if it is empty

I guess this point has been proven again. The data is there; it just does not look like it:

>>> import openpmd_api as io
>>> s = io.Series("data%T.h5", io.Access.read_only)
>>> s.iterations
<openPMD.Attributable with '0' attributes>
>>> [index for index in s.iterations]
[255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 305, 310, 315, 320, 325, 330, 335, 340, 345, 350, 355, 360, 365, 370, 375, 380, 385, 390, 395, 400]

franzpoeschel commented 1 year ago

Fixed in https://github.com/openPMD/openPMD-api/pull/1476

ax3l commented 1 year ago

Thank you for updating the representation strings, @franzpoeschel! This will be shipped with the next patch release, 0.15.2.

@bwheelz36 for your example above, all looks good and you can keep exploring what is inside the data series s like this:

for k_i, i in s.iterations.items():
    print("Iteration: {0}".format(k_i))

    for k_p, p in i.particles.items():
        print("  Particle species '{0}':".format(k_p))

Inside the particle species p are records such as "momentum" (key-value pairs of a string and a record), and each record holds record components such as "x", which can be accessed like a numpy array, e.g., u_x = p["momentum"]["x"][()]. Note that the array u_x is only filled with actual data once s.flush() has been called.

Even easier is the access as a data frame, as in the 11_particle_dataframe.py example:

for i in s.iterations:
    for p in i.particles:
        df = p.to_df()
        print(df)

ax3l commented 1 year ago

@bwheelz36 did this help? :)

bwheelz36 commented 1 year ago

Hi @ax3l - the first loop you posted above helps, yes - it is clear there is some data there! In that example, doing p.to_df() gives a dataframe which would facilitate a close to one-to-one read-in to ParticlePhaseSpace.

the second loop crashes with AttributeError: 'int' object has no attribute 'particles'. I added a line if hasattr(i, 'particles'): however this was never entered...

Can I make sure I understand the intent behind iterations - each iteration would represent for instance a time interval?

franzpoeschel commented 1 year ago

the second loop crashes with AttributeError: 'int' object has no attribute 'particles'. I added a line if hasattr(i, 'particles'): however this was never entered...

I think that there is a slight bug in the second loop, try this one:

for it_index, it in s.iterations.items():
    for p_name, p in it.particles.items():  # iterating a container directly yields only its keys
        df = p.to_df()
        print(df)
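
The crash above comes down to the containers behaving like Python dictionaries: iterating over one yields its keys (as the list of iteration indices earlier in the thread shows), so the values have to come from `.items()`. The sketch below uses plain dicts as a stand-in for an opened series so that it runs without openPMD-api or any data files. `fake_series` and the helper `species_per_iteration` are hypothetical names for illustration only.

```python
from types import SimpleNamespace


def species_per_iteration(series):
    """Return {iteration index: [species names]} for a series-like object.

    Relies only on `series.iterations` and `iteration.particles` being
    dict-like, i.e. supporting `.items()` and iteration over keys.
    """
    result = {}
    for it_index, it in series.iterations.items():
        # Iterating a dict-like container yields keys (here: species names)
        result[it_index] = [name for name in it.particles]
    return result


# Stand-in data (NOT openPMD-api objects): two iterations, one species each
fake_series = SimpleNamespace(
    iterations={
        255: SimpleNamespace(particles={"electrons": object()}),
        260: SimpleNamespace(particles={"electrons": object()}),
    }
)

if __name__ == "__main__":
    # Keys only: this is what a bare `for i in series.iterations` loop sees
    print(list(fake_series.iterations))
    print(species_per_iteration(fake_series))
```

Because the helper only assumes dict-like behaviour, it should work the same way on a real series object, but that has not been checked here.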

bwheelz36 / ParticlePhaseSpace

Additional Data Formats? #156