Store atomic and isotopic data in HDF5 format

PlasmaPy / PlasmaPy

An open source Python package for plasma research and education

http://docs.plasmapy.org

BSD 3-Clause "New" or "Revised" License


Store atomic and isotopic data in HDF5 format #1915

Open namurphy opened 1 year ago

namurphy commented 1 year ago

Feature description

Currently, our atomic data files are stored in plasmapy/particles/data in elements.json and isotopes.json. This issue proposes to change them to HDF5 format.

Motivation

JSON is human readable but slow to access, while HDF5 is less human readable but fast to access. Elemental and isotopic data are accessed very frequently (in particular by @particle_input), so performance matters more than readability in this case. Additionally, when we update the data sets (#1914), we'll probably end up rebuilding the files entirely, so readability of the data itself is not as important.

JSON is also better suited to tracking changes via version control, but we haven't updated the files in 2.5 years, so that's less of an issue too.

Implementation strategy

We could probably do this at the same time that we address #1914.

Additional context

See also #591.

namurphy commented 1 year ago

@rocco8773 pointed out to me today that the contents of HDF5 files are not loaded into memory all at once but rather as needed, so it's likely that this would improve import speed but there might be a slight penalty when instantiating a Particle. We'd probably need to check on this.
"Storing isotopic data" can be sung to the theme song of Teenage Mutant Ninja Turtles.

StanczakDominik commented 1 year ago

Wait a minute. IIRC the entire JSON is loaded into memory at (sub)package import time. If so, you'd gain precisely nothing at particle creation time by changing the storage mechanism.

Moreover, h5py is... unwieldy at best, and a heavy dependency.

This is not something we should do unless and until we actually profile the runtime of a particle initialization and prove that loading the data is an important factor in its speed that actually needs optimizing.
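A minimal sketch of the kind of measurement meant here, using only the standard library. The `fake_elements` table is a made-up stand-in, not the real schema of elements.json, and `load_all` and `lookup` are hypothetical helpers rather than PlasmaPy functions. It compares the one-time cost of parsing a whole JSON document with the per-call cost of looking up one entry that has already been parsed.

```python
import json
import timeit

# Hypothetical stand-in for elements.json; the real file's schema is NOT reproduced here.
fake_elements = {
    f"El{i}": {"atomic number": i, "atomic mass": f"{2.0 * i} u"}
    for i in range(1, 119)
}
raw = json.dumps(fake_elements)


def load_all():
    """Parse the whole JSON document, as would happen once at import time."""
    return json.loads(raw)


# Parsed once up front, mimicking data that already lives in memory after import.
table = load_all()


def lookup(symbol="El26"):
    """Fetch one already-parsed entry, as would happen for each particle."""
    return table[symbol]


n = 200
load_time = timeit.timeit(load_all, number=n) / n
lookup_time = timeit.timeit(lookup, number=n) / n

print(f"parse entire document: {load_time * 1e6:.1f} us per call")
print(f"single dict lookup:    {lookup_time * 1e6:.3f} us per call")
```

Numbers from a toy like this say nothing about PlasmaPy itself. The same pattern would have to be pointed at the real data files and at actual particle creation before drawing conclusions.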

If we really wanted to speed up particles... I don't know, I briefly thought about sticking them into a dataclass (#1110 is the most related, I suppose), but then again astropy.units will be slowing it down as well.

namurphy commented 1 year ago

I do remember trying out imports of different subpackages a while back, and importing plasmapy.particles had been the slowest (I think maybe around the time of #1630). My main hypotheses about the causes were the time it takes to read in the JSON files, and/or the time it takes to automatically instantiate a few particles (like plasmapy.particles.proton and plasmapy.particles.electron).

What I'm wondering about is that...the vast majority of elemental and isotopic data will not get used in a typical application, so in principle, lazy loading of data would be really helpful. However, I agree with you that profiling would be necessary for us to make good decisions.
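As a rough illustration of the lazy-loading idea, here is a sketch that defers reading a JSON file until the first lookup and then caches the parsed result. The file name `elements_demo.json`, its two-entry contents, and the helpers `_element_table` and `element_data` are all hypothetical and not part of PlasmaPy.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path

# Hypothetical data file standing in for elements.json (made-up contents).
data_dir = Path(tempfile.mkdtemp())
data_file = data_dir / "elements_demo.json"
data_file.write_text(
    json.dumps({"H": {"atomic number": 1}, "He": {"atomic number": 2}})
)

load_count = 0  # counts how many times the file is actually read


@lru_cache(maxsize=None)
def _element_table():
    """Read and parse the file on first use only; later calls hit the cache."""
    global load_count
    load_count += 1
    return json.loads(data_file.read_text())


def element_data(symbol):
    """Return the entry for one element, triggering the load if needed."""
    return _element_table()[symbol]


before = load_count            # nothing has been read yet
helium = element_data("He")    # first access reads and parses the file
hydrogen = element_data("H")   # second access reuses the cached table
print(before, load_count, helium, hydrogen)
```

This only moves the cost from import time to first use; the whole file is still parsed at once. Loading individual entries on demand would need a format or file layout that supports partial reads.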

I just started playing with...

$ python -X importtime -c "import plasmapy" 2> import_plasmapy.log
$ pip install tuna
$ tuna import_plasmapy.log

...using tuna to visualize it.

namurphy commented 1 year ago

Also...maybe HDF5 would not be the best binary format to consider. There are other alternatives with better performance.