cbyrohl / scida

scida is an out-of-the-box analysis tool for large scientific datasets. It primarily supports the astrophysics community, focusing on cosmological and galaxy formation simulations using particles or unstructured meshes, as well as large observational datasets. This tool uses dask, allowing analysis to scale.
https://scida.io
MIT License

spatial load of particle data #79

Open dnelson86 opened 1 year ago

dnelson86 commented 1 year ago

Feature request: function/helper to load the particle-level data for a specific region of space.

Support for simple geometries (cube, sphere, cuboid) would be sufficient.

Approach: just start with a global load, then apply a simple mask (see the sketch below). This would serve as a generic fallback for any more sophisticated approach.
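
A minimal sketch of this fallback, assuming the usual scida entry point (`scida.load`) and dask-array fields under `ds.data["PartType0"]`; the helper name and snapshot path are hypothetical, and periodic wrapping is not handled:

```python
import dask.array as da
from scida import load

def select_box(ds, center, halfsize, parttype="PartType0"):
    """Return a boolean dask mask selecting particles inside a cube."""
    coords = ds.data[parttype]["Coordinates"]  # lazy dask array, shape (N, 3)
    center = da.asarray(center)
    # global load is lazy; only the mask and the masked fields are computed
    return da.all(da.abs(coords - center) < halfsize, axis=1)

# usage (hypothetical snapshot path):
# ds = load("snapshot_099.hdf5")
# mask = select_box(ds, center=[1000.0, 1000.0, 1000.0], halfsize=500.0)
# masses_in_box = ds.data["PartType0"]["Masses"][mask].compute()
```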

Note that some datasets (e.g. EAGLE-Original) are already spatially ordered and have saved spatial keys. These could optionally be used, if present, to accelerate the process, i.e. to avoid the initial global load.

dnelson86 commented 11 months ago

For the THESAN data release they have implemented spatial "hashtables", which simply map uniform grid cells to particle index lists, ignoring the issue that such lists are made up of many small slices and thus result in very sub-optimal I/O.
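
A rough sketch of how such a uniform-grid hashtable could be queried: map the requested cube onto grid cells and collect the stored per-cell index ranges. The table layout here (flat `cell_offsets`/`cell_lengths` arrays over a regular `ncells^3` grid) is an assumption for illustration, not the actual THESAN format:

```python
import numpy as np

def region_to_slices(center, halfsize, boxsize, ncells, cell_offsets, cell_lengths):
    """Return index slices covering all grid cells overlapping the requested cube."""
    cellsize = boxsize / ncells
    lo = np.floor((np.asarray(center) - halfsize) / cellsize).astype(int)
    hi = np.floor((np.asarray(center) + halfsize) / cellsize).astype(int)
    slices = []
    for i in range(lo[0], hi[0] + 1):
        for j in range(lo[1], hi[1] + 1):
            for k in range(lo[2], hi[2] + 1):
                # periodic wrap of cell indices, flattened to a 1d table index
                idx = ((i % ncells) * ncells + (j % ncells)) * ncells + (k % ncells)
                start = int(cell_offsets[idx])
                slices.append(slice(start, start + int(cell_lengths[idx])))
    return slices  # typically many small, scattered slices -> poor raw I/O
```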

The (heuristic) choice of how to service a requested load made up of a (large) series of slices is then a separate issue, which could be optimized further (see the sketch below).
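
One possible heuristic: sort the slices and merge neighbours whose gap is below a tolerance, trading some over-read for fewer, larger contiguous reads. The `max_gap` threshold is a tunable assumption, not a value from the data release:

```python
def coalesce_slices(slices, max_gap=4096):
    """Merge index slices separated by fewer than max_gap elements."""
    if not slices:
        return []
    slices = sorted(slices, key=lambda s: s.start)
    merged = [slices[0]]
    for s in slices[1:]:
        last = merged[-1]
        if s.start - last.stop <= max_gap:
            # small gap: extend the previous read instead of issuing a new one
            merged[-1] = slice(last.start, max(last.stop, s.stop))
        else:
            merged.append(s)
    return merged
```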

Similar spatial tree creation/caching, performed on demand when a spatial region load is requested, could be implemented here. The performance benefits would have to be assessed.

cbyrohl commented 11 months ago

Sounds interesting. I would be interested to see the actual performance benchmark on a cosmological simulation like THESAN/TNG (across different redshifts; as the fraction of halo-associated and ordered particles evolves substantially). We could test this on Thesan, and based on this consider adaption to dask reads (rather than direct numpy reads per-file).