Statement of the problem
Cosmological hydrodynamical simulations are excellent numerical laboratories for investigating the formation of galaxies and large-scale structure. They provide a highly detailed realization of structures in the universe across a vast range of spatial and temporal scales (7 orders of magnitude in dynamic range in space and time). The simulation outputs are information-rich (e.g. 6D phase space plus density, gas temperature, and up to 100 isotope abundances for each gas/star particle, along with many other fields for each mass component). This complexity is too great for humans to examine and understand directly.
Traditionally, the problem has been approached by collapsing the rich multidimensional data onto a simplified 0D representation (single scalars) of galaxy/halo properties (e.g. stellar mass, morphology, half-light radius, mean surface brightness, bulge/disk ratio, maximum circular velocity). This approach was inspired by the data scarcity of observational astronomy, where it is much more efficient to measure relations between global galaxy properties (e.g. the Hubble diagram, the galaxy main sequence, the Tully-Fisher relation). The same is true for the ‘extrinsic’ causes of these properties, such as DM halo shape, environment, or assembly history. For more fine-grained analysis, these properties are expressed in 1D (density/mass profiles). Collapsing simulation data from >3D to 0/1D discards most of the detailed structural information and, with it, valuable insight into the physics of galaxy and structure formation.
Proposed solution
Dimensionality reduction is a well-developed field of machine learning that aims to create compact representations of complex high-dimensional data. These representations capture as much information as possible without the need for labels, with the goal of allowing easier visualization and interpretation. Instead of collapsing structures to 0/1D along arbitrary projections guided by human intuition, we propose to let an unsupervised dimensionality reduction ML model find the most efficient representation of simulated structures (in particular galaxies) in a latent space whose dimensionality is low enough to be inspected easily and interactively, even for the largest cosmological simulations. The galaxies in this space can be painted with the traditional 0D properties to aid human interpretation and to enable knowledge discovery. Furthermore, they can be painted using latent representations of the extrinsic variables (such as formation history or environment) to find and investigate the causal drivers of observed galaxy properties. The visualization is lightweight enough to be run from any laptop via a webserver, and it provides functionality to interactively inspect and select subsamples of data for local analysis.
Design and user interaction
A Hyperspherical Variational Autoencoder (S-VAE) is trained to reconstruct the structure of galaxies in 2D (i.e. images) or 3D (i.e. point clouds) using all the galaxies in the user-defined sample (with the option to use the full galaxy sample). The S-VAE learns a latent representation of the galaxies on a 2D spherical surface using the von Mises-Fisher distribution.
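To make the role of the von Mises-Fisher distribution concrete, the following minimal sketch draws latent points on the 2-sphere around a mean direction. It is an illustration only and not Spherinator's own implementation: the function name `sample_vmf_s2`, its helpers, and the closed-form inverse-CDF sampling (which is valid for the 2-sphere specifically) are assumptions introduced here.

```python
import math
import random


def _cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def sample_vmf_s2(mu, kappa, rng):
    """Draw one unit vector from a von Mises-Fisher distribution on the 2-sphere.

    mu    : mean direction, a unit 3-vector
    kappa : concentration (> 0); larger values cluster samples around mu
    rng   : a random.Random instance, passed in for reproducibility
    (Hypothetical helper for illustration.)
    """
    # Cosine of the angle to mu: invert the CDF of exp(kappa * w) on [-1, 1].
    u = rng.random()
    arg = max(u + (1.0 - u) * math.exp(-2.0 * kappa), 1e-300)
    w = max(-1.0, min(1.0, 1.0 + math.log(arg) / kappa))

    # Orthonormal basis (e1, e2) of the plane tangent to the sphere at mu.
    helper = (1.0, 0.0, 0.0) if abs(mu[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = _normalize(_cross(mu, helper))
    e2 = _cross(mu, e1)

    # Uniform azimuth around mu, then assemble the point on the sphere.
    phi = 2.0 * math.pi * rng.random()
    s = math.sqrt(1.0 - w * w)
    return tuple(
        w * m + s * (math.cos(phi) * a + math.sin(phi) * b)
        for m, a, b in zip(mu, e1, e2)
    )


if __name__ == "__main__":
    rng = random.Random(42)
    # Three latent samples scattered around the "north pole" of the sphere.
    for _ in range(3):
        print(sample_vmf_s2((0.0, 0.0, 1.0), 20.0, rng))
```

In a variational autoencoder of this kind, the encoder would typically predict a mean direction and a concentration per galaxy; the sketch only shows what sampling from such a distribution looks like.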
The training data consist of 2D or 3D projections of the simulated galaxies in the selected sample. The user first selects the galaxy sample using global galaxy properties (e.g. stellar mass, location, etc.), then chooses the fields to be extracted and the dimensionality of the projections (2/3D or 6D phase space). A preprocessing routine extracts the particle/mesh data from the simulation snapshots directly from the cloud, using the metadata stored in the preexisting halo catalogs. The same routine then prepares the sample of galaxy projections for model training.
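The selection and projection steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual preprocessing routine: the catalog layout (dictionaries with `galaxy_id` and `stellar_mass` keys) and both function names are hypothetical, and the particles are assumed to be already centred on the galaxy and rotated so that the line of sight is the z axis.

```python
def select_sample(catalog, min_stellar_mass, max_stellar_mass):
    """Return the IDs of catalog entries with stellar mass in [min, max).

    catalog: list of dicts with hypothetical keys "galaxy_id" and "stellar_mass".
    """
    return [
        row["galaxy_id"]
        for row in catalog
        if min_stellar_mass <= row["stellar_mass"] < max_stellar_mass
    ]


def project_to_image(positions, masses, half_width, n_pix):
    """Bin particle masses onto an n_pix x n_pix grid in the x-y plane.

    positions  : list of (x, y, z) offsets from the galaxy centre
    masses     : list of particle masses, same length as positions
    half_width : half the side length of the field of view
                 (same length units as positions)
    """
    image = [[0.0] * n_pix for _ in range(n_pix)]
    pixel_size = 2.0 * half_width / n_pix
    for (x, y, _z), mass in zip(positions, masses):
        col = int((x + half_width) // pixel_size)
        row = int((y + half_width) // pixel_size)
        # Particles outside the field of view are dropped.
        if 0 <= row < n_pix and 0 <= col < n_pix:
            image[row][col] += mass
    return image


if __name__ == "__main__":
    catalog = [
        {"galaxy_id": 1, "stellar_mass": 3.0e9},
        {"galaxy_id": 2, "stellar_mass": 5.0e10},
        {"galaxy_id": 3, "stellar_mass": 2.0e11},
    ]
    print(select_sample(catalog, 1.0e10, 1.0e11))
    print(project_to_image([(0.1, 0.1, 0.0)], [1.0], half_width=1.0, n_pix=4))
```

A production routine would more likely operate on arrays and stream the particle data from the snapshot files; plain lists are used here only to keep the sketch dependency-free.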
Once the S-VAE reaches a satisfactory reconstruction performance (i.e. 5-10% reconstruction RMSE), Spherinator produces an interactive representation of the distribution of the simulated galaxies in a specific component (e.g. stellar/gas density/metallicity/DM, etc.) on the 2D spherical latent space. The representation is built from hierarchical HiPS tilings and displayed with the Aladin Lite tool (deployed on a webserver).
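As an illustration of the reconstruction criterion, the sketch below computes an RMSE expressed as a fraction of the input's value range. The text does not specify how the percentage is normalized, so the range normalization, like the function name `relative_rmse`, is an assumption made for this example.

```python
import math


def relative_rmse(original, reconstruction):
    """RMSE between two equal-length pixel sequences, divided by the
    value range of the original (assumed normalization)."""
    if not original or len(original) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    value_range = max(original) - min(original)
    if value_range == 0:
        raise ValueError("original has zero value range")
    return math.sqrt(mse) / value_range


if __name__ == "__main__":
    pixels = [0.0, 1.0, 2.0, 3.0, 4.0]
    recon = [0.2, 1.2, 2.2, 3.2, 4.2]
    # A result between 0.05 and 0.10 would fall in the 5-10% band.
    print(relative_rmse(pixels, recon))
```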
The bounded latent representation groups galaxies with similar morphologies together while covering the entire sphere.
The Aladin Lite interface allows interactive visualization of the hierarchical projection for both the input data and the reconstruction at each point on the sphere, using mouse gestures to rotate the sphere and zoom in on a specific region.
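The hierarchy behind this zooming follows from the HiPS tiling, in which the tiles at order k are the HEALPix cells at that order, giving 12 x 4^k equal-area tiles on the sphere. The sketch below computes the tile count and approximate tile size per order. The helper `min_order_for`, which picks the smallest order with at least one tile per galaxy, is a hypothetical layout rule for illustration and not necessarily how Spherinator chooses its orders.

```python
import math


def hips_tile_count(order):
    """Number of HiPS tiles (HEALPix cells) covering the sphere at a given order."""
    return 12 * 4 ** order


def hips_tile_size_deg(order):
    """Approximate angular side length of one tile, in degrees.

    Tiles have equal area, so the side is the square root of the sphere's
    solid angle (4 pi steradians) divided by the number of tiles.
    """
    return math.degrees(math.sqrt(4.0 * math.pi / hips_tile_count(order)))


def min_order_for(n_objects):
    """Smallest order that offers at least one tile per object (assumed rule)."""
    order = 0
    while hips_tile_count(order) < n_objects:
        order += 1
    return order


if __name__ == "__main__":
    for k in range(4):
        print(k, hips_tile_count(k), round(hips_tile_size_deg(k), 2))
```

Each step up in order quadruples the number of tiles, which is what lets the viewer load only the tiles needed for the current zoom level.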
The GUI also provides interactive examination of each object in the latent space via colored markers that can be turned on/off as layers. The marker for a specific galaxy can be selected with the mouse. A small pop-up window then shows the object's metadata, including the simulation name, the snapshot number, the unique galaxy ID, and the latent coordinates, together with a preview thumbnail of the original image/object. Clicking on the thumbnail opens the image or 3D visualization of the object in a separate window. This allows the user to examine the galaxy and rotate it in 3D to understand the meaning of each feature in the latent space. The pop-up window also provides a link to the API used to access the full data for each object. The user can select one object or a 2D region of the sphere using the mouse, and Spherinator will download the full particle data for the selected objects from the raw data via a webserver API.
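The region selection described above can be sketched as a cone search on the latent sphere. This example covers only the selection step, for a circular region, and assumes each object carries latent coordinates as a longitude/latitude pair in degrees; the function names and the record layout are hypothetical, and the subsequent data download is not shown.

```python
import math


def angular_separation_deg(lon1, lat1, lon2, lat2):
    """Great-circle separation of two points on the sphere (haversine formula).

    All inputs and the result are in degrees.
    """
    l1, b1, l2, b2 = (math.radians(v) for v in (lon1, lat1, lon2, lat2))
    h = (
        math.sin((b2 - b1) / 2.0) ** 2
        + math.cos(b1) * math.cos(b2) * math.sin((l2 - l1) / 2.0) ** 2
    )
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def select_in_cone(objects, center_lon, center_lat, radius_deg):
    """Return IDs of objects whose latent position lies within radius_deg
    of the chosen centre.

    objects: list of (galaxy_id, lon_deg, lat_deg) tuples (assumed layout).
    """
    return [
        galaxy_id
        for galaxy_id, lon, lat in objects
        if angular_separation_deg(center_lon, center_lat, lon, lat) <= radius_deg
    ]


if __name__ == "__main__":
    latent_positions = [(101, 10.0, 5.0), (102, 12.0, 6.0), (103, 200.0, -40.0)]
    print(select_in_cone(latent_positions, 11.0, 5.5, 3.0))
```

The returned IDs are what a client would then pass to the data-access API to fetch the full particle data.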
Spherinator is trained on the publicly available IllustrisTNG simulation data to demonstrate its functionality. The autoencoder is trained on postage-stamp face-on 2D projections or 3D point clouds of the ~50k galaxies in the 100 Mpc and 50 Mpc boxes.
The user can then choose additional fields with which to paint the markers for each galaxy on the latent projection (e.g. global galaxy properties such as stellar mass, metallicity, and age, as well as environmental metrics), in order to find correlations with the structure representation and to generate science questions.
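Painting a marker amounts to mapping a scalar property onto a color scale. The sketch below shows one simple way to do this, using a logarithmic stretch and a linear blue-to-red ramp; the function name, the choice of ramp, and the hex color output are assumptions for illustration rather than the scheme used by the actual interface.

```python
import math


def property_to_hex(value, vmin, vmax, log_scale=True):
    """Map a scalar property to a hex color between blue (vmin) and red (vmax).

    Values outside [vmin, vmax] are clamped to the ends of the ramp.
    With log_scale=True all three arguments must be positive.
    """
    if log_scale:
        value, vmin, vmax = (math.log10(v) for v in (value, vmin, vmax))
    t = (value - vmin) / (vmax - vmin)
    t = max(0.0, min(1.0, t))
    red = round(255 * t)
    blue = round(255 * (1.0 - t))
    return f"#{red:02x}00{blue:02x}"


if __name__ == "__main__":
    # Stellar masses (in solar masses) spanning the chosen color range.
    for mass in (1.0e9, 3.0e10, 1.0e12):
        print(mass, property_to_hex(mass, 1.0e9, 1.0e12))
```

A logarithmic stretch is a natural default for quantities such as stellar mass that span several orders of magnitude; properties with a narrow range can be passed with `log_scale=False`.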