cbyrohl / scida

scida is an out-of-the-box analysis tool for large scientific datasets. It primarily supports the astrophysics community, focusing on cosmological and galaxy formation simulations using particles or unstructured meshes, as well as large observational datasets. This tool uses dask, allowing analysis to scale.
https://scida.io
MIT License

obs example: SDSS spectra #154

dnelson86 opened 6 months ago

dnelson86 commented 6 months ago

There is a new obs data file available:

/virgotng/mpia/obs/SDSS/sdss-dr17-spectra.hdf5

As with the GAIA file, the description attributes contain unit metadata.

Can you make a code snippet that uses scida to create a 3-panel plot, each panel a 2D image of the spectra (color is flux), for z (ordered) vs wave? The three panels separate by class (0 = galaxy, 1 = star, 2 = qso).

Can start an "obs cookbook".

Nice addition to the docs gallery.

dnelson86 commented 6 months ago

Has been added to www.tng-project.org/data/obs/

dnelson86 commented 6 months ago
import matplotlib.pyplot as plt
import numpy as np
import scida

path = '/virgotng/mpia/obs/SDSS/'
filename = 'sdss-dr17-spectra.hdf5'

ds = scida.load(path + filename)

fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(14,8))

classes = {0:'GALAXY', 1:'STAR', 2:'QSO'}

for cl, label in classes.items():
    # select members of this class and order them by redshift
    w = np.where(ds['class'] == cl)[0]
    inds = np.argsort(ds['z'][w])

    # 2D image: one spectrum per row, rows sorted by redshift
    im2d = ds['flux'][w[inds],:]
    im = axes[cl].imshow(im2d)
    axes[cl].set_title(label)
    fig.colorbar(im, ax=axes[cl], label=ds['flux'].units)

plt.savefig("sdss-dr17-spectra.png", dpi=150)

This is very slow; not much happens even after a while. What is a more efficient approach?

cbyrohl commented 6 months ago

Let me give it a try, but I already have the following suggestions:

  1. Be explicit about the use of dask operations; right now you are using functions such as np.where and np.argsort. When dask arrays are passed to them, the dask wrappers are used automatically where available, but it is hard to see when and whether this dispatch happens. Prefer da over np (import dask.array as da) where possible (see the sketch after this list).
  2. Make sure that regular logging warnings from dask are enabled (i.e. not suppressed); they often point at unintended performance bottlenecks.
  3. im2d does not appear to be a particularly small array (~1/3 × 4864154 × 4700 × 4 bytes ≈ 30 GB each).
  4. Slicing operations in a distributed setup can be very costly. It very much comes down to the chunking in the different dimensions, so make sure to read the dask documentation on slicing.
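
A minimal sketch of points 1 and 4, assuming ds['class'] and ds['flux'] behave as plain dask arrays here (field names taken from the snippet above):

import dask.array as da

flux = ds['flux']
print(flux.chunks)        # inspect the chunking before any slicing
mask = ds['class'] == 0   # lazy elementwise comparison, nothing computed yet
w = da.where(mask)[0]     # explicit dask where; note the result has unknown
                          # chunk sizes, which makes downstream fancy indexing costly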

Fixing this will require you to pick one of two routes:

  1. Identify the performance bottlenecks as laid out above, particularly the chunking relevant for the slicing operation, and only compute a data product much smaller than the ~10 GB array.
  2. If you are really interested in the ~10 GB output being displayed via matplotlib, I would convert all relevant arrays to numpy arrays (e.g. a = ds["a"].compute()), perform all operations in numpy, and use scida only for reading the arrays in (see the sketch below). This appears feasible, as the intermediate data products have a similar size to the final data product, thus not requiring any distributed-memory setup.
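
A minimal sketch of route 2, assuming enough RAM on the analysis node for the full arrays (field names and imports as in the snippet above):

# pull everything into memory once, then stay in plain numpy
cls = ds['class'].compute()
z = ds['z'].compute()
flux = ds['flux'].compute()   # only feasible if the array fits in memory

w = np.where(cls == 0)[0]             # e.g. the GALAXY class
im2d = flux[w[np.argsort(z[w])], :]   # ordinary numpy fancy indexing, no task graphs involved
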
dnelson86 commented 6 months ago

Can you convert the code into a more efficient version that downsamples the array to e.g. ~1000x1000 pixels (plenty for the imshow)?

cbyrohl commented 6 months ago

Here's an example to get you started:

import matplotlib.pyplot as plt
import numpy as np
import scida

path = '/virgotng/mpia/obs/SDSS/'
filename = 'sdss-dr17-spectra.hdf5'

# disable unit handling so fields are plain dask arrays
ds = scida.load(path + filename, units=False)

classes = {0:'GALAXY', 1:'STAR', 2:'QSO'}

def average_bins(array, bin_size=1000):
    """Average bins in the first axis by a given bin_size."""
    a, b = array.shape
    remainder = a % bin_size
    # If there is a remainder, pad the array
    if remainder != 0:
        pad_size = bin_size - remainder
        padded_array = np.pad(array, ((0, pad_size), (0, 0)), mode='constant', constant_values=0)
    else:
        padded_array = array
    # Now, reshape and compute the mean along the new axis
    reshaped_array = padded_array.reshape(-1, bin_size, b)
    averaged_array = reshaped_array.mean(axis=1)

    return averaged_array
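
# shape check for average_bins (numbers from the discussion above): an input of
# shape (4864154, 4700) with bin_size=10000 pads to (4870000, 4700) and
# averages down to (487, 4700)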

ims = []
for cl, label in classes.items():
    # class and z are small arrays, so we just load them into memory as numpy arrays...
    w = np.where(np.array(ds['class']) == cl)[0]
    z = np.array(ds['z'])[w]
    inds = np.argsort(z)  # ... and argsort is not properly supported by dask anyway
    flux = ds['flux'].rechunk((-1, 100))  # important: a single chunk (-1) along the first axis, where we index below
    im2d = flux[w[inds],:]

    # reduce size by averaging 10000 spectra each
    bin_size = 10000
    im2d = average_bins(im2d, bin_size)
    im2d = im2d.compute()
    ims.append(im2d)

fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(14,8))
for i, (ax, im2d) in enumerate(zip(axes, ims)):
    im = ax.imshow(im2d, origin="lower", aspect="auto")
    ax.set_xlabel("spectral direction")
    if i==0:
        ax.set_ylabel("redshift direction")
    else:
        ax.set_yticks([])
fig.colorbar(im, label="flux")
plt.savefig("sdss-dr17-spectra.png", dpi=150)

See the code comments for details of the relevant changes. This ran in ~1 min in a distributed environment on a vera/p.large node with 8 cores.

[attached image: sdss-dr17-spectra.png]