Open dnelson86 opened 6 months ago
Has been added to www.tng-project.org/data/obs/
import matplotlib.pyplot as plt
import numpy as np
import scida
path = '/virgotng/mpia/obs/SDSS/'
filename = 'sdss-dr17-spectra.hdf5'
ds = scida.load(path + filename)
fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(14,8))
classes = {0:'GALAXY', 1:'STAR', 2:'QSO'}
for cl, label in classes.items():
    w = np.where(ds['class'] == cl)[0]
    inds = np.argsort(ds['z'][w])
    im2d = ds['flux'][w[inds],:]
    im = axes[cl].imshow(im2d)
    fig.colorbar(im, label=ds['flux'].units)
plt.savefig("sdss-dr17-spectra.png", dpi=150)
This is very slow; not much happens even after a while. What is a more efficient approach?
Let me give it a try, but I would already have the following suggestions: prefer `da` (`dask.array as da`) over `np` where possible. `im2d` does not appear to be a particularly small array (~1/3 * 4864154 * 4700 * 4 bytes ~ 30 GB each). Fixing this will require you to pick one of two routes: either keeping the operations lazy in dask arrays, or computing small arrays into memory (e.g. `a = ds["a"].compute()`) and performing all operations in numpy, only using scida for reading these arrays in. The latter appears feasible, as the intermediate data products have a similar size as the final data product, thus not requiring any distributed memory setup.

Can you convert the code into a more efficient version that downsamples the array to e.g. ~1000x1000 pixels (plenty) for the imshow?
Here's an example to get you started:
import numpy as np
import matplotlib.pyplot as plt
import scida

path = '/virgotng/mpia/obs/SDSS/'
filename = 'sdss-dr17-spectra.hdf5'
ds = scida.load(path + filename, units=False)

classes = {0:'GALAXY', 1:'STAR', 2:'QSO'}

def average_bins(array, bin_size=1000):
    """Average bins in the first axis by a given bin_size."""
    a, b = array.shape
    remainder = a % bin_size
    # If there is a remainder, pad the array
    if remainder != 0:
        pad_size = bin_size - remainder
        padded_array = np.pad(array, ((0, pad_size), (0, 0)), mode='constant', constant_values=0)
    else:
        padded_array = array
    # Now, reshape and compute the mean along the new axis
    reshaped_array = padded_array.reshape(-1, bin_size, b)
    averaged_array = reshaped_array.mean(axis=1)
    return averaged_array

ims = []
for cl, label in classes.items():
    # class and z are small arrays, so we just load them into memory as numpy arrays...
    w = np.where(np.array(ds['class']) == cl)[0]
    z = np.array(ds['z'])[w]
    inds = np.argsort(z)  # ... and argsort is not properly supported by dask anyway
    flux = ds['flux'].rechunk((-1, 100))  # important to rechunk in first dimension where we use indices
    im2d = flux[w[inds], :]
    # reduce size by averaging 10000 spectra each
    bin_size = 10000
    im2d = average_bins(im2d, bin_size)
    im2d = im2d.compute()
    ims.append(im2d)

fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(14, 8))
for i, (ax, im2d) in enumerate(zip(axes, ims)):
    im = ax.imshow(im2d, origin="lower", aspect="auto")
    ax.set_xlabel("spectral direction")
    if i == 0:
        ax.set_ylabel("redshift direction")
    else:
        ax.set_yticks([])
fig.colorbar(im, label="flux")
plt.savefig("sdss-dr17-spectra.png", dpi=150)
See code comments for details of relevant changes. This was run in ~1 min in a distributed environment with a vera/p.large node on 8 cores.
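The `average_bins` helper can be sanity-checked on a tiny numpy array. The sketch below is a self-contained copy of the helper with made-up toy data; note that the zero-padding slightly biases the mean of the final bin whenever the length is not a multiple of `bin_size`.

```python
import numpy as np

# Self-contained copy of the bin-averaging helper for a quick check.
def average_bins(array, bin_size=1000):
    a, b = array.shape
    remainder = a % bin_size
    if remainder != 0:
        # Zero-padding: this biases the mean of the final (partial) bin.
        array = np.pad(array, ((0, bin_size - remainder), (0, 0)),
                       mode='constant', constant_values=0)
    return array.reshape(-1, bin_size, b).mean(axis=1)

# 6 rows averaged in bins of 3 -> 2 output rows.
x = np.arange(12, dtype=float).reshape(6, 2)
out = average_bins(x, bin_size=3)
print(out)  # [[2. 3.]
            #  [8. 9.]]
```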
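The rechunk-before-fancy-indexing step generalizes beyond this dataset. A minimal standalone sketch with random data (the array sizes here are illustrative, not the real flux dimensions):

```python
import numpy as np
import dask.array as da

# Toy stand-in for the flux array: 10000 "spectra" x 100 "wavelengths",
# chunked along the first axis as a file-backed dataset typically would be.
x = da.random.random((10000, 100), chunks=(1000, 100))

# A permutation of row indices, analogous to the np.argsort result above.
idx = np.random.permutation(10000)

# Fancy indexing along axis 0 crosses chunk boundaries; merging axis 0 into
# a single chunk first (and splitting axis 1 instead) avoids shuffling rows
# between many small chunks during the index step.
x2 = x.rechunk((-1, 10))
sub = x2[idx, :]
result = sub.mean(axis=0).compute()
print(sub.shape, result.shape)
```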
There is a new obs data file available:
/virgotng/mpia/obs/SDSS/sdss-dr17-spectra.hdf5
As with the GAIA file, the `description` attributes contain unit metadata.

Can you make a code snippet to use scida to: create a 3-panel plot, each panel a 2D image of `spec` (color is flux), for `z` (ordered) vs `wave`. The three panels separated by `class` (0 = galaxy, 1 = star, 2 = qso). Can start an "obs cookbook".
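Reading such `description` attributes is straightforward with plain h5py. The sketch below mocks a tiny HDF5 file with an invented unit string, since the actual attribute contents of the release file are not reproduced here; only the attribute name follows the stated convention.

```python
import os
import tempfile

import h5py
import numpy as np

# Mock a tiny HDF5 file whose dataset carries a 'description' attribute,
# mirroring the stated convention; the unit string is a made-up example.
path = os.path.join(tempfile.mkdtemp(), 'mock-spectra.hdf5')
with h5py.File(path, 'w') as f:
    d = f.create_dataset('flux', data=np.zeros((4, 3)))
    d.attrs['description'] = 'flux density [example units]'

# Read the attribute back, as one would for the real file.
with h5py.File(path, 'r') as f:
    desc = f['flux'].attrs['description']
print(desc)
```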
Nice addition to the docs gallery.