desihub / fastspecfit

Fast spectral synthesis and emission-line fitting of DESI spectra.
https://fastspecfit.readthedocs.org
BSD 3-Clause "New" or "Revised" License

how to handle the large number of QA files for public release of the fastspecfit VACs #106

Open moustakas opened 1 year ago

moustakas commented 1 year ago

@sbailey has argued that generating QA for every single DESI target in each public VAC is probably not sustainable (18M targets in Iron alone), so we need to get creative about how the QA can be generated and rendered in the public web app (https://fastspecfit.desi.lbl.gov). @dstndstn has proposed using the Spin disks themselves to host the files (as a tarball?), but we'll need to look into the details and make sure we get NERSC on board to help us come up with a (hopefully long-term) solution.

sbailey commented 1 year ago

FYI, it is possible to embed PNG data directly into HTML without requiring a separate PNG file to exist on disk. So, e.g., you could keep the PNG data as N>>1 blobs in a format optimized for random access (e.g., HDF5) and embed them into HTML generated dynamically by fastspecfit.desi.lbl.gov. Or store N>>1 HTML "files" as blobs in that format, if even the HTML part can be pre-generated.
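A minimal sketch of that idea, using `sqlite3` from the standard library as the random-access blob store (an HDF5 file via `h5py` would work the same way); the table name, target ID, and 1x1 PNG are placeholders, not anything from the actual VAC:

```python
import base64
import sqlite3

# Keep many small PNG blobs in one random-access container, then embed
# one directly into dynamically generated HTML as a data URI, so no
# standalone .png file ever has to exist on disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE qa (targetid TEXT PRIMARY KEY, png BLOB)")

# A valid 1x1 transparent PNG stands in for a real QA figure.
fake_png = bytes.fromhex(
    "89504e470d0a1a0a0000000d49484452000000010000000108060000001f15c489"
    "0000000a49444154789c63000100000500010d0a2db40000000049454e44ae426082"
)
db.execute("INSERT INTO qa VALUES (?, ?)", ("39628433029860119", fake_png))

def render_qa_html(targetid: str) -> str:
    """Fetch one PNG blob and inline it in HTML via a base64 data URI."""
    (png,) = db.execute(
        "SELECT png FROM qa WHERE targetid = ?", (targetid,)
    ).fetchone()
    b64 = base64.b64encode(png).decode("ascii")
    return f'<img src="data:image/png;base64,{b64}" alt="QA {targetid}">'

html = render_qa_html("39628433029860119")
print(html[:40])
```

The browser decodes the data URI itself, so the web app only ever does one indexed blob lookup per page.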

Alternatively, IIRC @dstndstn had a clever trick for creating a disk image that appears as a single file to NERSC but could be mounted by a docker instance to see the N>>1 files within that disk image.

Completely separate from fastspecfit itself, it would be useful to work out an example recipe for the generic problem of serving O(millions) of pre-generated "files" without actually creating millions of files on disk. That is, generating them from scratch is too slow to do on the fly, but there are too many of them to keep on disk individually, so what's the most efficient way to cache and serve them for random access?
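One hedged sketch of such a recipe: pack everything into a single uncompressed tar archive, build a `{name: (offset, size)}` index in one linear pass, and then serve any member with a single seek+read. The names and contents below are toy placeholders:

```python
import io
import tarfile

# Write many small "files" into one uncompressed tar (in memory here;
# a real archive on disk works identically).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for i in range(1000):
        data = f"<html>target {i}</html>".encode()
        info = tarfile.TarInfo(name=f"qa/{i:03d}.html")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# One linear scan builds the random-access index; TarInfo.offset_data
# is the byte offset of each member's payload within the archive.
buf.seek(0)
index = {}
with tarfile.open(fileobj=buf, mode="r:") as tar:
    for member in tar:
        index[member.name] = (member.offset_data, member.size)

def fetch(name: str) -> bytes:
    """Retrieve one member with a direct seek+read, no archive scan."""
    offset, size = index[name]
    buf.seek(offset)
    return buf.read(size)

print(fetch("qa/042.html").decode())
```

The index itself is tiny (two integers per member) and could be persisted alongside the archive, so a web app pays the linear scan only once.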

moustakas commented 1 year ago

For those of you with access to the NERSC Users Slack space, there's a discussion that @dstndstn initiated here: https://nerscusers.slack.com/archives/C01LPA84AGM/p1677776147290869

dstndstn commented 1 year ago

I was also reading about GNU tar's `--seek` option (assume the archive is seekable), which is supposed to allow faster extraction. It doesn't work on compressed tarballs. Maybe worth checking out, though tar is supposed to auto-detect seekability anyway.
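A quick way to try the flag, assuming GNU tar (`--seek`/`-n` is GNU-specific); the demo paths are made up:

```shell
# Build a small uncompressed tarball, then extract a single member with
# the --seek hint, which tells tar the archive is seekable so it can
# skip over members rather than reading every block.
mkdir -p demo/aaa demo/bbb
echo "hello" > demo/aaa/one.txt
echo "world" > demo/bbb/two.txt
tar -cf demo.tar demo

# --seek only helps on uncompressed archives; on .tar.gz it has no effect.
tar -xf demo.tar --seek demo/bbb/two.txt
cat demo/bbb/two.txt
```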

dstndstn commented 1 year ago

So the squashfs disk-image format might be an option too -- there's an `unsquashfs` command, so you could use squashfs much like tar, but with indexing built in. My guess is that you want a directory structure (e.g., aaa/bbb/aaabbb.html) for this to work really well.
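To get that aaa/bbb/ layout, a hypothetical sharding helper (the function name and 3-character chunk width are my own choices, not anything in fastspecfit) that maps flat file names into two-level prefix directories so no single directory accumulates millions of entries:

```python
from pathlib import PurePosixPath

def shard_path(name: str, width: int = 3) -> PurePosixPath:
    """Map a flat name like 'aaabbbccc.html' to 'aaa/bbb/aaabbbccc.html'.

    Hypothetical scheme: the first two width-character chunks of the
    file stem become nested directories inside the squashfs image.
    """
    stem = PurePosixPath(name).stem
    return PurePosixPath(stem[:width]) / stem[width : 2 * width] / name

print(shard_path("28475039628433029860119.html"))
# -> 284/750/28475039628433029860119.html
```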

dstndstn commented 1 year ago

(I mean, just mounting the squashfs image would be much preferable and easier -- let the kernel do the work! -- but that would require some permissions changes from the Spin team, as discussed in the thread you mention above.)

dstndstn commented 1 year ago

The squashfs experiment was very successful. These are timings for a second run of each program, i.e., with the disk cache hot:

```shell
$ time tar xf /pscratch/sd/d/dstn/fastspecfit-fuji-v2.0-html-healpix-sv1-dark.tar --seek healpix/sv1/dark/284/28475/fastspec-sv1-dark-28475-39628433029860119.png

real    0m0.849s
user    0m0.423s
sys     0m0.423s
```

```shell
$ time ./rdsquashfs -u dark/284/28475/fastspec-sv1-dark-28475-39628433029860119.png /pscratch/sd/d/dstn/fastspecfit-fuji-v2.0-html-healpix-sv1-dark.squashfs
creating fastspec-sv1-dark-28475-39628433029860119.png

real    0m0.030s
user    0m0.012s
sys     0m0.014s
```