DiamondLightSource / httomo

High-throughput tomography pipeline
https://diamondlightsource.github.io/httomo/
BSD 3-Clause "New" or "Revised" License

CUDA kernels compiled by cupy are cached in home dir -> watch out for home dir quota at DLS #243

Open yousefmoazzam opened 3 months ago

yousefmoazzam commented 3 months ago

CUDA kernels defined and compiled with cupy (such as ElementwiseKernel) are cached in the user's home directory, see https://docs.cupy.dev/en/stable/reference/generated/cupy.ElementwiseKernel.html

There is an 8GB limit on users' home directories at DLS, which is very easy to reach. Running httomo locally has at times produced a "disk quota exceeded" error, because the compiled kernels were being written to my home directory while it was already at the quota limit.

I'm not 100% sure, but it may be possible that a SLURM job running httomo as any given user would put the cached kernels into the user's home directory (maybe SLURM overrides stuff like this somehow?).

There apparently exists an environment variable, CUPY_CACHE_DIR, which cupy uses to decide where the compiled kernels are cached. If it is indeed the case that a SLURM job running for some user X places the cached kernels in user X's home directory, it might be useful to choose a directory which is less likely to be subject to disk quota errors.
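A minimal sketch of how the cache location could be redirected from Python, assuming the variable is set before cupy is imported (to be safe). The helper name `configure_cupy_cache` is hypothetical and is not part of cupy or httomo; only the CUPY_CACHE_DIR variable itself comes from the discussion above.

```python
import os
from pathlib import Path


def configure_cupy_cache(cache_dir):
    """Point CUPY_CACHE_DIR at cache_dir unless it is already set.

    Hypothetical helper: call it before `import cupy` so the setting
    is in place when kernels are first compiled.
    """
    existing = os.environ.get("CUPY_CACHE_DIR")
    if existing:
        # Respect an explicit choice made by the user or the job script.
        return existing
    cache_path = Path(cache_dir).expanduser()
    # Create the directory up front so a permissions problem shows up here
    # rather than in the middle of a reconstruction.
    cache_path.mkdir(parents=True, exist_ok=True)
    os.environ["CUPY_CACHE_DIR"] = str(cache_path)
    return str(cache_path)
```

The same effect can be had without any Python by exporting CUPY_CACHE_DIR in the job script before httomo starts.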

dkazanc commented 2 months ago

Thanks for noticing this, I was somehow thinking that the installed conda environment would be used for caching compiled kernels! So according to the docs quoted below, we need to set this path somehow so that it is accessible to everyone running httomo at Diamond, but I guess that could be HTTomo's installed location?

_The compiled code is also cached in the directory ${HOME}/.cupy/kernel_cache (the path can be overwritten by setting the CUPY_CACHE_DIR environment variable). This allows reusing the compiled kernel binary across the process._

yousefmoazzam commented 2 months ago

I'm not 100% sure on what I say next, it's got a bit of guesswork in there, so take it with a grain of salt.

My guess is that, because the SLURM job will run as the user, the user needs write permissions for wherever the kernels are written. HTTomo will be installed in /dls_sw/apps, and I'm not sure all DLS staff have write permissions to that directory? I.e., if we try to write the compiled kernels to httomo's installation location, I'm wondering if some users will get write errors when cupy attempts to compile and store kernels?
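A quick way to test the permissions concern for a candidate directory is sketched below. The function name `can_write_cache` is hypothetical. Note that `os.access` only reports what the OS believes about the current user, so on network filesystems the definitive test is still to try writing a file.

```python
import os


def can_write_cache(directory):
    """Return True if the current user appears able to create files in directory."""
    # W_OK: may create entries in the directory; X_OK: may traverse into it.
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)
```

Running `can_write_cache("/dls_sw/apps")` as an ordinary user would show whether the installation location is even an option.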

Another reason that httomo's installation location may not be ideal for storing compiled kernels is that we get almost routine emails nowadays about /dls_sw being at almost full capacity. The compiled kernels are big enough to cause me some issues with the home directory quota; they may not be big enough to have a significant impact on the capacity of /dls_sw, but it may be better to favour a more cautious approach and not write compiled kernel binaries to a filesystem that is operation-critical and is often very full.

One suggestion I have is to use temp directories: they do get cleaned up routinely, so some benefit of the caching will be lost every 30 days or whatever the cleanup period is, but it would be less dangerous in terms of running out of space, and also less dangerous in terms of causing issues that affect things outside of httomo.
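A rough sketch of the temp directory idea, assuming a per-user subdirectory under the system temp location. The directory naming (`httomo-<user>/cupy_kernel_cache`) and the function name are made up for illustration and are not anything httomo does today.

```python
import getpass
import tempfile
from pathlib import Path


def per_user_temp_cache(user=None):
    """Create and return a per-user kernel cache directory under the temp dir."""
    user = user or getpass.getuser()
    cache_path = Path(tempfile.gettempdir()) / f"httomo-{user}" / "cupy_kernel_cache"
    # parents=True creates the intermediate directory with default permissions;
    # the 0o700 mode (subject to umask) applies to the final directory only.
    cache_path.mkdir(mode=0o700, parents=True, exist_ok=True)
    return cache_path
```

The returned path could then be exported as CUPY_CACHE_DIR before cupy is imported. Keeping one subdirectory per user avoids users needing write access to each other's files, at the cost of each user compiling the kernels once after every cleanup.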

dkazanc commented 2 months ago

Fair enough, thanks for the reminder about permissions! That won't work of course. Temp could be a solution indeed. I guess we'd rather keep compiled kernels globally for everyone to access. Btw I checked the size of my folder of compiled kernels and it is only 10MB, they are not supposed to be large, are they?
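For anyone wanting to repeat the size check, a small sketch that totals the file sizes under a directory. The function name is hypothetical; the default cache path `~/.cupy/kernel_cache` is the one quoted from the docs earlier in this thread.

```python
from pathlib import Path


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    root = Path(root).expanduser()
    if not root.is_dir():
        # Nothing has been cached yet (or the path is wrong).
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
```

For example, `directory_size_bytes("~/.cupy/kernel_cache") / 1e6` gives the cache size in MB.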

yousefmoazzam commented 2 months ago

That's a good point about checking their sizes, I didn't do that. 10MB is pretty negligible (I was clearly on the very edge when I was getting quota errors... :sweat_smile:).

Anywhere globally accessible to users that doesn't require special write permissions sounds like a good first try.