galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

S3 based object store client: sometimes jobs proceed even when not all required files are available in the cache #15282

Open pcm32 opened 1 year ago

pcm32 commented 1 year ago

Describe the bug

In constrained setups, where the object store cache size may not be big enough, a job can proceed even though some of the objects it requested are not present in the cache. This may happen because the job requires too many objects, or because the cache is simply too small to accommodate them all. I haven't checked, but I wonder whether the object store cache maintainer logic (which keeps the cache within the defined size) actually checks whether it is safe to delete a specific object from the cache to free space, given the set of active jobs across the different handlers (I guess this could be complicated with multiple handlers, but I'm guessing that either the message queues or the database must have this information).

It leads to tool errors like this:

Traceback (most recent call last):
  File "/galaxy/server/database/jobs_directory/000/923/configs/tmp2jqfmorw", line 23, in <module>
    ad_s = sc.read('embedding_source_0.h5')
  File "/usr/local/lib/python3.9/site-packages/scanpy/readwrite.py", line 112, in read
    return _read(
  File "/usr/local/lib/python3.9/site-packages/scanpy/readwrite.py", line 713, in _read
    return read_h5ad(filename, backed=backed)
  File "/usr/local/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 408, in read_h5ad
    with h5py.File(filename, "r") as f:
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/files.py", line 406, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = 'embedding_source_0.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Maybe we should have a soft limit and a hard limit for the object store cache size, so that the soft limit can be exceeded under specific circumstances to let a job go through, but if the hard limit is reached, the job errors out with a descriptive message that explains the situation.
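To sketch the idea (names and thresholds here are hypothetical, not Galaxy's actual object store API):

import os

# Hypothetical two-tier thresholds: the soft limit may be exceeded
# temporarily so an already-staging job can finish pulling its inputs;
# the hard limit may not.
SOFT_LIMIT_BYTES = 80 * 1024**3   # 80 GiB
HARD_LIMIT_BYTES = 100 * 1024**3  # 100 GiB


def cache_size(cache_dir: str) -> int:
    """Total size in bytes of all files currently in the cache."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def can_stage_file(cache_dir: str, incoming_bytes: int, job_is_staging: bool) -> bool:
    """Decide whether another file may be pulled into the cache.

    Below the soft limit: always allow. Between the limits: allow only
    for jobs already mid-staging, so they can complete. Above the hard
    limit: refuse, and the caller should fail the job with a descriptive
    error instead of running it with missing inputs.
    """
    projected = cache_size(cache_dir) + incoming_bytes
    if projected <= SOFT_LIMIT_BYTES:
        return True
    if projected <= HARD_LIMIT_BYTES and job_is_staging:
        return True
    return False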

Galaxy Version and/or server at which you observed the bug

Galaxy Version: 22.05
Commit: 36f80978e1b9743f413491a51f47c21ba522c6ed

To Reproduce

Steps to reproduce the behavior:

  1. Set the object store cache size to a low value (provided your instance is using a remote object store such as S3, rather than plain local disk).
  2. Run a job that uses a large collection of large files as input.
  3. Watch the logs: the cache downloads files and then starts freeing space, possibly at the expense of the very inputs the job still needs.

Expected behavior

When freeing object store cache space, Galaxy should check whether the files being deleted are needed by active jobs, or alternatively make sure that all required files are in place right before job execution.
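As a rough sketch of the second option, a pre-execution guard could look something like this (helper and exception names are hypothetical, not Galaxy's real job wrapper API):

import os


class MissingCacheFileError(Exception):
    """Raised when a required input is absent from the object store cache."""


def verify_inputs_cached(cache_dir: str, input_paths: list[str]) -> None:
    """Check, immediately before the job starts, that every required
    input file actually exists in the cache, and fail loudly otherwise
    rather than letting the tool hit a confusing 'No such file or
    directory' error mid-run.
    """
    missing = [p for p in input_paths
               if not os.path.exists(os.path.join(cache_dir, p))]
    if missing:
        raise MissingCacheFileError(
            "Object store cache is missing required inputs "
            f"(possibly evicted to free space): {', '.join(missing)}"
        )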

pcm32 commented 1 year ago

Indeed, https://github.com/galaxyproject/galaxy/blob/8249b430520ac3d301bed9a58e19d5c1a6448fa1/lib/galaxy/objectstore/s3.py#L235 only orders files by last-use date and deletes from oldest to newest until the target size is reached. I would suggest that this should check which files are being used by current jobs and protect those.
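For comparison, a minimal sketch of an eviction loop that protects in-use files (hypothetical code, not the linked implementation, which sorts by last-access time only):

import os


def evict_lru(cache_dir: str, target_bytes: int, pinned: set[str]) -> None:
    """Delete least-recently-used cache files until the cache fits in
    target_bytes, skipping any path that an active job still needs.

    `pinned` would have to be assembled from the inputs/outputs of
    non-terminal jobs (e.g. via the database or message queues), which
    is the part the current implementation is missing.
    """
    entries = []
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            stat = os.stat(path)
            entries.append((stat.st_atime, stat.st_size, path))
            total += stat.st_size

    # Oldest access time first, but never touch pinned files.
    for _atime, size, path in sorted(entries):
        if total <= target_bytes:
            break
        if path in pinned:
            continue
        os.remove(path)
        total -= size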

mvdbeek commented 1 year ago

I think we're actually going to remove the cache manager and suggest external cleanup, which can be constrained to terminal jobs.
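Assuming that direction, an external cleanup could start from a cron-driven script along these lines (the path and the access-time grace period are purely illustrative; a real version would query the Galaxy database for non-terminal jobs rather than guessing from access times):

import os
import time

CACHE_DIR = "/galaxy/database/object_store_cache"  # adjust to your instance
TARGET_BYTES = 100 * 1024**3                       # keep cache under 100 GiB
GRACE_SECONDS = 6 * 3600                           # spare files touched in the last 6 h


def main() -> None:
    """Trim the cache directory down to TARGET_BYTES from outside Galaxy,
    using a recent-access grace period as a crude stand-in for
    'belongs to a non-terminal job'."""
    now = time.time()
    entries = []
    total = 0
    for root, _dirs, files in os.walk(CACHE_DIR):
        for name in files:
            path = os.path.join(root, name)
            stat = os.stat(path)
            entries.append((stat.st_atime, stat.st_size, path))
            total += stat.st_size

    for atime, size, path in sorted(entries):  # least recently used first
        if total <= TARGET_BYTES:
            break
        if now - atime < GRACE_SECONDS:
            continue
        os.remove(path)
        total -= size


if __name__ == "__main__":
    main()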