Open pcm32 opened 1 year ago
Indeed, https://github.com/galaxyproject/galaxy/blob/8249b430520ac3d301bed9a58e19d5c1a6448fa1/lib/galaxy/objectstore/s3.py#L235 only orders files by last-use date and deletes from oldest to newest until the target size is reached. I would suggest that this should check which files are being used by current jobs and protect those.
I think we're actually going to remove the cache manager and suggest external cleanup, which can be constrained to terminal jobs.
Describe the bug
In constrained cases where the object store cache size is not big enough, you can see a job proceeding even though some of the objects it requested are not present in the cache. This could happen because the job requires too many objects, or because the cache is simply too small to accommodate them. I haven't checked, but I wonder whether the object store cache maintainer logic (which keeps the cache within the defined size) actually checks if it can safely delete a specific object from the cache to free space, given the set of active jobs across different handlers (I guess this could be complicated with multiple handlers, but either the message queues or the database must have this information).
It leads to tool errors like this:
Maybe we should have a soft limit and a hard limit for the object store cache size, so that the soft limit can be exceeded under specific circumstances to let a job go through, but if the hard limit is exceeded, the job errors out with a descriptive message that explains the situation.
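The soft/hard limit suggestion could be sketched as a small admission policy like this (hypothetical function and parameter names, not an existing Galaxy API):

```python
def cache_admission(cache_usage, incoming_size, soft_limit, hard_limit):
    """Hypothetical soft/hard limit policy for the object store cache.

    Returns one of:
      "ok"       - projected usage fits under the soft limit
      "overflow" - soft limit exceeded but still under the hard limit;
                   let the job go through anyway
      "reject"   - hard limit would be exceeded; fail the job with a
                   descriptive error instead of silently evicting files
                   that other jobs may still need
    """
    projected = cache_usage + incoming_size
    if projected <= soft_limit:
        return "ok"
    if projected <= hard_limit:
        return "overflow"
    return "reject"
```

The point of the two thresholds is that a temporary overshoot is tolerated, while a clearly impossible request fails loudly rather than producing the confusing missing-file tool errors above.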
Galaxy Version and/or server at which you observed the bug
Galaxy Version: 22.05
Commit: 36f80978e1b9743f413491a51f47c21ba522c6ed
To Reproduce Steps to reproduce the behavior:
Expected behavior
When freeing object store cache space, the cleanup should check whether the files being deleted are needed by active jobs, or alternatively make sure that the required files are in place right before job execution.