galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License

No space left on device even though NFS and nodes have disk space #393

Closed (pcm32 closed this 1 year ago)

pcm32 commented 1 year ago

After I resized the NFS disk (and a fresh df -h . from inside the containers shows there is capacity), I keep getting these errors from the container resolver:

galaxy.tool_util.deps.containers INFO 2022-12-05 17:03:17,624 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] Checking with container resolver [ExplicitContainerResolver[]] found description [None]
galaxy.tool_util.deps.containers ERROR 2022-12-05 17:03:18,020 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] Could not get container description for tool 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_read_10x/scanpy_read_10x/1.8.1+2+galaxy0'
Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/tool_util/deps/containers.py", line 320, in find_best_container_description
    resolved_container_description = self.resolve(enabled_container_types, tool_info, **kwds)
  File "/galaxy/server/lib/galaxy/tool_util/deps/containers.py", line 351, in resolve
    container_description = container_resolver.resolve(
  File "/galaxy/server/lib/galaxy/tool_util/deps/container_resolvers/mulled.py", line 557, in resolve
    name = targets_to_mulled_name(
  File "/galaxy/server/lib/galaxy/tool_util/deps/container_resolvers/mulled.py", line 361, in targets_to_mulled_name
    tags = mulled_tags_for(namespace, target.package_name, resolution_cache=resolution_cache, session=session)
  File "/galaxy/server/lib/galaxy/tool_util/deps/mulled/util.py", line 127, in mulled_tags_for
    if not _namespace_has_repo_name(namespace, image, resolution_cache):
  File "/galaxy/server/lib/galaxy/tool_util/deps/mulled/util.py", line 113, in _namespace_has_repo_name
    preferred_resolution_cache[cache_key] = repo_names
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/cache.py", line 374, in __setitem__
    self.put(key, value)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/cache.py", line 317, in put
    self._get_value(key, **kw).set_value(value)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 417, in set_value
    self.namespace.release_write_lock()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 231, in release_write_lock
    self.close(checkcount=True)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 254, in close
    self.do_close()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 685, in do_close
    util.safe_write(self.file, pickled)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/util.py", line 502, in safe_write
    fh.close()
OSError: [Errno 28] No space left on device

I did leave the Galaxy containers running while increasing the disk size, so this might be because Galaxy needs a restart after the disk re-sizing? It is the only piece of code complaining about disk space; everything else seems to work. Mostly leaving this here for reference.
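
For reference, one way to find the filesystem that is actually full is to check every mount from inside the job handler pod rather than just the NFS share; ENOSPC can also mean a filesystem is out of inodes, which df -h will not show (df -i does). A minimal sketch (paths and the threshold are illustrative):

```python
# Run inside the job handler pod: report mounts that are (nearly) full.
import shutil

seen = set()
with open("/proc/mounts") as mounts:
    for line in mounts:
        mount_point = line.split()[1]
        if mount_point in seen:
            continue
        seen.add(mount_point)
        try:
            usage = shutil.disk_usage(mount_point)
        except OSError:
            continue
        free_mb = usage.free // (1024 * 1024)
        total_mb = usage.total // (1024 * 1024)
        if free_mb < 100:  # flag anything with less than ~100 MB free
            print(f"{mount_point}: {free_mb} MB free of {total_mb} MB")
```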

pcm32 commented 1 year ago

...mmm... deleting the job handler pod and getting a new one didn't do the trick; it kept hitting this error. So something is trying to write somewhere that has no space, and it is not the nodes and it is not the shared file system...

pcm32 commented 1 year ago

It seems to be failing when setting a key on the ResolutionCache (from .container_resolvers import ResolutionCache).
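
The traceback bottoms out in beaker's file-backed container, which pickles the cache value and writes it into the cache's data_dir, so the ENOSPC points at whatever filesystem backs that directory rather than the job working directory or the tool data share. A minimal sketch of that code path, assuming a file-backed beaker cache (the paths, cache name and key below are illustrative, not the chart defaults):

```python
from beaker.cache import CacheManager
from beaker.util import parse_cache_config_options

# Illustrative paths: point these at wherever the Galaxy cache actually lives.
cache_opts = {
    "cache.type": "file",
    "cache.data_dir": "/galaxy/server/database/cache/mulled/data",
    "cache.lock_dir": "/galaxy/server/database/cache/mulled/lock",
}
cache = CacheManager(**parse_cache_config_options(cache_opts))
resolution_cache = cache.get_cache("mulled_resolution")

# Mirrors preferred_resolution_cache[cache_key] = repo_names in
# _namespace_has_repo_name: beaker pickles the value and writes it under
# data_dir, which is the write that raises OSError(ENOSPC) above.
resolution_cache.put("namespace_repo_names_biocontainers", ["scanpy-scripts"])
```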

pcm32 commented 1 year ago

Downscaling everything that had the NFS mounted and then scaling back up seems to have fixed it.

pcm32 commented 1 year ago

This was probably because the disk re-sizing was partly done "hot", while the volume was still mounted.