aces / cbrain

CBRAIN is a flexible Ruby on Rails framework for accessing and processing of large data on high-performance computing infrastructures.
GNU General Public License v3.0

Implement workaround for unreliable apptainer boot #1325

Closed by prioux 1 year ago

prioux commented 1 year ago

On some environments, if an apptainer container takes too long to mount its userspace filesystems, the container fails to set up. On ComputeCanada it looks like this:

FATAL:   container creation failed: mount hook function failure:
  mount /proc/self/fd/9->/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/apptainer/1.1.6/var/apptainer/mnt/session/data-images/0
  error: while mounting image /proc/self/fd/9:
  fuse2fs failed to mount /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/apptainer/1.1.6/var/apptainer/mnt/session/data-images/0 in 10s

There's a maximum allowed delay of 10 seconds for the FUSE mount to succeed.

Our script that launches apptainer containers should detect such situations and just try again, maybe multiple times (5 times?)
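
For illustration only, here is a minimal Python sketch of the detect-and-retry idea described above. It is not the CBRAIN implementation (see the linked commit for that). The names `is_boot_failure` and `launch_with_retries`, the 5-second pause, and the exact patterns matched are all assumptions; the patterns are taken from the error text quoted in this issue, and other apptainer versions may word the failure differently.

```python
import re
import subprocess
import time

# Assumption: these substrings, copied from the error quoted in this issue,
# are enough to recognize a failed container boot.
BOOT_FAILURE = re.compile(r"container creation failed|fuse2fs failed to mount")


def is_boot_failure(stderr_text):
    """Return True if stderr looks like the FUSE mount timeout shown above."""
    return BOOT_FAILURE.search(stderr_text) is not None


def launch_with_retries(command, max_attempts=5, pause_seconds=5.0,
                        runner=subprocess.run, sleep=time.sleep):
    """Run `command`, retrying only when the container itself failed to boot.

    Returns (result, attempts_used). A non-zero exit that does not match the
    boot-failure pattern is returned immediately, so a genuine failure of the
    tool inside the container is not retried.
    """
    result = None
    for attempt in range(1, max_attempts + 1):
        # Output is captured so stderr can be inspected; it is not streamed.
        result = runner(command, capture_output=True, text=True)
        if result.returncode == 0 or not is_boot_failure(result.stderr):
            return result, attempt
        if attempt < max_attempts:
            sleep(pause_seconds)  # hypothetical pause before the next try
    return result, max_attempts
```

A call might look like `result, attempts = launch_with_retries(["apptainer", "exec", "image.sif", "true"])`, where `image.sif` is a placeholder image name. Retrying only on the recognized boot error keeps a real task failure from being run five times over.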

prioux commented 1 year ago

Ok I will have to work on this today

prioux commented 1 year ago

Implemented in https://github.com/aces/cbrain/commit/eae3d9333e657daa32a13f2d942c39f9110b7f93