ComputeCanada / software-stack-config


fix: add `/localscratch` to `APPTAINER_BIND` #83

Closed wdconinc closed 1 month ago

wdconinc commented 3 months ago

We have been encountering issues with /localscratch (i.e. $SLURM_TMPDIR) not being mounted at startup for jobs that land on cedar from OSG htcondor. Because htcondor's +SingularityImage starts the container as soon as the job begins, we have no way to inject APPTAINER_BIND modifications.

This PR adds /localscratch to the paths that are injected in APPTAINER_BIND for bind mounting.
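
As a rough illustration of what adding a path to APPTAINER_BIND amounts to, here is a minimal Python sketch. It is not the change in this PR. It only assumes that APPTAINER_BIND is a comma-separated list of bind specs of the form `src[:dest[:opts]]`, and the helper name `add_bind_path` is hypothetical.

```python
import os


def add_bind_path(current, path):
    """Return a comma-separated bind list that contains `path` as a source once."""
    # Drop empty entries so an unset or empty variable is handled cleanly.
    entries = [entry for entry in current.split(",") if entry]
    # A bind spec may be "src", "src:dest" or "src:dest:opts"; compare on src only.
    sources = [entry.split(":", 1)[0] for entry in entries]
    if path not in sources:
        entries.append(path)
    return ",".join(entries)


if __name__ == "__main__":
    # Hypothetical usage: extend the variable in the current process environment.
    updated = add_bind_path(os.environ.get("APPTAINER_BIND", ""), "/localscratch")
    os.environ["APPTAINER_BIND"] = updated
    print(updated)
```

Comparing on the source part keeps an existing entry such as `/localscratch:/localscratch` from being duplicated.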

bartoldeman commented 3 months ago

Hi @wdconinc, thank you for your contribution. I'll have to double check whether this works on all clusters and on both login nodes and compute nodes, though: the way /localscratch is configured differs a little between them, and sometimes the apptainer module is loaded on a login node but apptainer is executed with those bind settings on the compute node.

That said, there's an alternative: /var/tmp. This directory dynamically bind mounts $SLURM_TMPDIR on all GP clusters (Niagara compute nodes have no local disk), and it is already mounted automatically by apptainer. So if your workflow can simply use /var/tmp instead of $SLURM_TMPDIR, the extra bind mount would not be needed.
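
A job script could apply this suggestion with a small fallback, sketched below in Python. The function name `scratch_dir` is hypothetical, and the sketch assumes only what is described above: that /var/tmp is visible inside the container, while the path in $SLURM_TMPDIR may not be.

```python
import os


def scratch_dir(env=None, fallback="/var/tmp"):
    """Return $SLURM_TMPDIR if it is visible as a directory, else `fallback`."""
    env = os.environ if env is None else env
    candidate = env.get("SLURM_TMPDIR", "")
    # Inside a container the variable can be set while the path is not mounted,
    # so check the filesystem rather than trusting the variable alone.
    if candidate and os.path.isdir(candidate):
        return candidate
    return fallback
```

Checking the directory rather than the variable covers the case in this thread, where the Slurm environment is inherited but /localscratch is missing inside the container.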

wdconinc commented 3 months ago

I think we've now found a way around this. We are submitting jobs to cedar through OSG htcondor glideins that run as one or more slurm jobs. The slurm environment is available to our jobs and was being used in preference to the htcondor environment, in particular for SLURM_TMPDIR and _CONDOR_SCRATCH. We have now modified our jobs to prefer the htcondor tempdirs over the slurm tempdirs, and that gets us unstuck.
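
For anyone hitting the same problem, the preference order could look roughly like the Python sketch below. This is an illustration and not our actual job wrapper. The function name `pick_tempdir` is hypothetical, and the sketch assumes `_CONDOR_SCRATCH_DIR`, the name HTCondor documents for the job scratch directory, where the comment above writes `_CONDOR_SCRATCH`.

```python
import os


def pick_tempdir(env=None, preferred=("_CONDOR_SCRATCH_DIR", "SLURM_TMPDIR")):
    """Return the value of the first non-empty variable in `preferred`, or None."""
    env = os.environ if env is None else env
    for name in preferred:
        value = env.get(name)
        if value:
            return value
    # No known scratch variable is set; the caller decides what to do next.
    return None
```

With both variables set, the htcondor directory wins. Passing `preferred` in the opposite order reproduces the earlier behaviour, where the slurm tempdir took precedence.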