cnr-ibf-pa / hbp-bsp-issues

Ticketing system for developers/testers and power users of the Brain Simulation Platform of the Human Brain Project

Unicore launch inconsistent with user launch (PizDaint) #553

Closed antonelepfl closed 3 years ago

antonelepfl commented 3 years ago

I launched a job with this config:

{
  "Name":"only module load",
  "Executable":"/bin/sh input.sh",
  "haveClientStageIn":"true",
  "Resources":{
    "CPUs":3,
    "Runtime":1815,
    "NodeConstraints":"mc"
  }
}

where input.sh is:

#!/bin/bash -l
. /etc/profile
env | grep SLURM_NPROCS
module purge
module load PrgEnv-intel
module load daint-mc cray-python/3.8.2.1 PyExtensions/python3-CrayGNU-20.08
module use /apps/hbp/ich002/hbp-spack-deployments/softwares/15-09-2020/install/modules/tcl/cray-cnl7-haswell

But I get:

ModuleCmd_Load.c(244):ERROR:105: Unable to locate a modulefile for 'daint-mc'
cray-python/3.8.2.1(45):ERROR:105: Unable to locate a modulefile for 'PyExtensions/python3-CrayGNU-20.08'

But if I log in to the machine, allocate some resources, and run it like this:

salloc --partition=normal -n 4 --constraint=mc -t 60
sh input.sh

I don't get those errors and it works fine.

For example in this folder: /scratch/snx3000/unicore/FILESPACE/9a2ed0d2-6d74-4c72-87d3-26801e073094

BerndSchuller commented 3 years ago

hi Stefano,

not really a Piz Daint expert, but please have a look at the file bsssubmit* in the working directory folder /scratch/snx3000/unicore/FILESPACE/9a2ed0d2-6d74-4c72-87d3-26801e073094

If this looks OK, a full output of "env" in both cases (UNICORE vs ssh&salloc) might help to track down why the UNICORE launch is not doing as expected.
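A minimal sketch of that comparison, assuming the environment is dumped to a file in each case (for example with `env | sort > env_unicore.txt` inside the UNICORE-submitted script and `env | sort > env_salloc.txt` in the interactive session). The file names and the helper `diff_env` are made up for illustration and are not part of UNICORE or Slurm:

```shell
# diff_env FILE_A FILE_B
# Prints the lines that differ between two environment dumps:
#   "< ..." lines appear only in FILE_A, "> ..." lines only in FILE_B.
# Note: the comparison is line-based, so variables whose values contain
# newlines (e.g. exported shell functions) will show up as noise.
diff_env() {
    a_sorted=$(mktemp) || return 1
    b_sorted=$(mktemp) || return 1
    # Sort both dumps so the result does not depend on the original order.
    sort "$1" > "$a_sorted"
    sort "$2" > "$b_sorted"
    # Keep only the changed lines, dropping diff's "2c2" / "---" markers.
    diff "$a_sorted" "$b_sorted" | grep '^[<>]' || true
    rm -f "$a_sorted" "$b_sorted"
}
```

Running `diff_env env_unicore.txt env_salloc.txt` would then list only the variables that differ between the two launches; `MODULEPATH` is one to look at first, since it decides which modulefiles can be found.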

antonelepfl commented 3 years ago

Sam from prgenv-rt@cscs.ch replied:

note that after salloc, you still need to submit your script with srun in order to have it executed on the compute nodes. I don't think however that your script will fail when submitted that way on the compute nodes. I think the error you see is rather related to the way unicore submits jobs, and that this leads to the situation that it cannot access Daint's environment properly. I will come back to you as soon as I have more insights.

So they are working on it

antonelepfl commented 3 years ago

A new update on this:

it looks to me like unicore erases for some reason the default content of the MODULEPATH environment variable instead of prepending to it. Thus, the modules daint-mc and PyExtensions cannot be found. Can you print the MODULEPATH variable when submitting a job from unicore and report what you get?

To which I replied:

echo $MODULEPATH outputs:

/opt/cray/ari/modulefiles:/opt/cray/pe/craype/default/modulefiles:/opt/cray/pe/modulefiles:/opt/cray/modulefiles:/opt/modulefiles
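To illustrate the difference between erasing and prepending that CSCS describes, here is a small sketch. Both helpers (`prepend_modulepath`, `find_modulefile`) are hypothetical names written for this example; they are not UNICORE or Environment Modules commands, and the sketch makes no claim about which directory was actually missing on Piz Daint:

```shell
# prepend_modulepath DIR
# Puts DIR in front of MODULEPATH while keeping the existing entries
# (similar in spirit to what "module use DIR" does). A plain
# MODULEPATH=DIR assignment would instead discard the defaults.
prepend_modulepath() {
    case ":${MODULEPATH:-}:" in
        *":$1:"*) ;;  # DIR is already listed, nothing to do
        *) MODULEPATH="$1${MODULEPATH:+:$MODULEPATH}" ;;
    esac
    export MODULEPATH
}

# find_modulefile NAME
# Prints every directory on MODULEPATH that contains an entry called NAME
# (a modulefile, or a directory of versions such as cray-python/).
# No output means "module load NAME" has nowhere to find it.
find_modulefile() {
    old_ifs=$IFS
    IFS=:
    for dir in ${MODULEPATH:-}; do
        if [ -e "$dir/$1" ]; then
            printf '%s\n' "$dir"
        fi
    done
    IFS=$old_ifs
}
```

Calling `find_modulefile daint-mc` inside the UNICORE job and again in an interactive session would show whether the directory providing that modulefile is on the search path in each case.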

clupascu commented 3 years ago

I have the same issue. When I submit with sbatch bsssubmit* everything works perfectly though.

antonelepfl commented 3 years ago

@clupascu could you please check if the modules are loaded correctly now?

clupascu commented 3 years ago

@antonelepfl It seems everything works perfectly now. I think we can close this issue. Do you know what caused the problem? Just so we remember it for the future...

antonelepfl commented 3 years ago

Thank you Carmen for checking! I'm not sure what the solution was. I asked Fabio and I'll post the details if he provides them.

antonelepfl commented 3 years ago

Fabio replied with the changes that he made: