access-ci-org / Jetstream_Cluster

Scripts and Ansible Playbooks for building an HPC-style resource in Jetstream
MIT License
19 stars, 16 forks

Problem: Submitting Slurm test job fails due to missing files & commands #18

Closed: julianpistorius closed this issue 4 months ago

julianpistorius commented 4 months ago

Follow up from #16 and #17.

I just tested #17 by launching a Slurm cluster from Exosphere on Jetstream2, and there seems to be a problem:

$ sbatch slurm_test.job
Submitted batch job 1
$ cat nodes_1.out
environment: line 17: /usr/share/lmod/lmod/libexec/lmod: No such file or directory
environment: line 17: /usr/share/lmod/lmod/libexec/lmod: No such file or directory
/tmp/slurmd/job00001/slurm_script: line 8: mpirun: command not found

zacharygraber commented 4 months ago

Continuing conversation here, rather than in the closed PR.

This is unexpected. I set up a cluster before issuing the PR and ran this same sample job with no issues.

If mpirun isn't found, it sounds like the compute node base image didn't actually get set up properly. If it's still around, could you share the entire local_create.log?
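One way to check the two things the job output complained about is sketched below. It is a minimal sketch, not part of this repository: the function name `check_node_env` is hypothetical, the lmod path is simply the one from the error message above, and it assumes the function is run on the compute node itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repo): report whether the two pieces the
# failed job needed are present on the node where this runs.
check_node_env() {
    # Default path is taken from the error message in the job output.
    local lmod_path="${1:-/usr/share/lmod/lmod/libexec/lmod}"
    local missing=0

    if [ -e "$lmod_path" ]; then
        echo "ok: $lmod_path exists"
    else
        echo "missing: $lmod_path"
        missing=1
    fi

    if command -v mpirun >/dev/null 2>&1; then
        echo "ok: mpirun found at $(command -v mpirun)"
    else
        echo "missing: mpirun is not on PATH"
        missing=1
    fi

    # Non-zero exit status if anything was missing.
    return "$missing"
}
```

Calling `check_node_env` with no arguments uses the path from the error message. A non-zero exit status would be consistent with a compute image that was not fully set up, though it does not by itself say why.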

I'm quickly trying again to see if I get the same result.

julianpistorius commented 4 months ago

Sorry, I already deleted the instance. :disappointed:

julianpistorius commented 4 months ago

> This is unexpected. I set up a cluster before issuing the PR and ran this same sample job with no issues.

Same. So I was also surprised. Hopefully it's a fluke?

zacharygraber commented 4 months ago

Looks like a fluke? I just created 2 clusters back-to-back, one stock Exosphere push-button cluster (so pulling this repo) and one using my fork, and both of them worked with no problem.

I hate to call it a "can't reproduce," but I can't seem to.

julianpistorius commented 4 months ago

Weird. I'll have another go. Thank you for trying.

julianpistorius commented 4 months ago

Yup. Now it works. :man_shrugging: Apologies for the false alarm.