SvenMarcus / hpc-rocket

https://svenmarcus.github.io/hpc-rocket/
MIT License

SlurmError: sbatch slurm.job #51

Open MartinGrignard opened 1 month ago

MartinGrignard commented 1 month ago

Hi :wave:

I'm currently trying to run hpc-rocket to submit a job from my local machine (before integrating it into a GitLab CI/CD pipeline). I created a simple slurm.job that only prints the hostname, to check whether the job runs properly.

Here is my configuration:

host: ...
user: ...
private_keyfile: ...

copy:
  - from: slurm.job
    to: slurm.job
    overwrite: true

clean:
  - slurm.job
  - slurm-hpc-rocket.log

sbatch: slurm.job

When I run the following command:

hpc-rocket launch --watch config.yml

I get the following output:

ℹ Copying files...
✔  Done
❌ SlurmError: sbatch slurm.job
[==  ]

Since there are no additional logs, and the job runs when I submit it manually on the cluster, do you have any idea what the problem could be?

Thanks!

SvenMarcus commented 3 weeks ago

Hi, sorry for the late reply, I just got back from vacation. This error usually happens when hpc-rocket fails to launch the Slurm job at all. Can you show me the contents of the Slurm job's log file?

MartinGrignard commented 1 week ago

Sorry for the delay, I was also away for the last two weeks.

I don't have a log from the job. sacct shows that it was never even submitted.

Could it be because Slurm is actually a module on our cluster, meaning it may not be loaded at the start of the session, depending on the type of session hpc-rocket uses?
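One way to test this hypothesis outside of hpc-rocket is to look up `sbatch` over plain SSH in two kinds of shell. The sketch below is only an illustration: it assumes a local `ssh` client, `bash` on the cluster, and that the module setup happens in login-shell startup files, none of which is confirmed in this thread. The helper names `build_probe_commands` and `probe` are hypothetical, and how hpc-rocket itself opens its session is not established here.

```python
import shlex
import subprocess


def build_probe_commands(user, host, tool="sbatch"):
    """Return two ssh command lines that look up `tool` on the remote host."""
    target = f"{user}@{host}"
    lookup = f"command -v {shlex.quote(tool)}"
    return {
        # Plain remote command: runs in a non-interactive, non-login shell,
        # so login-only startup files are not sourced.
        "non_login": ["ssh", target, lookup],
        # Same lookup inside a bash login shell, which sources the profile
        # files (assumption: this is where the cluster loads its modules).
        "login": ["ssh", target, f"bash -lc {shlex.quote(lookup)}"],
    }


def probe(user, host, tool="sbatch"):
    """Run both lookups and return {name: (exit_code, resolved_path)}."""
    results = {}
    for name, cmd in build_probe_commands(user, host, tool).items():
        done = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = (done.returncode, done.stdout.strip())
    return results
```

If `probe(user, host)` reports a path for `login` but a non-zero exit code for `non_login`, that would be consistent with the module not being loaded in non-login sessions. It would not prove that this is what happens inside hpc-rocket.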

EDIT:

I just tried replacing the command with something else, and every command I tried raises an error. So, for some reason, the call to cmd.wait_until_exit() always returns a non-zero exit code.

Since hpc-rocket manages to copy the files to the remote, it does not look like a connection issue...
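The specific non-zero exit code can narrow things down, because POSIX shells reserve a few values: 127 for "command not found" and 126 for "found but not executable". The following is a minimal local sketch of that mapping, not hpc-rocket code. `explain_exit_code` and `run_and_report` are hypothetical helpers, and the sketch assumes `sh` is available on the machine where it runs.

```python
import subprocess


def explain_exit_code(code):
    """Map a shell exit status to a short human-readable hint."""
    if code == 0:
        return "success"
    if code < 0:
        # Python's subprocess reports death by signal as a negative value.
        return f"terminated by signal {-code}"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found (check PATH or module loading)"
    if code > 128:
        # Shell convention: 128 + signal number.
        return f"terminated by signal {code - 128}"
    return "command ran and reported failure"


def run_and_report(command):
    """Run `command` through sh and return (exit_code, hint, stderr)."""
    done = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
    return done.returncode, explain_exit_code(done.returncode), done.stderr.strip()
```

An exit code of 127 for every command would point toward the environment (PATH or modules) rather than the connection. Any other value suggests the command was found and failed for its own reasons, in which case its stderr is the next thing to look at.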