OSC / ondemand

Supercomputing. Seamlessly. Open, Interactive HPC Via the Web
https://openondemand.org/
MIT License

Myjobs app is not starting the job in the job's directory #3800

Open cipharius opened 1 week ago

cipharius commented 1 week ago

I've been troubleshooting this for a while, thinking something was wrong with my configuration, but it seems that when myjobs ResourceMgrAdapter queues the job, it doesn't pass the working directory to the job system adapter.

Up until this point, the information about script's directory is retained: https://github.com/OSC/ondemand/blob/master/apps/myjobs/app/models/resource_mgr_adapter.rb#L37-L46

Once submit is invoked on the job adapter, the working directory information is lost: the job does not run in the script's directory, and that directory does not appear in the job's environment.

If I add workdir: Dir.pwd to the Script.new argument list, the jobs are run in the script's directory instead of the user's home directory.
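To illustrate the idea, here is a minimal sketch of carrying the working directory on the script object instead of relying on the submitting process's CWD. The Script struct and both helper names are hypothetical stand-ins for illustration, not the actual ood_core or myjobs code. I'm also assuming the adapter turns the workdir into sbatch's --chdir option, which may not match how the real Slurm adapter builds its arguments.

```ruby
# Hypothetical stand-in for the job script object (NOT the real ood_core class).
Script = Struct.new(:content, :workdir, keyword_init: true)

# Hypothetical helper: build the script while inside the job's directory and
# record that directory explicitly, so it survives a hop to another host.
def build_script(content, script_dir)
  Dir.chdir(script_dir) do
    Script.new(content: content, workdir: Dir.pwd)
  end
end

# Hypothetical helper: an adapter that knows the workdir can hand it to the
# scheduler directly (assumed here to be sbatch's --chdir option).
def sbatch_args(script)
  args = ['sbatch']
  args.push('--chdir', script.workdir.to_s) unless script.workdir.nil?
  args
end
```

With the directory stored on the script, the submission no longer depends on where the submitting shell happens to start.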

I wasn't sure whether this is the right place to fix the issue, so I'm opening an issue instead of a PR.

johrstrom commented 1 week ago

Luckily, I've already been through this on discourse. I'm assuming you use a submit_host or some wrapper to submit jobs somewhere other than the OOD VM?

This is happening because we do chdir into the right directory while submitting the job, but since you're SSHing somewhere else to issue the job submission command, the CWD is HOME.
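The effect can be reproduced without SSH. This is only a rough sketch: the function names are hypothetical, and the chdir: option merely simulates a remote login shell starting in its own directory. A child process spawned locally inherits the directory set by Dir.chdir, while a process that starts somewhere else never sees it.

```ruby
require 'open3'

# A locally spawned command inherits the CWD set by Dir.chdir.
def cwd_seen_by_local_submit(script_dir)
  Dir.chdir(script_dir) do
    out, _status = Open3.capture2('pwd')
    out.strip
  end
end

# Simulated remote submit: the command starts in `home`, as a shell reached
# over SSH would, no matter where the parent process chdir'd to.
def cwd_seen_by_remote_like_submit(script_dir, home)
  Dir.chdir(script_dir) do
    out, _status = Open3.capture2('pwd', chdir: home)
    out.strip
  end
end
```

In the second case the Dir.chdir still happens, but it has no influence on the process that actually issues the submission command.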

https://discourse.openondemand.org/t/simple-question-execute-python-code-on-a-lsf-submit-host/3560/23

I was not sure if this is the correct place to fix this issue so instead of PR I'm opening an issue instead.

Either is fine by me! Even in debugging that discourse topic, it didn't occur to me to just specify the workdir instead of relying on Dir.chdir.

PRs welcome!

cipharius commented 1 week ago

Thanks for the quick reply!

Yeah, that is correct, I am running the jobs on a remote machine that shares the same home directory subtree. Open OnDemand is supposed to be a frontend to a Slurm HPC cluster that was previously accessed via CLI only.

In that case I'll open a PR for the explicit workdir parameter. It wouldn't affect those who run batch jobs from the OOD machine itself, though it would change behaviour for those who have gotten used to jobs running under the home directory on a remote node.

I was mostly surprised that this is how it has worked all this time and that I couldn't find any posts complaining about this specific issue.