OSC / ondemand

Supercomputing. Seamlessly. Open, Interactive HPC Via the Web
https://openondemand.org/
MIT License

job composer divergent environment #1330

Open sbrozell opened 3 years ago

sbrozell commented 3 years ago

Jobs from the job composer do not seem to have the same environment as jobs submitted from the command line in an ssh session. The two most recent tickets are INC0357151 and INC0356547. There is also an asana task: https://app.asana.com/0/1166442278779601/1200350097179233/f

┆Issue is synchronized with this Asana task by Unito

treydock commented 3 years ago

That's because jobs submitted from the web interface are submitted from a web node at OSC, while jobs submitted over SSH are submitted from a login node. The two will not and cannot have the same submit environment. For example, web nodes do not have Lmod installed and do not need it, because Lmod is needed to run jobs, not to submit them.

sbrozell commented 3 years ago

All jobs, whatever their submission source, should have the same runtime environment. User jobs coming from ondemand do not have the same runtime environment and are failing.

treydock commented 3 years ago

Ah yes, the runtime environment should be the same as long as the job is submitted with --export=NONE in SLURM. The environment at submit time will differ, but the SLURM job startup environment will be the same. The only way to guarantee the same environment with SLURM is to ignore the submit environment, i.e. use --export=NONE. The default behavior with SLURM is to take the submit environment and apply it to the job environment, i.e. --export=ALL.
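
To make the distinction concrete, here is a toy Python model of the two export modes. It is only a sketch of the behavior described above, not how SLURM is implemented; the function name, the variable names, and the paths are all made up for illustration.

```python
def job_start_environment(submit_env, login_env, export):
    """Toy model of the environment a batch job starts with.

    submit_env: environment of the process that ran sbatch (web or login node)
    login_env:  the user's fresh login environment on the cluster (assumed)
    export:     value of the --export option, "ALL" or "NONE"
    """
    if export == "ALL":
        # The submit-time environment is copied into the job.
        return dict(submit_env)
    if export == "NONE":
        # The submit-time environment is ignored, apart from SLURM_* variables;
        # the job starts from the user's own login environment instead.
        slurm_vars = {k: v for k, v in submit_env.items() if k.startswith("SLURM_")}
        return {**login_env, **slurm_vars}
    raise ValueError(f"unsupported export mode in this sketch: {export}")


# Hypothetical environments: the web node has no Lmod, the login node does.
web_node_env = {"PATH": "/usr/bin"}
login_node_env = {"PATH": "/usr/bin:/opt/lmod/bin", "LMOD_CMD": "/opt/lmod/lmod"}
fresh_login_env = dict(login_node_env)

for mode in ("ALL", "NONE"):
    from_web = job_start_environment(web_node_env, fresh_login_env, mode)
    from_ssh = job_start_environment(login_node_env, fresh_login_env, mode)
    print(mode, "same job environment:", from_web == from_ssh)
```

In this model the two submission paths only agree under --export=NONE, which is the point being made above.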

ZQyou commented 2 years ago

Any update on this issue? I'd like to close some old incidents.

sbrozell commented 1 year ago

FWIW, there is a recent ticket with divergent ondemand and ssh batch job behavior: INC0369463.

johrstrom commented 1 year ago

Thanks for reminding me of this ticket. We patched 2.0 with the copy_environment flag. Users can check that flag, and we will then submit the job with --export=ALL. Within their job they should then be able to use srun or similar effectively.
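
Roughly, the flag decides which export mode ends up on the sbatch command line. The Python below is a minimal sketch of that mapping under the assumption that an unchecked flag means --export=NONE; it is not the actual Open OnDemand code, and build_sbatch_command and job.sh are hypothetical names.

```python
import shlex


def build_sbatch_command(script_path, copy_environment=False):
    """Return an sbatch argument list for the given script (illustrative only)."""
    # Checked flag: copy the submit environment into the job.
    # Unchecked flag: ignore the submit environment (assumed default here).
    export_mode = "ALL" if copy_environment else "NONE"
    return ["sbatch", f"--export={export_mode}", script_path]


# Show the command line each setting would produce.
print(shlex.join(build_sbatch_command("job.sh", copy_environment=True)))
print(shlex.join(build_sbatch_command("job.sh", copy_environment=False)))
```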

sbrozell commented 1 year ago

I have requested that the user try that and report the results.

sbrozell commented 1 year ago

I am resolving INC0369463 since it's been a month w/o user response.

sbrozell commented 1 year ago

I am resolving INC0357151 since the user had workarounds.