facebookincubator / submitit

Python 3.8+ toolbox for submitting jobs to Slurm
MIT License
1.3k stars 125 forks

Keep original tmp slurm submission file as a hidden symlink #1771

Closed xman1979 closed 2 months ago

xman1979 commented 2 months ago

Why make this change?

If we run "scontrol show job", the Command field points to the temporary submission file, which has already been removed, e.g.:

(jepa) [xiaodongma@rsccpu4035 xiaodongma]$ scontrol show job 4499193
JobId=4499203 JobName=xiaodongma
  ...
   Command=/home/xiaodongma/jepa-internal/xiaodongma/submission_file_e9c4eef46a24436b81d5213875f19d6c.sh
...

This can cause confusion in the Slurm ecosystem and makes it hard to integrate with other tooling that relies on parsing or post-inspecting the sbatch script.
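
To illustrate the kind of tooling affected, here is a minimal sketch of a check that reads the `Command=` field from `scontrol show job` output and tests whether the script is still on disk. The helper names `command_path` and `script_is_inspectable` are hypothetical and not part of submitit or Slurm, and the parsing assumes the path contains no whitespace.

```python
import re
from pathlib import Path
from typing import Optional


def command_path(scontrol_output: str) -> Optional[Path]:
    """Return the path from the Command= field, or None if it is absent.

    Assumption: the path contains no whitespace.
    """
    match = re.search(r"\bCommand=(\S+)", scontrol_output)
    return Path(match.group(1)) if match else None


def script_is_inspectable(scontrol_output: str) -> bool:
    """True only if the recorded submission script still exists on disk."""
    path = command_path(scontrol_output)
    return path is not None and path.is_file()
```

With the behavior described above, such a check would fail for submitit jobs because the recorded file is gone by the time the job is inspected.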

Fix

This diff creates the temporary submission file as a symlink to the moved submission file.
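
The idea can be sketched roughly as follows. This is not the actual diff: `move_and_link` is a hypothetical helper, and the file names in the demo are made up for illustration. It moves the submitted script to its final location and leaves a symlink at the original path, so the path Slurm recorded keeps resolving.

```python
import shutil
import tempfile
from pathlib import Path


def move_and_link(tmp_file: Path, final_file: Path) -> Path:
    """Move tmp_file to final_file and leave a symlink at tmp_file's old path."""
    final_file.parent.mkdir(parents=True, exist_ok=True)
    # shutil.move also works when source and destination are on different filesystems.
    shutil.move(str(tmp_file), str(final_file))
    # Point the original path at the absolute location of the moved script.
    tmp_file.symlink_to(final_file.resolve())
    return tmp_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Hypothetical names: a hidden temporary script and a per-job destination.
        tmp = Path(folder) / ".submission_file_demo.sh"
        tmp.write_text("#!/bin/bash\n")
        link = move_and_link(tmp, Path(folder) / "job_1" / "1_submission.sh")
        print(link, "->", link.resolve())
```

Using a dot-prefixed name for the temporary file keeps the leftover symlink hidden from a plain directory listing.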

Test

After the fix, the Command path still resolves to the submission file:

scontrol show job 4499203
JobId=4499203 JobName=xiaodongma
... Command=/checkpoint/amaia/video/xiaodongma/vjepav3/arch/vjepav1/vit.l.16.m8/.submission_file_bb581d4ec3954cd9a45aa7388ad6494e.sh
...
(jepa) xiaodongma@xiaodongma-login-0:/checkpoint/amaia/video/xiaodongma/vjepav3/arch/vjepav1/vit.l.16.m8$ ll /checkpoint/amaia/video/xiaodongma/vjepav3/arch/vjepav1/vit.l.16.m8/.submission_file_bb581d4ec3954cd9a45aa7388ad6494e.sh
lrwxrwxrwx 1 xiaodongma fair_amaia_cw_video 101 Sep 17 17:48 /checkpoint/amaia/video/xiaodongma/vjepav3/arch/vjepav1/vit.l.16.m8/.submission_file_bb581d4ec3954cd9a45aa7388ad6494e.sh -> /checkpoint/amaia/video/xiaodongma/vjepav3/arch/vjepav1/vit.l.16.m8/job_1358522/1358522_submission.sh

jrapin commented 2 months ago

I'd rather the submission file be hidden as you had initially proposed, to avoid messing (too much) with the folder.