facebookincubator / submitit

Python 3.8+ toolbox for submitting jobs to Slurm
MIT License
1.25k stars, 120 forks

Submitit jobs die with no error on cluster with SLURM 19.05 #1762

Open mihdalal opened 6 months ago

mihdalal commented 6 months ago

I have been dealing with a strange submitit problem that I cannot explain: every job I launch through submitit dies after 7-10 hours without any error. This only happens on our cluster running Slurm 19.05. On a different cluster running Slurm 20.11, the same jobs run fine for their entire allotted time. Are there specific Slurm settings that submitit needs in order to work? Is submitit incompatible with Slurm 19.05? Note that the problem is specific to jobs launched through submitit: on the affected cluster I can launch sbatch jobs manually just fine, and srun also works.

Here is a minimal reproducible example:

launch_script:

import submitit

slurm_additional_parameters = {
    "partition": "russ_reserved",
    "time": "3-00:00:00",
    "gpus": 1,
    "cpus_per_gpu": 20,
    "mem": 62,
}

def test():
    # busy-loop forever so the job only ends when Slurm stops it
    while True:
        pass

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="test_cluster_log")
# set the job name, then pass all Slurm parameters through to the executor
slurm_additional_parameters["job_name"] = "test_cluster"
executor.update_parameters(slurm_additional_parameters=slurm_additional_parameters)
job = executor.submit(test)  # submit test() as a Slurm job
print(job.job_id)  # ID of your job
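
Since manually launched sbatch jobs work on this cluster, it may help to compare the flags used there with the ones this script requests. The sketch below is only an approximation: it assumes each key in slurm_additional_parameters ends up as an `#SBATCH --key=value` directive with underscores turned into dashes. The helper `as_sbatch_line` is hypothetical and is not submitit's API; the authoritative version is the generated submission script in the log folder.

```python
import shlex

# Parameters copied from the launch script above (job_name included).
params = {
    "partition": "russ_reserved",
    "time": "3-00:00:00",
    "gpus": 1,
    "cpus_per_gpu": 20,
    "mem": 62,
    "job_name": "test_cluster",
}


def as_sbatch_line(key, value):
    """Hypothetical helper: render one entry as an #SBATCH directive.

    Assumption: underscores in the key become dashes and the value is
    shell-quoted. This mirrors the usual sbatch flag spelling only.
    """
    flag = key.replace("_", "-")
    if value is True:
        return f"#SBATCH --{flag}"
    return f"#SBATCH --{flag}={shlex.quote(str(value))}"


sbatch_lines = [as_sbatch_line(k, v) for k, v in sorted(params.items())]
print("\n".join(sbatch_lines))
```

One thing this makes visible: sbatch reads a bare `--mem` number as megabytes, so `"mem": 62` would request 62 MB rather than 62 GB. Whether that is related to the failure is not established here.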

output:

slurmstepd: error: *** STEP 250338.0 ON matrix-2-1 CANCELLED AT 2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** JOB 250338 ON matrix-2-1 CANCELLED AT 2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
submitit WARNING (2024-02-10 03:41:37,635) - Bypassing signal SIGCONT
submitit WARNING (2024-02-10 03:41:37,636) - Bypassing signal SIGTERM
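
The slurmstepd lines above give the cancellation reason as NODE FAILURE and point to the slurmctld log, so the controller side is probably the place to look first. For triage across many log files, the reason can be extracted programmatically. This is a minimal sketch that assumes the message format shown above; `parse_cancellations` is a hypothetical helper, not part of submitit or Slurm.

```python
import re

# The two slurmstepd lines copied from the output above.
log = (
    "slurmstepd: error: *** STEP 250338.0 ON matrix-2-1 CANCELLED AT "
    "2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***\n"
    "slurmstepd: error: *** JOB 250338 ON matrix-2-1 CANCELLED AT "
    "2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***\n"
)

# Assumed format: "*** JOB|STEP <id> ON <node> CANCELLED AT <time> DUE TO <reason>"
CANCEL_RE = re.compile(
    r"\*\*\* (?P<kind>JOB|STEP) (?P<id>[\d.]+) ON (?P<node>\S+) "
    r"CANCELLED AT (?P<time>\S+) DUE TO (?P<reason>[^,*]+)"
)


def parse_cancellations(text):
    """Return one dict per cancellation message found in the text."""
    events = []
    for match in CANCEL_RE.finditer(text):
        event = match.groupdict()
        event["reason"] = event["reason"].strip()
        events.append(event)
    return events


events = parse_cancellations(log)
for event in events:
    print(event["kind"], event["id"], event["node"], event["time"], event["reason"])
```

Grouping the extracted `node` and `time` fields across failed jobs would show whether the failures cluster on particular nodes or at a particular interval, which is the kind of detail the slurmctld log can then confirm.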

submitit version: 1.5.1

hannaribaspeeters commented 2 months ago

Hi, did you manage to solve this issue? I am encountering the same problem.