facebookincubator / submitit

Python 3.8+ toolbox for submitting jobs to Slurm

`tasks_per_node=1` does not keep the number of tasks to 1 for the `LocalExecutor` #1685

Open ihowell opened 2 years ago

ihowell commented 2 years ago

The expected behavior of this parameter when using the LocalExecutor (or, in my case, the AutoExecutor on a non-Slurm node) would be to keep the number of spawned processes to 1. I use executor.batch() to perform a delayed batch submission, which then spawns a process for each job and quickly overwhelms my computer.

The issue seems to be that a controller process is spawned per job: https://github.com/facebookincubator/submitit/blob/main/submitit/local/local.py#L163. Each controller process immediately starts running instead of first checking whether the number of running controllers is below the allowed number of tasks.
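
For reference, a minimal sketch of the setup that triggers this (the payload function and log folder are placeholders; I'm on a machine without Slurm, so the AutoExecutor falls back to the LocalExecutor):

import submitit

def myfunc(x):  # placeholder payload
    return x * x

executor = submitit.AutoExecutor(folder="log_submitit")  # no Slurm available -> LocalExecutor
executor.update_parameters(tasks_per_node=1)

jobs = []
with executor.batch():  # delayed batch submission
    for arg in range(100):
        jobs.append(executor.submit(myfunc, arg))
# all 100 jobs start running locally right away, despite tasks_per_node=1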

chirico85 commented 2 years ago

Hi @ihowell, I have the same issue. Did you find any solution? Regards

Edit: I was running the tasks from within another repo, assuming it would pass the right parameters. However, running the tasks as explained in the examples solved my problem:

jobs = []
with executor.batch():
    for arg in whatever:
        job = executor.submit(myfunc, arg)
        jobs.append(job)
gwenzek commented 2 years ago

In general the LocalExecutor has fewer features than the SlurmExecutor, and indeed if you start 100 jobs using the LocalExecutor they will all run at once, regardless of the hardware requested or the hardware available on your machine. In short, we haven't implemented a queue for the LocalExecutor. This is a major footgun, but also not something easily fixable; I will need to think about how to implement this. For example, I feel we would want to spawn the subprocess as soon as possible so we can return a process id that serves as the job id, while still making sure the jobs actually start one after the other.

Personally I often use the DebugExecutor, which runs exactly one job at a time in the current process.
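
Something like this (folder and payload are placeholders; depending on your version, the DebugExecutor may need to be imported from submitit.local.debug instead of the top-level package):

import submitit

def myfunc(x):  # placeholder payload
    return x * x

executor = submitit.DebugExecutor(folder="log_debug")
job = executor.submit(myfunc, 3)
print(job.result())  # runs in the current process, one job at a time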

ihowell commented 2 years ago

Thanks for the tip. I would, however, like to be able to run, say, 4 jobs at once (the number of cores on my machine). Maybe we could use the multiprocessing library instead of subprocesses? I believe that would let us use a semaphore to cap concurrency while still returning a job construct with a process id.
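
Very roughly, the kind of thing I have in mind (a standalone sketch, not submitit code; all names are made up): each worker process is spawned immediately so its pid is available, but the payload only runs once a semaphore slot frees up.

import multiprocessing as mp
import os

def gated_worker(semaphore, fn, *args):
    # the process exists (and has a pid) right away, but the payload
    # waits for a free semaphore slot before running
    with semaphore:
        fn(*args)

def my_task(i):  # hypothetical payload
    print(f"task {i} running in pid {os.getpid()}")

if __name__ == "__main__":
    max_parallel = 4  # e.g. the number of cores
    semaphore = mp.BoundedSemaphore(max_parallel)
    procs = []
    for i in range(100):
        p = mp.Process(target=gated_worker, args=(semaphore, my_task, i))
        p.start()  # p.pid is available immediately
        procs.append(p)
    for p in procs:
        p.join()

With max_parallel = 1 this would match what I expected from tasks_per_node=1.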

alirezakazemipour commented 1 year ago

Hi! Are there any updates on this? Is it solved? The same thing happens for me when using the Slurm launcher in Hydra on clusters!