NVIDIA / NeMo-Run

A tool to configure, launch and manage your machine learning experiments.
Apache License 2.0

Allow running multiple nemo run tasks in parallel with DockerExecutor #57

Open Kipok opened 2 months ago

Kipok commented 2 months ago

By this I mean running multiple isolated scripts at the same time, e.g. from a shell:

python script1.py &
python script2.py &
...
wait

Currently, when I try this, I get an error like the one below (the container name nemo-run-0 is already in use):

───────────────────────────────────────────────────────────────────── Entering Experiment llm-math-judge with id: llm-math-judge_1726789456 ──────────────────────────────────────────────────────────────────────
[16:44:16] Launching task nemo-run for experiment llm-math-judge                                                                                                                                 experiment.py:601
[16:44:21] Error running task nemo-run: 409 Client Error for http+docker://localhost/v1.46/containers/create?name=nemo-run-0: Conflict ("Conflict. The container name "/nemo-run-0" is already   experiment.py:622
           in use by container "7591568f4b184e6134be9b92f4434c06242ca96d86654346854feb627028686a". You have to remove (or rename) that container to be able to reuse that name.")                                 
           Traceback (most recent call last):                                                                                                                                                    experiment.py:623
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/client.py", line 275, in _raise_for_status                                                                      
               response.raise_for_status()                                                                                                                                                                        
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status                                                                        
               raise HTTPError(http_error_msg, response=self)                                                                                                                                                     
            requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.46/containers/create?name=nemo-run-0                                                                    

           The above exception was the direct cause of the following exception:                                                                                                                                   

            Traceback (most recent call last):                                                                                                                                                                    
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/experiment.py", line 616, in run                                                                              
               job.launch(wait=wait, runner=self._runner)                                                                                                                                                         
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/job.py", line 340, in launch                                                                                  
               handle, status = launch(                                                                                                                                                                           
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/torchx_backend/launcher.py", line 99, in launch                                                               
               app_handle = runner.run(                                                                                                                                                                           
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/torchx_backend/runner.py", line 87, in run                                                                    
               handle = self.schedule(dryrun_info)                                                                                                                                                                
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/torchx_backend/runner.py", line 102, in schedule                                                              
               app_id = sched.schedule(dryrun_info)                                                                                                                                                               
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/torchx_backend/schedulers/docker.py", line 109, in schedule                                                   
               req.run(client=client)                                                                                                                                                                             
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/core/execution/docker.py", line 328, in run                                                                       
               container_details.append(container.run(client=client, id=self.id))                                                                                                                                 
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/core/execution/docker.py", line 269, in run                                                                       
               return client.containers.run(                                                                                                                                                                      
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/models/containers.py", line 876, in run                                                                             
               container = self.create(image=image, command=command,                                                                                                                                              
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/models/containers.py", line 935, in create                                                                          
               resp = self.client.api.create_container(**create_kwargs)                                                                                                                                           
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/container.py", line 440, in create_container                                                                    
               return self.create_container_from_config(config, name, platform)                                                                                                                                   
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/container.py", line 457, in create_container_from_config                                                        
               return self._result(res, True)                                                                                                                                                                     
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/client.py", line 281, in _result                                                                                
               self._raise_for_status(response)                                                                                                                                                                   
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/client.py", line 277, in _raise_for_status                                                                      
               raise create_api_error_from_http_exception(e) from e                                                                                                                                               
              File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/errors.py", line 39, in create_api_error_from_http_exception                                                        
               raise cls(e, response=response, explanation=explanation) from e                                                                                                                                    
            docker.errors.APIError: 409 Client Error for http+docker://localhost/v1.46/containers/create?name=nemo-run-0: Conflict ("Conflict. The container name "/nemo-run-0" is already in                     
           use by container "7591568f4b184e6134be9b92f4434c06242ca96d86654346854feb627028686a". You have to remove (or rename) that container to be able to reuse that name.")
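
The traceback suggests the container name ("nemo-run-0") is built only from the task name and an index, so two scripts launched at the same time ask Docker for the same name. One possible direction is to add a per-launch random suffix to the name. The sketch below is only an illustration of that idea: `unique_container_name` is a hypothetical helper, not part of NeMo-Run or the Docker SDK, and where such a name would be plugged into DockerExecutor is an assumption left open here.

```python
import uuid


def unique_container_name(base: str, index: int) -> str:
    """Hypothetical helper: build a container name with a random suffix.

    `base` and `index` mirror the "nemo-run-0" pattern seen in the error;
    the 8-character hex suffix makes collisions between concurrent
    launches very unlikely.
    """
    suffix = uuid.uuid4().hex[:8]
    return f"{base}-{index}-{suffix}"


if __name__ == "__main__":
    # Two independent launches of the same task get different names.
    print(unique_container_name("nemo-run", 0))
    print(unique_container_name("nemo-run", 0))
```

Including the experiment id (e.g. llm-math-judge_1726789456 in the log above) in the name would be another option, though two scripts started within the same second might still collide if the id is timestamp-based.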