NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Job specific environment variables can't be set in Hydra multi-run #9449

Closed · domenVres closed this 1 month ago

domenVres commented 3 months ago

Is your feature request related to a problem? Please describe.

Hydra offers a way to set environment variables that are specific to each job (see the hydra.job.env_set section of the Hydra documentation). However, when I added this to the configuration for my multi-run hyperparameter search, the environment variable was not changed. I suspect the reason is that NeMo uses a custom launcher through which Hydra executes the jobs, and I did not find these environment variables handled anywhere in it.

Describe the solution you'd like

I would like to be able to set job-specific environment variables in the way the Hydra documentation describes: by adding a hydra.job.env_set field to the Hydra config for multi-runs (screenshot below).

[Screenshot (2024-06-12): Hydra multi-run config with hydra.job.env_set]
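For reference, the pattern from the Hydra docs looks roughly like this (a sketch; the hydra.job.env_set key is from the Hydra documentation, the variable name and value are hypothetical placeholders):

```yaml
# config.yaml (sketch) -- per-job environment variables in a Hydra multi-run
hydra:
  job:
    env_set:
      SOME_VARIABLE: some_value  # hypothetical; set once per launched job
```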

Describe alternatives you've considered

The only alternative I have found that works is to set the environment variables manually inside the training script (each job runs a training script). However, inside these scripts I don't have access to the job ID, so I had to infer it from the hyperparameter values. This is far from ideal; a config-level solution would be much nicer.
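The workaround described above can be sketched as follows: since the script cannot see the job ID, it derives a deterministic port offset from the hyperparameter values themselves. This is only an illustration of the alternative, not NeMo or Hydra API; the function name and hyperparameter values are hypothetical.

```python
import hashlib
import os

def port_for_job(hparams: dict, base: int = 29500, span: int = 1000) -> int:
    """Derive a deterministic port from the hyperparameter values.

    Hypothetical helper: jobs with different hyperparameters map to
    (very likely) different ports in [base, base + span).
    """
    key = repr(sorted(hparams.items())).encode()
    offset = int(hashlib.sha256(key).hexdigest(), 16) % span
    return base + offset

# At the top of the training script, before torch.distributed initializes:
hparams = {"lr": 1e-4, "batch_size": 32}  # hypothetical sweep values
os.environ["MASTER_PORT"] = str(port_for_job(hparams))
```

Note that hash collisions between two jobs are still possible, which is exactly why a real per-job config option would be preferable.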

Additional context

The main reason I ran into trouble with environment variables is a port conflict when training multiple models at once. Every process tries to initialize torch.distributed on the same master port, and consequently the following error occurs:

torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53394 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:53394 (errno: 98 - Address already in use).

The way to avoid this error is to set the MASTER_PORT environment variable to a different value for each job; without that, running multiple training processes in parallel fails with port collisions.
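If hydra.job.env_set were honored by NeMo's launcher, one way to give each sweep job its own port would be to interpolate the Hydra job number into the value (a sketch; assumes the ${hydra:...} resolver is available and fewer than 10 jobs, so the concatenated port stays in a valid range):

```yaml
hydra:
  job:
    env_set:
      # job.num is 0, 1, 2, ... across a multi-run sweep, so jobs would
      # get ports 29500, 29501, ... via string concatenation
      MASTER_PORT: "2950${hydra:job.num}"
```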

titu1994 commented 3 months ago

We don't use the process launcher unless you use a Hydra sweep config. Can you try removing that? If not, we'll have to see how to implement your request. Of course, if you have a solution you're encouraged to send a PR.

domenVres commented 3 months ago

Is there a way to perform a hyperparameter search without the hydra sweep?

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 7 days since being marked as stale.