NVIDIA / NVFlare

NVIDIA Federated Learning Application Runtime Environment
https://nvidia.github.io/NVFlare/
Apache License 2.0

Support for installing client sites on HPC systems #2595

Open dirkpetersen opened 3 months ago

dirkpetersen commented 3 months ago

Is your feature request related to a problem? Please describe.

Some organizations have all their GPUs allocated in HPC systems and find it difficult to dedicate GPU servers to NVFlare. Currently, use on HPC systems is undocumented.

Describe the solution you'd like

In an ideal world, a client would be installed on a virtual machine that then submits jobs to an HPC system, so that the GPU is not allocated for long periods without being used.

Describe alternatives you've considered

I currently use this workaround and describe some of the issues with running on HPC systems (Slurm in this case): https://github.com/dirkpetersen/nvflare-cancer#install-a-client-on-hpc

YuanTingHsieh commented 3 months ago

Hi @dirkpetersen thanks for bringing this topic up!

One workaround, as you said, is to run an NVFlare client process (client_train.py) directly on a SLURM cluster node. This is the client monitoring process (CP), and it usually just waits for jobs to arrive, spawning a job process (CJ) to handle each one. As you observed, the CP sits idle waiting for jobs and does not use the GPU; it is the CJ that may use the GPU. So ideally, we want to run the CP outside of the GPU node and only start the job process on the GPU node when we need to.
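The pattern described above (hold a GPU only while a job is actually running) can be sketched by wrapping the training command in Slurm's srun from a long-running CPU-only process. This is an illustrative sketch, not NVFlare's actual implementation; the helper names, resource flags, and time limits are all assumptions:

```python
# Hedged sketch: a monitor process running on a non-GPU node launches each
# training job through srun, so the GPU allocation exists only for the
# lifetime of that job. Flags and commands are illustrative assumptions.
import shlex
import subprocess


def wrap_with_srun(train_cmd, gpus=1, time_limit="02:00:00"):
    """Prefix a training command with srun so a GPU is held only while it runs."""
    return [
        "srun",
        f"--gres=gpu:{gpus}",
        f"--time={time_limit}",
        *shlex.split(train_cmd),
    ]


def run_job(train_cmd):
    # Blocks until the Slurm job step finishes; the GPU is released afterwards.
    return subprocess.run(wrap_with_srun(train_cmd)).returncode
```

The key design point is that the waiting process itself never requests a GPU; only the short-lived child command does.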

We do have several ways to achieve that in an HPC cluster.

  1. Use 2.4.1:
  2. Use main branch:

As we are seeing more interest, we might write a full tutorial/reference implementation for this later.

dirkpetersen commented 3 months ago

Awesome, this is very helpful and I will try that!

dirkpetersen commented 3 months ago

@YuanTingHsieh, it seems that both options require client_train.py to run on the HPC login node. I think that is a reasonable assumption, at least for many life sciences HPC systems, which tend to have beefy login nodes and tolerant HPC admins. In other disciplines, login nodes are guarded more strictly, and running an agent there may not be allowed. For those sites, it would be better to run client_train.py on a system adjacent to the HPC cluster and submit the job via ssh and sbatch (for example, using paramiko). Perhaps this is a lower priority right now, as I understand that FL use cases are currently focused on life sciences?
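A minimal sketch of that ssh + sbatch submission from an adjacent machine, assuming paramiko is installed. The host name, user, partition, and script path are hypothetical placeholders, not anything from NVFlare:

```python
# Hedged sketch: submit a Slurm job script to an HPC login node over SSH.
# Host, user, partition, and script path below are illustrative assumptions.

def build_sbatch_command(script_path, partition="gpu", gpus=1, time_limit="04:00:00"):
    """Build the sbatch command line to execute on the remote login node."""
    return (
        f"sbatch --partition={partition} --gres=gpu:{gpus} "
        f"--time={time_limit} {script_path}"
    )


def submit_via_ssh(host, user, script_path):
    """Run sbatch on the remote login node via paramiko (hypothetical helper)."""
    import paramiko  # third-party: pip install paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username=user)
    try:
        _stdin, stdout, _stderr = client.exec_command(
            build_sbatch_command(script_path)
        )
        # sbatch prints e.g. "Submitted batch job 12345" on success
        return stdout.read().decode().strip()
    finally:
        client.close()
```

The job script submitted this way could itself launch client_train.py (or a single training run), keeping the GPU allocation limited to the job's lifetime.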

YuanTingHsieh commented 2 months ago

@dirkpetersen thanks for the discussion!

Yes, as you said: if you have a mechanism to submit the job via ssh and sbatch from machine A to your HPC system, then you can run the NVFlare client (client_train.py) on that machine A.

Note that you can also start the NVFlare client using the "start.sh" script in our startup kits; as you can see, we add some restart mechanisms inside that script as well.