cea-hpc / pcocc

Run VMs on an HPC cluster
GNU General Public License v3.0

Extend existing PCOCC cluster with new VMs #14

Open kees-closed opened 5 years ago

kees-closed commented 5 years ago

I have the following goal: Provide an isolated interactive node from where a user can submit new PCOCC jobs and log into them.

Setting up an isolated interactive VM can be done by submitting a PCOCC job with a single VM; that part is no problem with PCOCC. However, once users submit new jobs from this interactive VM, the network isolation in place prevents them from connecting to the newly instantiated PCOCC VMs. This isolation is of course by design.

However, being able to instantiate a new PCOCC job from this interactive VM and connect to it would allow more flexibility for isolated clusters. I would therefore like to know how feasible this feature is. Would it be possible to use Open vSwitch to interconnect running PCOCC jobs?

This may be possible by passing the job ID of the interactive VM to a PCOCC command when requesting more resources. The nodes of these jobs could then perhaps be interconnected via an Open vSwitch bridge, which would provide the means to connect to the newly instantiated VMs.
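To make the idea concrete, here is a rough sketch of what interconnecting the bridges of two jobs could look like. It assumes VXLAN tunnels between the hosts, which is only one possible approach and not necessarily how PCOCC sets up its networks. The bridge name `br-pcocc`, the port naming scheme, the VNI and the host addresses are all hypothetical. The script only builds the `ovs-vsctl` command lines and does not run them.

```python
def vxlan_port_cmd(bridge, port, remote_ip, vni):
    """Build the ovs-vsctl argv that adds one VXLAN tunnel port to a bridge."""
    return [
        "ovs-vsctl", "--may-exist", "add-port", bridge, port,
        "--", "set", "interface", port, "type=vxlan",
        f"options:remote_ip={remote_ip}", f"options:key={vni}",
    ]


def interconnect_cmds(job_a_hosts, job_b_hosts, bridge="br-pcocc", vni=100):
    """Return {host: [argv, ...]} linking every host of job A to every host of job B.

    job_a_hosts and job_b_hosts map a host name to its IP address.
    The bridge name, port names and VNI are illustrative assumptions.
    """
    cmds = {host: [] for host in list(job_a_hosts) + list(job_b_hosts)}
    for a_name, a_ip in job_a_hosts.items():
        for b_name, b_ip in job_b_hosts.items():
            # Each side needs a tunnel port pointing at the other side.
            cmds[a_name].append(vxlan_port_cmd(bridge, f"vx-{b_name}", b_ip, vni))
            cmds[b_name].append(vxlan_port_cmd(bridge, f"vx-{a_name}", a_ip, vni))
    return cmds


if __name__ == "__main__":
    interactive_job = {"node1": "10.0.0.1"}                 # hypothetical hosts
    new_job = {"node2": "10.0.0.2", "node3": "10.0.0.3"}
    for host, argvs in interconnect_cmds(interactive_job, new_job).items():
        for argv in argvs:
            print(f"[{host}]", " ".join(argv))
```

Note that the commands for the interactive job's host would have to be run on a node whose network was already configured when that job started.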

I suppose the active community has better insight into the following questions:

  1. Is such a setup feasible by using e.g. Open vSwitch? If so, how?
  2. Would this be a feature worth considering?

fdiakh commented 5 years ago

Dynamically extending virtual networks so that they can span multiple jobs is a feature that we've also considered but haven't had time to implement. The main issue is that the current implementation is very simple: all the networking setup is done in one go within the Slurm prolog and epilog steps. So when a second job starts on new nodes, there is no callback on the first job's nodes that would let you update their network configuration. Other than that, updating the Open vSwitch rules themselves shouldn't be too difficult, but supporting it would require significant refactoring.
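As a rough illustration of the callback problem, the sketch below works out which nodes would need a network update when a second job joins an existing virtual network. It assumes a full mesh between all nodes of the network, which is a simplification and not a description of the current PCOCC code. The function name and the `hook` labels are made up for this example and are not Slurm or PCOCC identifiers.

```python
def plan_network_extension(existing_nodes, new_nodes):
    """Return {node: {"add_peers": [...], "hook": ...}} for extending a network.

    "hook" is "prolog" when the update can happen in the new job's prolog,
    and None when the node belongs to the first job, where nothing runs
    at the time the second job starts.
    """
    existing = set(existing_nodes)
    new = set(new_nodes) - existing  # only nodes not already in the network
    plan = {}
    for node in sorted(existing | new):
        if node in new:
            # A new node must reach every other node, old or new.
            peers = (existing | new) - {node}
            plan[node] = {"add_peers": sorted(peers), "hook": "prolog"}
        else:
            # An existing node only needs the new peers, but has no hook.
            plan[node] = {"add_peers": sorted(new), "hook": None}
    return plan


if __name__ == "__main__":
    plan = plan_network_extension(["node1", "node2"], ["node3"])
    for node, step in plan.items():
        where = step["hook"] or "no hook available"
        print(f"{node}: add {step['add_peers']} ({where})")
```

The entries with `hook` set to `None` are the ones the current one-shot prolog/epilog design cannot reach, so they mark where the refactoring would be needed.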