DIRACGrid / DIRAC

DIRAC Grid
http://diracgrid.org
GNU General Public License v3.0

Adding Container Support #3381

Open sfayer opened 7 years ago

sfayer commented 7 years ago

Hi,

(Further to the DIRAC user workshop) these are my ideas for modifications to DIRAC to add container support for discussion...

[For GridPP we've got a workaround for the containers now, so this is lower priority for us than we initially indicated].

The goal would be to enable DIRAC to use containers in the following three ways:

  1. Within the pilot, effectively as another type of ComputeElement for isolation (like sudo/glexec).
  2. As a way to provide alternative platforms on a WN (e.g. so an EL6 job can run in a container on an EL7 WN).
  3. (Optionally) Allowing users to run their own containers.

There are a number of container systems available, so the support would need to be modular so that new systems can be added: I think Docker & Singularity support would initially cover the majority of use cases.

Configuration System

A global flag should be added in Operations to set whether UserContainers are supported or not. There should also be an option for setting the maximum size of the user container (and any other relevant limits?).

Each CE could have a boolean flag to indicate whether it is expected to support containers: /Resources/Sites/LCG/\<SiteName>/CEs/\<CEName>/Containers = True/False. If a CE doesn't have a Containers field, it would be assumed to be False.

Each SiteDirector could also have a "Containers = \<boolean>" flag to enable container support for a given VO. This would also default to False if not specified.

A list of supported container platforms would be added. In the configuration this will be a string, e.g.: /Resources/ContainerPlatforms = Singularity, Docker. Each ContainerPlatform should correspond to a matching Python module within DIRAC.

A new dictionary of mappings would be added, mapping platforms to container paths (this will be known as the container map): /Resources/Containers/\<Platform> = (\<module>:)\<Path>, e.g.: EL6 = Singularity:/cvmfs/cernvm-prod.cern.ch/cvm3, Docker:docker.io:/some/path
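Putting the options above together, the relevant part of the CS might look something like this (a sketch only; the site and CE names are made up, and the exact option layout would be up for discussion):

```
Resources
{
  ContainerPlatforms = Singularity, Docker
  Containers
  {
    EL6 = Singularity:/cvmfs/cernvm-prod.cern.ch/cvm3, Docker:docker.io:/some/path
  }
  Sites
  {
    LCG
    {
      LCG.Example.uk
      {
        CEs
        {
          ce01.example.ac.uk
          {
            Containers = True
          }
        }
      }
    }
  }
}
```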

Site Director

If the SiteDirector's Containers config setting is True, then the following would be done; otherwise there would be no change in behaviour.

When a site director receives a job/job queue, it submits pilot jobs to a site as usual. During the selection phase, it filters for CEs that support the job's required platform. This filter would need adjusting so that if "Containers = True", all of the platforms from the container map are added to the platforms supported by the CE. The raw platforms (i.e. all of the entries from ContainerPlatforms) should also be added if UserContainers is true.
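The platform-expansion part of that filter could be sketched as follows (illustrative Python only; the function and argument names are made up, not existing DIRAC APIs):

```python
def effective_platforms(ce_platforms, containers_enabled, container_map,
                        container_platforms, user_containers):
    """Return the set of platforms a CE can serve, per the proposal.

    ce_platforms: platforms the CE natively supports, e.g. {"EL7"}.
    container_map: {"EL6": "Singularity:/cvmfs/..."} from /Resources/Containers.
    container_platforms: the raw entries from /Resources/ContainerPlatforms.
    user_containers: the global UserContainers flag from Operations.
    """
    platforms = set(ce_platforms)
    if containers_enabled:
        # Any platform with an entry in the container map can be served
        # via a container on this CE.
        platforms.update(container_map)
        if user_containers:
            # User-supplied containers: expose the raw frameworks too.
            platforms.update(container_platforms)
    return platforms
```

For example, an EL7-only CE with `Containers = True` and an EL6 entry in the container map would then match both EL7 and EL6 jobs.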

Job Submission API (UserContainers only)

The Job API should be modified so that if a user submits a JDL containing a "Container = \<path>" and a "Platform" which matches a ContainerPlatform and UserContainers = True in the CS, then the job should be accepted. If Platform isn't a ContainerPlatform, then having a Container entry is an error. Setting Platform to a ContainerPlatform without including a Container = \<path> entry is also an error.
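The acceptance rules could be sketched like this (a hypothetical helper working on a plain dict rather than a real JDL; none of these names exist in DIRAC):

```python
def check_container_jdl(jdl, container_platforms, user_containers_enabled):
    """Validate the Container/Platform combination in a job description.

    Returns (ok, reason). `jdl` is a plain dict here for illustration.
    """
    container = jdl.get("Container")
    platform = jdl.get("Platform")
    if container is not None:
        if not user_containers_enabled:
            return False, "UserContainers is disabled in the CS"
        if platform not in container_platforms:
            return False, "Container given but Platform is not a ContainerPlatform"
        return True, ""
    # No Container entry: the Platform must not name a container framework.
    if platform in container_platforms:
        return False, "Platform is a ContainerPlatform but no Container path given"
    return True, ""
```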

Job Verification (UserContainers only)

If a job is using a UserContainer, then it should be checked against the limits specified in the CS. This could be done with an extra optimizer module or something similar. The checks themselves would need to be done via the Container module, as they would likely be system-specific (the Docker module would need to take all of the layers into account to measure the image size, for example).
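For the Docker case, the size check could look roughly like this (illustrative only; the layer-list input and the "Size" key are assumptions about what the Docker module would fetch, not a real API):

```python
def docker_image_size(layers):
    """Total size of a Docker image: the sum over all of its layers (bytes).

    `layers` is assumed to be a list of dicts with a "Size" entry, as the
    hypothetical Docker container module might return after inspecting
    the image.
    """
    return sum(layer["Size"] for layer in layers)

def within_user_limit(layers, max_bytes):
    """Check an image against the maximum-size limit from the CS."""
    return docker_image_size(layers) <= max_bytes
```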

Pilot (outer pilot)

On starting, the pilot should run a module which probes for container support for each of the "ContainerPlatform" modules. Each framework detected should be added to a new "ContainerSupport" field which is sent back to the matcher.
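The detection step could be as simple as probing PATH for each framework's binaries (a sketch; the probe table is hypothetical, and a real module would likely also try actually running the binary):

```python
import shutil

# Hypothetical table of binaries to probe per framework.
PROBES = {
    "Singularity": ["singularity", "apptainer"],
    "Docker": ["docker"],
}

def detect_container_support():
    """Return the list of frameworks usable on this worker node.

    A framework counts as available if any of its binaries is on PATH;
    this list would become the "ContainerSupport" field sent to the matcher.
    """
    return [name for name, binaries in PROBES.items()
            if any(shutil.which(b) for b in binaries)]
```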

Matcher

The matcher receives the pilot information, including the new "ContainerSupport" variable. When selecting a job, instead of just using the pilot platform, if there are one or more container frameworks on the list, then all of the container map platforms and the supported ContainerPlatforms should also be considered.
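A sketch of the expanded matching set (illustrative names only, mirroring the expansion done on the SiteDirector side):

```python
def matchable_platforms(pilot_platform, container_support, container_map,
                        container_platforms):
    """Platforms the matcher may consider for this pilot.

    container_support: frameworks reported by the pilot (may be empty).
    container_map: platform -> container path entries from the CS.
    """
    platforms = {pilot_platform}
    if container_support:
        # Every platform that has a container image becomes reachable...
        platforms.update(container_map)
        # ...as do the frameworks themselves (for user containers),
        # restricted to those the pilot actually reported.
        platforms.update(f for f in container_platforms if f in container_support)
    return platforms
```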

Pilot (inner pilot)

Once the matcher has selected a job, it is returned to the pilot. The pilot should then do the following things:

Regards, Simon

chaen commented 7 years ago

I know pretty much nothing about this aspect of DIRAC, but from the little I know, this seems pretty well crafted! :-)

fstagni commented 7 years ago

Excellent plan of work.

What you call the "Inner Pilot" looks, to me, like what we call a "computing element". By default Job Agents submit jobs to the "InProcess" computing element, or, as you also say, to a SudoCE. We would then need to create a "SingularityCE" (for example).

marianne013 commented 1 year ago

Given that SingularityCE is now used in production, shouldn't this issue be closed ?

fstagni commented 1 year ago

Well, the original plan was more elaborate, but if there's no real use case after these years, then we can close.

fstagni commented 2 months ago

A few discussions (re)surfaced lately (CC @zhangxiaomei and @andresailer), so I'll try a quick summary.

Containerization helps in 2 cases:

  1. For isolation, you just need "a" container where Dirac works, let's say Alma9.
  2. For the application, you need more precision, and "it depends".

The 2 cases do not exclude each other, and in fact it is common (at least in LHCb) to run containers (for applications) inside containers (for isolation).

A few points on the application containers, since at the moment Dirac does not provide any help:

For the platform part, LHCb developed a simple library: https://gitlab.cern.ch/lhcb-core/LbPlatformUtils which maybe could be useful for someone else. Up to @zhangxiaomei and @andresailer to try it out.

zhangxiaomei commented 1 month ago

@fstagni Something is not clear to me:

  1. If we want to allow applications to choose their own containers, do we need to extend the JDL to support this? The one Simon mentioned in the "Job Submission API (UserContainers only)" section @sfayer
  2. To make things simpler, is it possible to allow users to choose only from a few containers defined officially by the experiments, not private ones? The JUNO use case is like this: during the transition from CentOS7 to Alma9, some jobs will run in CentOS7 and others will run in Alma9. In this case, do we still need to describe the platform? I am not very convinced we need the platform.

fstagni commented 1 month ago
> 1. If we want to allow applications to choose their own containers, do we need to extend the JDL to support this? The one Simon mentioned in the "Job Submission API (UserContainers only)" section @sfayer

Not necessarily

> 2. To make things simpler, is it possible to allow users to choose only from a few containers defined officially by the experiments, not private ones? The JUNO use case is like this: during the transition from CentOS7 to Alma9, some jobs will run in CentOS7 and others will run in Alma9. In this case, do we still need to describe the platform? I am not very convinced we need the platform.

Do you use the SingularityCE?

zhangxiaomei commented 1 month ago
> 1. If we want to allow applications to choose their own containers, do we need to extend the JDL to support this? The one Simon mentioned in the "Job Submission API (UserContainers only)" section @sfayer
>
> Not necessarily

If not, how do the applications choose their own container inside it?

> 2. To make things simpler, is it possible to allow users to choose only from a few containers defined officially by the experiments, not private ones? The JUNO use case is like this: during the transition from CentOS7 to Alma9, some jobs will run in CentOS7 and others will run in Alma9. In this case, do we still need to describe the platform? I am not very convinced we need the platform.
>
> Do you use the SingularityCE?

We are using Pool/SingularityCE.

fstagni commented 3 weeks ago

I would refrain from adding something to the JDL: the dirac executable (written by the user) can take care of starting the job inside a container, if needed. Users might want to do very different things, and I doubt we can find common ground. The containers can be found, can be downloaded on the fly, can be pre-uploaded, etc., and all of this can be requested on resources with limited connectivity and so on. Honestly, not something I would embark on, not as a general DIRAC development.

The SingularityCE means that all your jobs would run inside the container anyway.

zhangxiaomei commented 3 weeks ago

I see. We are going to try launching a user-required container from inside application tools or systems (e.g. production). It could be easier.

fstagni commented 2 weeks ago

This comment is mostly for adding some info on non-DIRAC-related stuff that might be useful for non-CERN experiments:

Following https://indico.cern.ch/event/1318715/contributions/5912142/attachments/2845423/4974903/RCS-IT%20Technical%20Committee_%20Registry.pdf

CERN IT decided to develop a pilot for a container registry serving a distributed computing infrastructure with cache replicas at multiple locations:

  1. It should include, at least, a main replica at CERN with external caches that are part of the WLCG infrastructure (i.e. US/BNL and others).
  2. The pilot should build on CVMFS, and in particular on the previous efforts on integrating the Harbor registry at CERN with unpacked CVMFS.

Follow-up in https://docs.google.com/document/d/1ibV40EFbOckENZzsxbbQpcpgt8mBZ__yDSmtIDHF5kQ in case anyone's interested.