DIRACGrid / diracx

The neXt DIRAC incarnation
GNU General Public License v3.0

support for submitting jobs to Kubernetes #181

Open rptaylor opened 11 months ago

rptaylor commented 11 months ago

Hello,

Do you envision that diracx might have support for submitting jobs to kubernetes clusters (as kubernetes-native batch/v1 jobs), similar to the kubernetes plugin of Harvester for Panda, along with submitting to traditional batch clusters?

Thanks!

fstagni commented 11 months ago

Hello, we have no experience with or knowledge of Kubernetes-native batch. At first look, it seems to me that it could be just another plugin to add to https://github.com/DIRACGrid/DIRAC/tree/integration/src/DIRAC/Resources/Computing (whether in DIRAC or later in DiracX does not seem to make a difference to me).

What would the use case be?

rptaylor commented 11 months ago

Hi @fstagni ,

Thanks for the info. Kubernetes is quite popular and, aside from providing a wide array of capabilities that are not possible in traditional batch systems, is also approaching feature parity with them for batch scheduling. There are a few ATLAS T2 sites that are native Kubernetes batch clusters thanks to the k8s plugin of Harvester for Panda, which was developed around 2018 (for reference: CHEP2023 presentation, CHEP2023 paper). I was wondering whether experiments adopting DIRAC would also be able to support Kubernetes sites. Particularly for new experiments, it can be more feasible and attractive to start developing a distributed computing framework using modern cloud-native technologies.

It looks like the development effort involved would mainly involve writing a KubernetesComputingElement.py file? Just curious at this point. Authentication to the Kubernetes API can be done with X509 certificates (not proxies) or OIDC and tokens, presumably Dirac already has some support for that?

Thanks!

fstagni commented 11 months ago

> It looks like the development effort involved would mainly involve writing a KubernetesComputingElement.py file?

That would be the way to do it. Through these plugins, DIRAC supports the traditional HTCondor and ARC CEs as well as "SSH" CEs and computing clouds (https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/Resources/Computing/CloudComputingElement.py, which uses libcloud under the hood). DiracX will very likely use this DIRAC code to reach the same goal, so this could already be implemented in DIRAC.
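
To give a rough idea of what such a plugin would ultimately have to hand to the Kubernetes API, here is a minimal sketch that builds a batch/v1 Job manifest as a plain Python dictionary. It is only an illustration under stated assumptions: the function name `build_job_manifest`, the job name, the container image and the resource defaults are all hypothetical, and nothing here uses or reflects an existing DIRAC or DiracX interface.

```python
import json


def build_job_manifest(name, image, command, namespace="default",
                       cpu="1", memory="2Gi"):
    """Return a Kubernetes batch/v1 Job manifest as a dict.

    All argument values are illustrative; a real plugin would derive
    them from its own site and queue configuration.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumption: retries are left to the workload manager,
            # so Kubernetes is told not to retry failed pods.
            "backoffLimit": 0,
            # Let the cluster clean up the finished Job after one hour.
            "ttlSecondsAfterFinished": 3600,
            "template": {
                "spec": {
                    # Job pods must use "Never" or "OnFailure".
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "payload",
                            "image": image,
                            "command": list(command),
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    manifest = build_job_manifest(
        "pilot-0001",
        "example.org/pilot:latest",
        ["/bin/sh", "-c", "echo hello"],
    )
    print(json.dumps(manifest, indent=2))
```

The resulting JSON could be submitted with `kubectl apply -f` or through a Kubernetes client library; status polling and cleanup, which a computing element plugin would also need, are left out of this sketch.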

> Authentication to the Kubernetes API can be done with X509 certificates (not proxies) or OIDC and tokens, presumably Dirac already has some support for that?

That should not pose an issue.
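
As a hedged illustration of the two authentication options mentioned in the question, the sketch below builds the `users` entry of a kubeconfig file either from an X509 client certificate and key or from a bearer token (which is how an OIDC-issued token would typically be presented). The helper name `kubeconfig_user` and the example paths are hypothetical; only the kubeconfig field names `client-certificate`, `client-key` and `token` are taken from the Kubernetes configuration format.

```python
def kubeconfig_user(name, *, client_cert=None, client_key=None, token=None):
    """Return one kubeconfig `users` entry for certificate or token auth.

    Exactly one method must be chosen: either both client_cert and
    client_key (paths to PEM files), or token (a bearer token string).
    """
    uses_cert = client_cert is not None or client_key is not None
    if token is not None and uses_cert:
        raise ValueError("choose either a client certificate or a token")
    if token is not None:
        user = {"token": token}
    elif client_cert is not None and client_key is not None:
        user = {"client-certificate": client_cert, "client-key": client_key}
    else:
        raise ValueError("need client_cert and client_key, or token")
    return {"name": name, "user": user}


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    print(kubeconfig_user("site-x509",
                          client_cert="/path/to/cert.pem",
                          client_key="/path/to/key.pem"))
    print(kubeconfig_user("site-token", token="REPLACE_WITH_TOKEN"))
```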

Normally, since we are a small and busy group, we do not embark on developments without a requirement (from a VO using DIRAC). Questions:

rptaylor commented 11 months ago

Okay, thanks. For now I was just gathering information to see how much work it would take, how much of a priority it might be, whether it would be straightforward for a potential contributor to work on, etc.

In ATLAS, NET2 in the US is also k8s-native, as are the ATLAS Google Cloud project and a site in Taiwan. Several other sites are also interested and experimenting; in total there are 7 Panda queues for Kubernetes in ATLAS. I'm not sure what other VOs they might support, but if ATLAS is the only VO using a workflow management system (Panda + Harvester) that supports Kubernetes (as far as I know; I could be wrong), that would limit the options for adoption by other VOs. As for new experiments, SKAO is looking into a Kubernetes-based approach and has considered using DIRAC.