StatCan / openmpp

Implementing the OpenM++ microsimulation framework as a Kubernetes service on the StatCan cloud.

Implement MPI Job clean-up Cron Job #40

Open KrisWilliamson opened 7 months ago

KrisWilliamson commented 7 months ago

Continuation of https://github.com/StatCan/openmpp/issues/37

KrisWilliamson commented 7 months ago

Proposed implementation

---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: openmpp-uat
  name: mpi-cleanup
rules:
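# The cleanup command lists and deletes MPIJob resources, so the Role needs
# get/list/delete on mpijobs (Kubeflow MPI Operator CRD, API group kubeflow.org).
- apiGroups:
  - kubeflow.org
  resources:
  - mpijobs
  verbs:
  - get
  - list
  - delete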
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mpi-cleanup
  namespace: openmpp-uat
subjects:
- kind: ServiceAccount
  name: sa-mpi-cleanup
  namespace: openmpp-uat
roleRef:
  kind: Role
  name: mpi-cleanup
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-mpi-cleanup
  namespace: openmpp-uat
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mpi-cleanup
  namespace: openmpp-uat
spec:
  schedule: "0 6 * * 0"  # 06:00 every Sunday ("* 6 * * 0" would fire every minute of that hour)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: sa-mpi-cleanup
          containers:
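          - name: mpi-cleanup
            # Placeholder image: busybox ships neither kubectl nor GNU date, so
            # substitute an image that provides kubectl, GNU date, awk, and xargs.
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            # Delete MPIJobs created more than 24 hours ago. The cutoff is formatted like
            # creationTimestamp (UTC, "Z" suffix) so the string comparison in awk is valid.
            - kubectl get mpijobs -o go-template --template '{{range .items}}{{.metadata.name}} {{.metadata.creationTimestamp}}{{"\n"}}{{end}}' | awk '$2 <= "'$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)'" { print $1 }' | xargs --no-run-if-empty kubectl delete mpijob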
          restartPolicy: OnFailure

The example above contains placeholders, such as the service account sa-mpi-cleanup and the roles.

A decision is also needed on when the cron job runs and how old jobs must be before they are cleaned up (currently once a week, for jobs older than 24 hours).
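
To make the age threshold easier to reason about, here is a minimal sketch of the filter step on its own. It assumes the input is one "name creationTimestamp" pair per line, with timestamps in the UTC form Kubernetes reports (for example 2024-01-01T00:00:00Z). In that fixed-width format a plain string comparison orders timestamps chronologically. The function name older_than and the sample job names are hypothetical.

```shell
#!/bin/sh
# older_than CUTOFF
# Reads "name timestamp" lines on stdin and prints the names whose timestamp
# is at or before CUTOFF. Both values are compared as strings, which is safe
# only because they share the same fixed-width UTC format.
older_than() {
  awk -v cutoff="$1" '$2 <= cutoff { print $1 }'
}

# Sample data with hypothetical job names; only job-old is at or before the cutoff.
printf '%s\n' \
  'job-old 2024-01-01T00:00:00Z' \
  'job-new 2024-01-09T12:00:00Z' \
  | older_than '2024-01-08T00:00:00Z'
```

In the cron job the cutoff would come from the clock instead of a literal, for example older_than "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)", which relies on GNU date being available in the container image.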

Souheil-Yazji commented 7 months ago

This is good, but we don't want this solution limited to a single namespace; it should be cluster-wide.
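
A rough sketch of what the cluster-wide RBAC could look like, replacing the namespaced Role and RoleBinding with a ClusterRole and ClusterRoleBinding. It assumes the MPIJob resources come from the Kubeflow MPI Operator (API group kubeflow.org) and reuses the placeholder names from the proposal above. Treat it as a starting point, not a reviewed manifest.

```yaml
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mpi-cleanup          # cluster-scoped, so there is no namespace field
rules:
- apiGroups:
  - kubeflow.org             # assumption: MPIJob CRD from the Kubeflow MPI Operator
  resources:
  - mpijobs
  verbs:
  - get
  - list
  - delete
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mpi-cleanup
subjects:
- kind: ServiceAccount
  name: sa-mpi-cleanup       # placeholder name from the proposal
  namespace: openmpp-uat     # the ServiceAccount itself still lives in one namespace
roleRef:
  kind: ClusterRole
  name: mpi-cleanup
  apiGroup: rbac.authorization.k8s.io
```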

KrisWilliamson commented 7 months ago

Will do.
Also, can I get feedback on how often this should be run (daily, weekly?) and how old the jobs should be before they are deleted (1 day, 1 week, something else?)

Souheil-Yazji commented 7 months ago

We can run the job at midnight and delete any MPI jobs older than 7 days. We can start with that and modify it later if needed.
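
For reference, a hedged sketch of a CronJob with those settings: it runs daily at midnight and deletes MPI jobs older than 7 days in every namespace. The image name is hypothetical and stands in for any image that provides kubectl, GNU date, and awk. The service account is the placeholder from the proposal and would need cluster-wide get/list/delete on mpijobs. Midnight is interpreted in the time zone of the cluster's controller manager, which is assumed here to be UTC.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mpi-cleanup
  namespace: openmpp-uat
spec:
  schedule: "0 0 * * *"      # every day at 00:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: sa-mpi-cleanup
          containers:
          - name: mpi-cleanup
            # Hypothetical image: must provide kubectl, GNU date, and awk.
            image: example.registry/kubectl-tools:latest
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - |
              # Cutoff in the same UTC format as creationTimestamp.
              cutoff=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)
              # List namespace, name, and creation time for every MPIJob,
              # keep those at or before the cutoff, and delete each one.
              kubectl get mpijobs --all-namespaces -o go-template \
                --template '{{range .items}}{{.metadata.namespace}} {{.metadata.name}} {{.metadata.creationTimestamp}}{{"\n"}}{{end}}' \
                | awk -v cutoff="$cutoff" '$3 <= cutoff { print $1, $2 }' \
                | while read -r ns name; do
                    kubectl delete mpijob "$name" -n "$ns"
                  done
          restartPolicy: OnFailure
```

Printing the namespace alongside the name matters once the query spans namespaces, because kubectl delete needs the namespace of each job.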