Closed Souheil-Yazji closed 9 months ago
It is understood that this issue refers to the clean-up of Kubernetes objects. See:
- Kubernetes garbage collection: https://kubernetes.io/docs/concepts/architecture/garbage-collection/
- Automatic clean-up of finished Jobs: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
Adding the ttlSecondsAfterFinished flag to https://github.com/StatCan/openmpp/blob/main/mpi-job-files/MPIJobTemplate.yaml, as in this example, will ensure orphaned jobs are cleaned up after completion (Error or Successful):
spec:
  ttlSecondsAfterFinished: 100
  template:
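For context, on a plain Kubernetes Job the flag sits at the top level of the Job's spec. A minimal illustrative manifest (the name, image, and command are hypothetical, not from the OpenM++ templates):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo                    # hypothetical name
spec:
  ttlSecondsAfterFinished: 100      # Job object is garbage-collected ~100s after it finishes
  template:
    spec:
      containers:
        - name: main
          image: busybox:1.28
          command: ["sh", "-c", "echo done"]
      restartPolicy: Never
```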
More research is needed to determine whether the MPIJob template needs this entry in more than one place, since it contains multiple spec fields.
A decision is needed for this issue.
Given that the Job launcher is being rewritten, the first option seems appropriate; however, the second option is a very quick fix, uses built-in Kubernetes functionality, and could be combined with the rewrite.
After some testing, I got the following error:
error when creating "MPIJob-.yaml": MPIJob in version "v1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.ttlSecondsAfterFinished"
This flag was added in Kubernetes v1.21 as experimental and became stable in v1.23. The server version is v1.26.6.
This may be an issue with the mpijobs implementation.
If the ttlSecondsAfterFinished flag is moved into the runPolicy field, it does not cause an error:
#slotsPerWorker: Tensorflow example sets this attribute, but Pat's example template doesn't.
runPolicy:
  cleanPodPolicy: Running
  ttlSecondsAfterFinished: 100
mpiReplicaSpecs:
However, it does not seem to clean up the job either.
According to https://elastisys.io/compliantkubernetes/user-guide/safeguards/enforce-job-ttl/, by default in Compliant Kubernetes, Jobs that do not explicitly set a TTL (spec.ttlSecondsAfterFinished) automatically get a TTL of 7 days.
This needs to be tested.
Also, the ttl flag does not set the exact clean-up time; it only ensures that garbage collection happens some time after the TTL has elapsed. Garbage collection is supposed to run every two minutes, but the cluster administrator can configure a longer period or trigger it on other criteria, such as resource usage.
After testing, it seems orphaned jobs are not cleaned up after 7 days, and the ttl flag is not respected at all.
Also, although the flag does not cause an error, the training-operator may not implement it. The MPI-Operator spec says this was implemented in v2, but we are still using v1 (see the template), which might be another issue.
Another example shows the ttl field in the completionPolicy section.
One suggestion would be to create a K8s CronJob that deletes mpijob resources older than a given age in all namespaces, run nightly.
First attempt at defining a K8s CronJob manifest:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mpi-cleanup        # must be a lowercase RFC 1123 name; "mpiCleanup" would be rejected
  namespace: openmpp-uat
spec:
  schedule: "0 6 * * 0"    # 06:00 UTC every Sunday; "* 6 * * 0" would fire every minute of that hour
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: sa-mpi-cleanup   # service account with the RBAC permissions defined below
          containers:
            - name: mpi-cleanup
              image: bitnami/kubectl:latest    # busybox does not ship kubectl; an image that does is required
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - kubectl get mpijobs -o go-template --template '{{range .items}}{{.metadata.name}} {{.metadata.creationTimestamp}}{{"\n"}}{{end}}' | awk '$2 <= "'$(date -d'now-24 hours' -Ins --utc | sed 's/+0000/Z/')'" { print $1 }' | xargs --no-run-if-empty kubectl delete mpijob
          restartPolicy: OnFailure
The schedule "0 6 * * 0" runs the job Sundays at midnight local time (offset 6 hours because the server runs on UTC); note that "* 6 * * 0" would instead fire every minute of that hour. RBAC settings may be required.
kubectl get mpijobs -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .status.conditions[1].lastUpdateTime| fromdate }]"
kubectl get mpijobs -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .status.conditions[1].lastUpdateTime| fromdate } | select(.status.conditions[1].lastUpdateTime < ( now | . - 3600))]" | jq -r ".[].name"
The first line gets a list of mpijobs and their lastUpdateTime (converted to epoch seconds).
The second filters this to get the names of jobs that are older than 1 hour (3600 seconds).
kubectl get mpijobs -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .status.conditions[1].lastUpdateTime| fromdate } | select(.status.conditions[1].lastUpdateTime < ( now | . - 3600))]" | jq -r ".[].name" | xargs -r -L1 kubectl delete mpijob
This example creates a list of mpijobs older than an hour and pipes the output to the kubectl delete mpijob command via xargs.
The now comparison does not seem to be working, meaning the select is returning everything. This is likely because the select runs on the rebuilt objects, where .status.conditions[1].lastUpdateTime is null, and in jq null sorts before any number, so the comparison is always true.
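For reference, the jq approach can likely be fixed by selecting on the original items and converting the timestamp to epoch seconds before comparing. A self-contained sketch on hypothetical inline data (a stand-in for `kubectl get mpijobs -o json`; the job names are illustrative):

```shell
# Two fake items: one far in the past, one far in the future.
echo '{"items":[
  {"metadata":{"name":"old-job"},"status":{"conditions":[{},{"lastUpdateTime":"2020-01-01T00:00:00Z"}]}},
  {"metadata":{"name":"new-job"},"status":{"conditions":[{},{"lastUpdateTime":"2999-01-01T00:00:00Z"}]}}]}' |
jq -r '.items[]
  | select((.status.conditions[1].lastUpdateTime | fromdate) < (now - 3600))
  | .metadata.name'
# prints: old-job
```

The key change is that `fromdate` converts the ISO-8601 string to epoch seconds before the numeric comparison, and the select is applied before the object is rebuilt.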
Updated command. This works: it deletes mpijobs more than 1 hour old.
kubectl get mpijobs -o go-template --template '{{range .items}}{{.metadata.name}} {{.metadata.creationTimestamp}}{{"\n"}}{{end}}' | awk '$2 <= "'$(date -d'now-1 hours' -Ins --utc | sed 's/+0000/Z/')'" { print $1 }' | xargs --no-run-if-empty kubectl delete mpijob
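The awk comparison works because ISO-8601 UTC timestamps sort lexicographically in chronological order, so a plain string `<=` against the cutoff is a valid date comparison. A self-contained sketch on hypothetical "name creationTimestamp" lines (stand-ins for the go-template output above; the real command computes the cutoff with `date`):

```shell
# Hypothetical cutoff; jobs with a timestamp at or before it are selected.
cutoff="2024-06-01T00:00:00Z"
printf '%s\n' \
  "old-job 2024-05-30T10:00:00Z" \
  "new-job 2024-06-02T10:00:00Z" |
awk -v c="$cutoff" '$2 <= c { print $1 }'
# prints: old-job
```

The original command inlines the cutoff via shell substitution instead of `awk -v`, but the comparison is identical.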
RBAC settings:
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mpi-cleanup
rules:
  - apiGroups:
      - kubeflow.org        # mpijobs belong to the kubeflow.org API group
    resources:
      - mpijobs
    verbs:
      - 'list'              # needed by `kubectl get mpijobs`
      - 'get'
      - 'delete'
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1   # v1beta1 was removed in Kubernetes v1.22
metadata:
  name: mpi-cleanup
subjects:
  - kind: ServiceAccount
    name: sa-mpi-cleanup
    namespace: openmpp-uat  # a ServiceAccount subject must specify its namespace
roleRef:
  kind: ClusterRole         # a ClusterRoleBinding must reference a ClusterRole, not a Role
  name: mpi-cleanup
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-mpi-cleanup
  namespace: openmpp-uat
At Charles request, this issue is being closed and continued in https://github.com/StatCan/openmpp/issues/40
kubectl get job --all-namespaces
kubectl delete jobs --all-namespaces
kubectl get mpijobs --all-namespaces
Error from server (Forbidden): mpijobs.kubeflow.org is forbidden: User "system:serviceaccount:kristian-williamson:default-editor" cannot list resource "mpijobs" in API group "kubeflow.org" at the cluster scope
We need a convenient way to clean up MPIjobs from the cluster.
Failed Jobs
We could possibly write failed job logs to a file and then delete the jobs right away.
Completed Jobs
Upon completion, write the logs to a file and then delete the job.
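A minimal sketch of that clean-up step, assuming the mpi-operator's usual `<mpijob>-launcher` naming for the launcher Job (the function name, log path, and job name are illustrative; setting DRY_RUN=echo previews the commands instead of running them):

```shell
#!/bin/sh
# Hypothetical helper: save an MPIJob's launcher logs to a file, then delete the MPIJob.
DRY_RUN="${DRY_RUN:-}"   # set DRY_RUN=echo to print the kubectl commands instead of executing them

archive_and_delete() {
  job="$1"
  # mpi-operator typically names the launcher Job "<mpijob>-launcher" (assumption)
  $DRY_RUN kubectl logs "job/${job}-launcher" > "/tmp/${job}.log"
  $DRY_RUN kubectl delete mpijob "$job"
}

# Preview only, with a hypothetical job name:
DRY_RUN=echo archive_and_delete model-run-42
```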