Closed Souheil-Yazji closed 9 months ago
It is understood that this issue refers to the clean-up of Kubernetes objects. See:
- Kubernetes garbage collection: https://kubernetes.io/docs/concepts/architecture/garbage-collection/
- Automatic clean-up of finished Jobs: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
Adding the ttlSecondsAfterFinished flag to https://github.com/StatCan/openmpp/blob/main/mpi-job-files/MPIJobTemplate.yaml, as in this example, will ensure orphaned jobs are cleaned up after completion (Error or Successful):
spec:
  ttlSecondsAfterFinished: 100
  template:
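For context, on a plain Kubernetes Job the flag sits at the top level of the Job's spec. A minimal illustrative manifest (the name, image, and command are hypothetical, not from the OpenM++ templates):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo                    # hypothetical name
spec:
  ttlSecondsAfterFinished: 100      # Job object is garbage-collected ~100s after it finishes
  template:
    spec:
      containers:
        - name: main
          image: busybox:1.28
          command: ["sh", "-c", "echo done"]
      restartPolicy: Never
```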
More research is needed to determine whether the MPIJob template needs this entry in more than one place, since it contains multiple spec fields.
A decision is needed for this issue.
Given that the Job launcher is being rewritten, the first option seems appropriate; however, the second option is a very quick fix, uses built-in Kubernetes functionality, and could be combined with the rewrite.
After some testing, I got the following error:
error when creating "MPIJob-.yaml": MPIJob in version "v1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.ttlSecondsAfterFinished"
This flag was added in Kubernetes v1.21 as experimental and became stable in v1.23. The server version is v1.26.6.
This may be an issue with the mpijobs implementation.
If the ttlSecondsAfterFinished flag is moved into the runPolicy field, it does not cause an error:
#slotsPerWorker: Tensorflow example sets this attribute, but Pat's example template doesn't.
runPolicy:
  cleanPodPolicy: Running
  ttlSecondsAfterFinished: 100
mpiReplicaSpecs:
However, it does not seem to clean up the job either.
According to https://elastisys.io/compliantkubernetes/user-guide/safeguards/enforce-job-ttl/, by default in Compliant Kubernetes, Jobs that do not explicitly set a TTL (spec.ttlSecondsAfterFinished) automatically get a TTL of 7 days.
This needs to be tested.
Also, the ttl flag does not set the exact clean-up time; it only ensures that garbage collection happens some time after the TTL has elapsed. Garbage collection is supposed to run every two minutes, but the cluster administrator can configure a longer period or trigger it on other criteria, such as resource usage.
After testing, it seems orphaned jobs are not cleaned up after 7 days, and the ttl flag is not respected at all.
Also, although the flag does not cause an error, the training-operator may not implement it. The MPI-Operator spec says this was implemented in v2, but we are still using v1 (see the template), which might be another issue.
Another example shows the ttl field in the completionPolicy section.
One suggestion would be to create a K8s CronJob that deletes mpijob resources older than a given age in all namespaces, run nightly.
First attempt at defining a K8s CronJob manifest:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mpi-cleanup        # must be a lowercase RFC 1123 name; "mpiCleanup" would be rejected
  namespace: openmpp-uat
spec:
  schedule: "0 6 * * 0"    # 06:00 UTC every Sunday; "* 6 * * 0" would fire every minute of that hour
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: sa-mpi-cleanup   # service account with the RBAC permissions defined below
          containers:
            - name: mpi-cleanup
              image: bitnami/kubectl:latest    # busybox does not ship kubectl; an image that does is required
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - kubectl get mpijobs -o go-template --template '{{range .items}}{{.metadata.name}} {{.metadata.creationTimestamp}}{{"\n"}}{{end}}' | awk '$2 <= "'$(date -d'now-24 hours' -Ins --utc | sed 's/+0000/Z/')'" { print $1 }' | xargs --no-run-if-empty kubectl delete mpijob
          restartPolicy: OnFailure
The schedule "0 6 * * 0" runs the job Sundays at midnight local time (offset 6 hours because the server runs on UTC); note that "* 6 * * 0" would instead fire every minute of that hour. RBAC settings may be required.
kubectl get mpijobs -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .status.conditions[1].lastUpdateTime| fromdate }]"
kubectl get mpijobs -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .status.conditions[1].lastUpdateTime| fromdate } | select(.status.conditions[1].lastUpdateTime < ( now | . - 3600))]" | jq -r ".[].name"
The first line gets a list of mpijobs and their lastUpdateTime (converted to epoch seconds).
The second filters this to get the names of jobs that are older than 1 hour (3600 seconds).
kubectl get mpijobs -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .status.conditions[1].lastUpdateTime| fromdate } | select(.status.conditions[1].lastUpdateTime < ( now | . - 3600))]" | jq -r ".[].name" | xargs -r -L1 kubectl delete mpijob
This example creates a list of mpijobs older than an hour and pipes the output to the kubectl delete mpijob command via xargs.
The now comparison does not seem to be working, meaning the select is returning everything. This is likely because the select runs on the rebuilt objects, where .status.conditions[1].lastUpdateTime is null, and in jq null sorts before any number, so the comparison is always true.
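For reference, the jq approach can likely be fixed by selecting on the original items and converting the timestamp to epoch seconds before comparing. A self-contained sketch on hypothetical inline data (a stand-in for `kubectl get mpijobs -o json`; the job names are illustrative):

```shell
# Two fake items: one far in the past, one far in the future.
echo '{"items":[
  {"metadata":{"name":"old-job"},"status":{"conditions":[{},{"lastUpdateTime":"2020-01-01T00:00:00Z"}]}},
  {"metadata":{"name":"new-job"},"status":{"conditions":[{},{"lastUpdateTime":"2999-01-01T00:00:00Z"}]}}]}' |
jq -r '.items[]
  | select((.status.conditions[1].lastUpdateTime | fromdate) < (now - 3600))
  | .metadata.name'
# prints: old-job
```

The key change is that `fromdate` converts the ISO-8601 string to epoch seconds before the numeric comparison, and the select is applied before the object is rebuilt.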
Updated command. This works: it deletes mpijobs more than 1 hour old.
kubectl get mpijobs -o go-template --template '{{range .items}}{{.metadata.name}} {{.metadata.creationTimestamp}}{{"\n"}}{{end}}' | awk '$2 <= "'$(date -d'now-1 hours' -Ins --utc | sed 's/+0000/Z/')'" { print $1 }' | xargs --no-run-if-empty kubectl delete mpijob
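The awk comparison works because ISO-8601 UTC timestamps sort lexicographically in chronological order, so a plain string `<=` against the cutoff is a valid date comparison. A self-contained sketch on hypothetical "name creationTimestamp" lines (stand-ins for the go-template output above; the real command computes the cutoff with `date`):

```shell
# Hypothetical cutoff; jobs with a timestamp at or before it are selected.
cutoff="2024-06-01T00:00:00Z"
printf '%s\n' \
  "old-job 2024-05-30T10:00:00Z" \
  "new-job 2024-06-02T10:00:00Z" |
awk -v c="$cutoff" '$2 <= c { print $1 }'
# prints: old-job
```

The original command inlines the cutoff via shell substitution instead of `awk -v`, but the comparison is identical.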
RBAC settings:
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mpi-cleanup
rules:
  - apiGroups:
      - kubeflow.org        # mpijobs belong to the kubeflow.org API group
    resources:
      - mpijobs
    verbs:
      - 'list'              # needed by `kubectl get mpijobs`
      - 'get'
      - 'delete'
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1   # v1beta1 was removed in Kubernetes v1.22
metadata:
  name: mpi-cleanup
subjects:
  - kind: ServiceAccount
    name: sa-mpi-cleanup
    namespace: openmpp-uat  # a ServiceAccount subject must specify its namespace
roleRef:
  kind: ClusterRole         # a ClusterRoleBinding must reference a ClusterRole, not a Role
  name: mpi-cleanup
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-mpi-cleanup
  namespace: openmpp-uat
At Charles request, this issue is being closed and continued in https://github.com/StatCan/openmpp/issues/40
kubectl get job --all-namespaces
kubectl delete jobs --all-namespaces
kubectl get mpijobs --all-namespaces
Error from server (Forbidden): mpijobs.kubeflow.org is forbidden: User "system:serviceaccount:kristian-williamson:default-editor" cannot list resource "mpijobs" in API group "kubeflow.org" at the cluster scope
We need a convenient way to clean up MPIjobs from the cluster.
Failed Jobs
We could possibly write failed job logs to a file and then delete the jobs right away.
Completed Jobs
Upon completion, write the logs to a file and then delete the job.
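A minimal sketch of that clean-up step, assuming the mpi-operator's usual `<mpijob>-launcher` naming for the launcher Job (the function name, log path, and job name are illustrative; setting DRY_RUN=echo previews the commands instead of running them):

```shell
#!/bin/sh
# Hypothetical helper: save an MPIJob's launcher logs to a file, then delete the MPIJob.
DRY_RUN="${DRY_RUN:-}"   # set DRY_RUN=echo to print the kubectl commands instead of executing them

archive_and_delete() {
  job="$1"
  # mpi-operator typically names the launcher Job "<mpijob>-launcher" (assumption)
  $DRY_RUN kubectl logs "job/${job}-launcher" > "/tmp/${job}.log"
  $DRY_RUN kubectl delete mpijob "$job"
}

# Preview only, with a hypothetical job name:
DRY_RUN=echo archive_and_delete model-run-42
```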