antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

Cleanup of kind cluster #6768

Open jainpulkit22 opened 3 weeks ago

jainpulkit22 commented 3 weeks ago

Describe the bug The CI jobs fail because of a panic during cleanup of existing kind clusters. The current implementation of the cleanup function for kind clusters fetches the creation timestamp of every available kind cluster with the command `kubectl get nodes --context kind-$kind_cluster_name -o json -l node-role.kubernetes.io/control-plane | jq -r '.items[0].metadata.creationTimestamp'`. Sometimes another job running on the same VM has just started and its cluster creation is still in progress, so that cluster's context is not ready yet; when a new job then tries to create its own cluster, its cleanup step gets stuck on this command, panics, and fails.
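
For illustration, a minimal sketch of the failure mode (the cluster name here is hypothetical; the actual cleanup function is linked under Additional context below):

```bash
# Hypothetical cluster whose creation is still in flight, so its
# kubeconfig context has not been registered yet.
kind_cluster_name="ci-job"

# CI scripts commonly run with `set -eo pipefail` (an assumption here).
# kubectl exits non-zero because the context does not exist yet, so the
# pipeline fails and the whole cleanup step aborts with it.
kubectl get nodes --context "kind-$kind_cluster_name" -o json \
    -l node-role.kubernetes.io/control-plane \
  | jq -r '.items[0].metadata.creationTimestamp'
```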

The problem is not limited to parallel job runs. If a job is aborted during the cluster creation phase, the context for that kind cluster never becomes available; whenever a new job later runs on this testbed and executes the cleanup function, it tries to fetch the context of every cluster listed by `kind get clusters` using the command above, panics, and fails.
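
One possible mitigation, shown only as a sketch under the assumption that the script enumerates clusters with `kind get clusters` (this is not the project's actual fix), is to check that a cluster's context exists before querying it:

```bash
# Sketch: defensively handle clusters whose context is missing or not
# yet ready, instead of letting the pipeline abort the whole job.
for kind_cluster_name in $(kind get clusters); do
  context="kind-$kind_cluster_name"
  # Skip timestamp-based logic if the context is not registered yet
  # (creation in progress, or an aborted creation left the cluster behind).
  if ! kubectl config get-contexts "$context" >/dev/null 2>&1; then
    echo "context $context not available, skipping timestamp check"
    continue
  fi
  creation_timestamp=$(kubectl get nodes --context "$context" -o json \
      -l node-role.kubernetes.io/control-plane \
    | jq -r '.items[0].metadata.creationTimestamp' || true)
  echo "cluster $kind_cluster_name created at ${creation_timestamp:-unknown}"
done
```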

To Reproduce Trigger two kind jobs at the same time on the same VM; or trigger one job, abort it as soon as cluster creation starts, and then trigger a new job on the same testbed. In both cases the second job fails because of the panic.
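
A rough reproduction sketch for the parallel case (cluster names and timing are illustrative):

```bash
# Terminal 1 (job A): start creating a kind cluster.
kind create cluster --name job-a

# Terminal 2 (job B): while job A's creation is still in progress,
# run the cleanup step; it enumerates all clusters, including job A's
# half-created one whose context is not usable yet.
for name in $(kind get clusters); do
  kubectl get nodes --context "kind-$name" -o json \
      -l node-role.kubernetes.io/control-plane \
    | jq -r '.items[0].metadata.creationTimestamp'
done
# The query against job A's cluster fails, and under `set -e` the
# cleanup (and hence job B) aborts.
```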

Expected The jobs should not fail, and cluster creation should succeed.

Actual behavior The job fails.

Additional context Reference to the current implementation of the cleanup function: clean_kind


rajnkamr commented 3 weeks ago

Duplicate of #5753

jainpulkit22 commented 3 weeks ago

> Duplicate of #5753

This is a different issue: it is about the implementation of the context-based cleanup of clusters. The issue you mentioned has already been taken care of; this one is a bug in the implementation of the fix for the issue you pointed out. Also, this is not related to cleanup of the Antrea installation; it concerns deletion of the cluster, i.e. the cleanup of the testbed that happens before the test starts.