MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

Azure DevOps Run-Once Self-hosted Agents in AKS - CrashLoopBackOff #95961

Open amgmdz opened 2 years ago

amgmdz commented 2 years ago

We're running Linux-based (Ubuntu) ADO agents hosted on AKS clusters. The agents were created by following the steps in this doc: https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops#create-and-build-the-dockerfile-1

In our agent pools, the entrypoint script "start.sh" is modified so that each agent runs only once, which ensures that agents are refreshed after every run:

./run-docker.sh interactive --once && config.sh remove & wait $! cleanup
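For reference, a run-once tail for start.sh is usually written as "run the agent with --once in the background, wait for it, then deregister", along the lines of the cleanup function in the Microsoft Docs start.sh sample. The sketch below assumes that sample's layout (config.sh and run-docker.sh in the working directory, the PAT stored in the file named by AZP_TOKEN_FILE); the function names are hypothetical. It does not by itself avoid the CrashLoopBackOff, because the script still exits with code 0 afterwards.

```shell
#!/usr/bin/env bash
# Sketch only: assumes config.sh / run-docker.sh from the Microsoft Docs
# agent image and a PAT stored in the file named by AZP_TOKEN_FILE.

cleanup() {
  # Deregister this agent from the pool. Removal fails while a job is
  # still running, so retry until it succeeds.
  if [ -e ./config.sh ]; then
    until ./config.sh remove --unattended --auth PAT --token "$(cat "$AZP_TOKEN_FILE")"; do
      echo "Agent removal failed, retrying in 30 seconds..."
      sleep 30
    done
  fi
}

run_once() {
  # Run the agent for a single job in the background and wait for it,
  # so the shell can still react to TERM/INT signals.
  ./run-docker.sh "$@" --once &
  wait $!
  local status=$?
  cleanup
  return "$status"
}
```

In start.sh this would be called as `run_once "$@"`; the separate `cleanup` step replaces the `config.sh remove & wait $! cleanup` chain, where `cleanup` ends up as an extra argument to `wait` instead of running as a command.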

When an agent finishes its job and exits, the container exits with code 0, but Kubernetes still treats this as a crash: the pod goes into CrashLoopBackOff and each restart is delayed by an exponential back-off.

I understand this is Kubernetes behavior that may be beyond ADO's control, because there is no way to tune the back-off, and pods managed by a Deployment only allow restartPolicy: Always. But we were wondering whether there are any workarounds for this issue.
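The back-off here is the kubelet's container restart delay. According to the Kubernetes pod lifecycle documentation, it starts at 10 seconds, doubles on each restart, is capped at five minutes, and is reset once a container has run for 10 minutes without problems. The helper below is only an illustrative model of that documented schedule (it does not query the cluster), and its name is hypothetical:

```shell
#!/usr/bin/env bash
# Illustrative model of the documented kubelet restart back-off:
# start at 10 s, double after each restart, cap at 300 s.
backoff_schedule() {
  local restarts=$1
  local delay=10
  local cap=300
  local -a out=()
  local i
  for ((i = 0; i < restarts; i++)); do
    out+=("$delay")
    delay=$((delay * 2))
    if ((delay > cap)); then
      delay=$cap
    fi
  done
  echo "${out[*]}"
}

backoff_schedule 7   # prints: 10 20 40 80 160 300 300
```

With short pipelines (under 10 minutes per job), the timer never resets, so after a handful of jobs a run-once agent pod can wait up to five minutes before its agent registers again.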

This matters to us because Microsoft-hosted Azure DevOps agents cannot communicate with private AKS clusters and other private resources, which forces us to use self-hosted agents.



MarileeTurscak-MSFT commented 2 years ago

@amgmdz Thanks for your feedback! We will investigate and update as appropriate.

kohithms commented 1 year ago

Hi @MarileeTurscak-MSFT / @Karishma-Tiwari-MSFT, any update on this issue?

amgmdz commented 1 year ago

Hello @kohithms! I have applied a temporary workaround that solves the problem. In the agent base image, I installed the OpenShift CLI (oc) and added the following line at the end of the start.sh script, using the token of the default service account:

oc login --token="$TOKEN" --server="$APIAKS" && kubectl delete pod "$HOSTNAME" -n "$NAMESPACE"

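At the end of each pipeline run, this deletes the pod, and Kubernetes creates a new one in its place. Because the replacement pod starts with no restart history, it does not inherit the crash-loop back-off.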
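The same self-deletion can likely be done with kubectl alone, using the service account credentials Kubernetes mounts into the pod, so installing oc is not strictly needed. The sketch below assumes the token is automounted at the standard path, that the pod's hostname is its pod name (the default), and that the service account has been granted permission to delete pods; the function and role names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper and role names are illustrative.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount

# One-time setup by a cluster admin: let a service account delete pods
# in its own namespace (the default service account cannot by default).
grant_self_delete() {
  local ns=$1 sa=${2:-default}
  kubectl create role pod-self-delete --verb=delete --resource=pods -n "$ns"
  kubectl create rolebinding pod-self-delete --role=pod-self-delete \
    --serviceaccount="$ns:$sa" -n "$ns"
}

# Last step of start.sh, run inside the agent pod.
delete_own_pod() {
  local ns token
  ns=$(cat "$SA_DIR/namespace")
  token=$(cat "$SA_DIR/token")
  # By default a pod's hostname is its pod name, so $HOSTNAME identifies it.
  # --wait=false returns immediately instead of blocking until the pod is gone.
  kubectl --server=https://kubernetes.default.svc \
    --certificate-authority="$SA_DIR/ca.crt" \
    --token="$token" \
    delete pod "$HOSTNAME" -n "$ns" --wait=false
}
```

Reading the namespace from the mounted file also removes the need to pass NAMESPACE into the container.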

ccoder83 commented 1 year ago

@amgmdz That seems like quite a neat solution that could work for me too, as I'm facing similar issues.

I have one question: why did you use the OpenShift CLI rather than the Azure CLI? As I understand it, you are running ADO agents hosted on AKS clusters.