Pod crash - Githubissues

nemanjajanojkic commented 2 years ago

After a job succeeds or fails, the pod running the VSTS agent goes into a crash loop.

My OpenShift Version: 4.8.35

VSTS agent version: 2.204.0

I also added the PowerShell module to the container: dnf install -y https://github.com/PowerShell/PowerShell/releases/download/v7.2.5/powershell-lts-7.2.5-1.rh.x86_64.rpm

Exception:
Scanning for tool capabilities.
Connecting to the server.
Successfully replaced the agent
Testing agent connection.
2022-07-01 08:10:28Z: Settings Saved.
Starting Agent listener interactively
Started listener process
Started running service
Scanning for tool capabilities.
Connecting to the server.
2022-07-01 08:10:30Z: Listening for Jobs
2022-07-01 08:13:07Z: Running job: Agent job
2022-07-01 08:13:22Z: Job Agent job completed with result: Succeeded
Agent listener exited with error code 0
Agent listener exit with 0 return code, stop the service, no retry needed.

ygirouardstm commented 9 months ago

This is not a crash. The agent is running in "once" mode, so it exits after each job. Just edit the Dockerfile and change the ENTRYPOINT to remove the --once switch. I don't know why it's configured like that, since it means the pod always restarts after a job completes. It makes no sense to do that.
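As a rough sketch of what that change could look like, the entrypoint script below makes the --once switch optional instead of hard-coded. The thread does not show the repository's actual ENTRYPOINT, so everything here is an assumption: AGENT_RUN_ONCE is a hypothetical variable name, and the script assumes the agent's run.sh sits in the working directory.

```shell
#!/bin/sh
# Hypothetical entrypoint helper. AGENT_RUN_ONCE is an illustrative variable
# name, not something the repository is known to define.

# Print the extra arguments to pass to the agent's run.sh.
agent_run_args() {
  if [ "${AGENT_RUN_ONCE:-false}" = "true" ]; then
    # One-shot mode: the listener runs a single job and then exits with 0.
    printf '%s\n' "--once"
  fi
  # Persistent mode (default): no extra arguments, so the listener keeps
  # polling the pool for jobs.
}

# Start the listener only when run.sh is present (assumed: inside the agent image).
if [ -x ./run.sh ]; then
  # Word splitting is intended here so that an empty result adds no argument.
  exec ./run.sh $(agent_run_args)
fi
```

With AGENT_RUN_ONCE unset, the listener stays up between jobs. Setting it to "true" restores the one-job-per-container behavior described in the original report.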

bfarr-rh commented 9 months ago

I need to dig into it, but I remember I may have gone with --once to ensure the working directory on the container was not reused for different jobs.

ygirouardstm commented 9 months ago

> I need to dig into it, but i remember I may have gone with once to ensure the working directory on the container was not reused for different jobs

The Azure agent creates a new job directory for each job as far as I know. However, some of the capabilities that are installed for a specific job get reused by other jobs, and I think that's a good thing: it speeds up the subsequent jobs. That is why you can configure agent pool maintenance from Azure DevOps (in the org's agent pool settings). Why would you not want this exactly?

Running the agent with the --once switch means that it becomes unavailable for a short time after every job, because it needs to reconnect to the pool. It also means that the agent token (the PAT), which has a max life of 364 days (default is 30), is used on every restart, so there are more chances that the agent will fail to start, for example because the PAT has expired.

It's probably better to leave the agent running persistently and only restart it when maintenance or an update is needed.

bfarr-rh / azure-devops-ocp-agent

Pod crash #5