OctopusDeploy / OctopusTentacle

The secure, lightweight, cross-platform agent for Octopus Server which turns any computer into a worker or deployment target for automated deployments and operations runbooks.
https://octopus.com
Apache License 2.0

Unable to perform any tasks on a Kubernetes Agent #1000

Closed cpressland closed 1 month ago

cpressland commented 1 month ago

What happened?

When creating a Kubernetes Octopus Tentacle via either Helm or Flux, the agent initially connects and registers itself as a deployment target. After this initial registration, the agent refuses to perform any tasks. For example, when running a Health Check, the task sits on "Determining ScriptService version to use" for six minutes before timing out.

Reproduction

Install the Kubernetes Agent Helm chart at either version 1.16.1 or 2.2.1. The example below uses Flux, but the same thing happens when following the steps provided by the Octopus Deploy web UI.

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: OCIRepository
metadata:
  name: octopus-agent
  namespace: octopus-agent
spec:
  interval: 60m
  url: oci://docker.io/octopusdeploy/kubernetes-agent
  ref:
    tag: 2.2.1
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: octopus-agent
  namespace: octopus-agent
spec:
  interval: 10m
  releaseName: octopus-agent
  chartRef:
    kind: OCIRepository
    name: octopus-agent
  values:
    agent:
      acceptEula: "Y"
      name: ${target_name}
      serverUrl: https://octopus.<snipped>/
      serverCommsAddress: https://octopus.<snipped>:10943/
      space: Default
      serverApiKeySecretName: octopus-api-key
      deploymentTarget:
        enabled: "true"
        initial:
          environments: ["${target_environment}"] # Variable substitution replaces this
          tags: ["${target_role}"]  # Variable substitution replaces this
          tenantTags: ["${target_tenant_tag}"]  # Variable substitution replaces this
          tenants: ["${target_tenant}"]  # Variable substitution replaces this
          tenantedDeploymentParticipation: "TenantedOrUntenanted"
    persistence:
      storageClassName: azurefile

Error and Stacktrace

Determining ScriptService version to use 
The task was canceled 
The task was canceled Script execution was cancelled

System.OperationCanceledException: Script execution was cancelled

 ---> Halibut.Exceptions.ConnectingRequestCancelledException: The Request was cancelled while Connecting.

 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.

   at Nito.AsyncEx.TaskExtensions.DoWaitAsync(Task task, CancellationToken cancellationToken)

   at Halibut.ServiceModel.PendingRequestQueueAsync.PendingRequest.WaitForResponseToBeSet(Nullable`1 timeout, Boolean cancelTheRequestWhenTransferHasBegun, CancellationToken cancellationToken)

   --- End of inner exception stack trace ---

   at Halibut.ServiceModel.PendingRequestQueueAsync.PendingRequest.WaitForResponseToBeSet(Nullable`1 timeout, Boolean cancelTheRequestWhenTransferHasBegun, CancellationToken cancellationToken)

   at Halibut.ServiceModel.PendingRequestQueueAsync.PendingRequest.WaitUntilComplete(CancellationToken cancellationToken)

   at Halibut.ServiceModel.PendingRequestQueueAsync.PendingRequest.WaitUntilComplete(CancellationToken cancellationToken)

   at Halibut.ServiceModel.PendingRequestQueueAsync.QueueAndWaitAsync(RequestMessage request, CancellationToken cancellationToken)

   at Halibut.ServiceModel.PendingRequestQueueAsync.QueueAndWaitAsync(RequestMessage request, CancellationToken cancellationToken)

   at Halibut.HalibutRuntime.SendOutgoingPollingRequestAsync(RequestMessage request, CancellationToken cancellationToken)

   at Halibut.HalibutRuntime.SendOutgoingRequestAsync(RequestMessage request, MethodInfo methodInfo, CancellationToken cancellationToken)

   at Halibut.ServiceModel.HalibutProxyWithAsync.MakeRpcCall(MethodInfo asyncMethod, Object[] args)

   at Halibut.ServiceModel.HalibutProxyWithAsync.InvokeAsyncT[T](MethodInfo asyncMethod, Object[] args)

   at System.Reflection.AsyncDispatchProxyGenerator.InvokeAsync[T](Object[] args)

   at Octopus.Tentacle.Contracts.Capabilities.BackwardsCompatibleAsyncClientCapabilitiesV2Decorator.<>c__DisplayClass2_0.<<GetCapabilitiesAsync>b__0>d.MoveNext()

--- End of stack trace from previous location ---

   at Octopus.Tentacle.Contracts.Capabilities.BackwardsCompatibleCapabilitiesV2Decorator.WithBackwardsCompatabilityAsync(Func`1 inner)

   at Octopus.Tentacle.Contracts.Capabilities.BackwardsCompatibleAsyncClientCapabilitiesV2Decorator.GetCapabilitiesAsync(HalibutProxyRequestOptions halibutProxyRequestOptions)

   at Octopus.Tentacle.Client.Scripts.ScriptOrchestratorFactory.<DetermineScriptServiceVersionToUse>g__GetCapabilitiesFunc|15_0(CancellationToken ct)

   at Octopus.Tentacle.Client.Execution.RpcCallExecutor.<>c__DisplayClass7_0`1.<<ExecuteWithRetries>b__0>d.MoveNext()

--- End of stack trace from previous location ---

   at Octopus.Tentacle.Client.Retries.RpcCallRetryHandler.<>c__DisplayClass9_0`1.<<ExecuteWithRetries>g__ExecuteAction|2>d.MoveNext()

--- End of stack trace from previous location ---

   at Polly.Retry.AsyncRetryEngine.ImplementationAsync[TResult](Func`3 action, Context context, CancellationToken cancellationToken, ExceptionPredicates shouldRetryExceptionPredicates, ResultPredicates`1 shouldRetryResultPredicates, Func`5 onRetryAsync, Int32 permittedRetryCount, IEnumerable`1 sleepDurationsEnumerable, Func`4 sleepDurationProvider, Boolean continueOnCapturedContext)

   at Polly.AsyncPolicy.ExecuteAsync[TResult](Func`3 action, Context context, CancellationToken cancellationToken, Boolean continueOnCapturedContext)

   at Octopus.Tentacle.Client.Retries.RpcCallRetryHandler.ExecuteWithRetries[T](Func`2 action, OnRetryAction onRetryAction, OnTimeoutAction onTimeoutAction, CancellationToken cancellationToken)

   at Octopus.Tentacle.Client.Execution.RpcCallExecutor.ExecuteWithRetries[T](RpcCall rpcCall, Func`2 action, Action`1 onErrorAction, ITentacleClientTaskLog logger, ClientOperationMetricsBuilder clientOperationMetricsBuilder, CancellationToken cancellationToken)

   at Octopus.Tentacle.Client.Execution.RpcCallExecutor.Execute[T](Boolean retriesEnabled, RpcCall rpcCall, Func`2 action, Action`1 onErrorAction, ITentacleClientTaskLog logger, ClientOperationMetricsBuilder clientOperationMetricsBuilder, CancellationToken cancellationToken)

   at Octopus.Tentacle.Client.Execution.RpcCallExecutor.Execute[T](Boolean retriesEnabled, RpcCall rpcCall, Func`2 action, ITentacleClientTaskLog logger, ClientOperationMetricsBuilder clientOperationMetricsBuilder, CancellationToken cancellationToken)

   at Octopus.Tentacle.Client.Scripts.ScriptOrchestratorFactory.DetermineScriptServiceVersionToUse(CancellationToken cancellationToken)

   at Octopus.Tentacle.Client.Scripts.ScriptOrchestratorFactory.CreateOrchestrator(CancellationToken cancellationToken)

   --- End of inner exception stack trace ---

   at Octopus.Tentacle.Client.Scripts.ScriptOrchestratorFactory.CreateOrchestrator(CancellationToken cancellationToken)

   at Octopus.Tentacle.Client.TentacleClient.ExecuteScript(ExecuteScriptCommand executeScriptCommand, OnScriptStatusResponseReceived onScriptStatusResponseReceived, OnScriptCompleted onScriptCompleted, ITentacleClientTaskLog logger, CancellationToken scriptExecutionCancellationToken)

   at Octopus.Server.Orchestration.Targets.Tentacles.TentacleRemoteEndpointFacade.ExecuteCommand(ExecuteScriptCommand startScriptCommand, ITaskLog taskLog, CancellationToken cancellationToken) in ./source/Octopus.Server/Orchestration/Targets/Tentacles/TentacleRemoteEndpointFacade.cs:line 37

   at Octopus.Server.Orchestration.Targets.Common.RemoteEndpointFacadeCancellationTokenDecorator.ExecuteCommand(ExecuteScriptCommand command, ITaskLog taskLog, CancellationToken cancellationToken) in ./source/Octopus.Server/Orchestration/Targets/Common/RemoteEndpointFacadeCancellationTokenDecorator.cs:line 36

   at Octopus.Server.Orchestration.Targets.Tentacles.Observability.ErrorLoggingRemoteEndpointFacadeDecorator.ExecuteCommand(ExecuteScriptCommand command, ITaskLog taskLog, CancellationToken cancellationToken) in ./source/Octopus.Server/Orchestration/Targets/Tentacles/Observability/ErrorLoggingRemoteEndpointFacadeDecorator.cs:line 57

   at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionExecution.Immediate.ExecutionTargets.TentacleExecutionTarget.Execute(ExecuteScriptCommandBuilder commandBuilder, ScriptCollection bootstrapScripts, Boolean isRaw, ITaskLog taskLog, CancellationToken cancellationToken) in ./source/Octopus.Server/Orchestration/ServerTasks/Deploy/ActionExecution/Immediate/ExecutionTargets/TentacleExecutionTarget.cs:line 51

   at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionExecution.Immediate.ImmediateExecutor.ExecuteRawScript(ScriptCollection scripts, IReadOnlyList`1 files, Boolean isRaw, ITaskLog taskLog, CancellationToken cancellationToken, Nullable`1 isolationMutexTimeout, ExecutionIsolation isolationLevel) in ./source/Octopus.Server/Orchestration/ServerTasks/Deploy/ActionExecution/Immediate/ImmediateExecutor.cs:line 126

   at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionExecution.CommandBuilders.RawShellCommandBuilder.Execute(ITaskLog taskLog, CancellationToken cancellationToken) in ./source/Octopus.Server/Orchestration/ServerTasks/Deploy/ActionExecution/CommandBuilders/RawShellCommandBuilder.cs:line 36

   at Octopus.Server.Orchestration.ServerTasks.HealthCheck.Providers.MachineHealthProvider.<>c__DisplayClass19_0.<<ExecutePreCheckScript>b__0>d.MoveNext() in ./source/Octopus.Server/Orchestration/ServerTasks/HealthCheck/Providers/MachineHealthProvider.cs:line 139

--- End of stack trace from previous location ---

   at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionDispatch.AdHocActionDispatcher.InvokeActionHandler(Machine target, ActionHandlerInvocation actionHandler, ActionAndTargetScopedVariables actionAndTargetScopedVariables, IExecutor executor, TargetManifest targetManifest, ITaskLog taskLog, CancellationToken cancellationToken) in ./source/Octopus.Server/Orchestration/ServerTasks/Deploy/ActionDispatch/AdHocActionDispatcher.cs:line 240

   at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionDispatch.AdHocActionDispatcher.ExecuteOnDeploymentTarget(DeploymentTarget deploymentTarget, ActionHandlerInvocation actionHandler, TargetManifest targetManifest, ITaskLog taskLog, CancellationToken cancellationToken) in ./source/Octopus.Server/Orchestration/ServerTasks/Deploy/ActionDispatch/AdHocActionDispatcher.cs:line 162

   at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionDispatch.AdHocActionDispatcher.Dispatch(Machine machine, ActionHandlerInvocation actionHandler, ITaskLog taskLog, CancellationToken cancellationToken, VariableCollection variables) in ./source/Octopus.Server/Orchestration/ServerTasks/Deploy/ActionDispatch/AdHocActionDispatcher.cs:line 68

   at Octopus.Server.Orchestration.ServerTasks.HealthCheck.Providers.MachineHealthProvider.ExecutePreCheckScript(Machine machine, TimeSpan cancellationTimeout, CancellationToken linkedCts) in ./source/Octopus.Server/Orchestration/ServerTasks/HealthCheck/Providers/MachineHealthProvider.cs:line 137

   at Octopus.Server.Orchestration.ServerTasks.HealthCheck.Providers.MachineHealthProvider.PerformHealthCheckAndSaveResult(Machine machine, CancellationToken cancellationToken, ExceptionHandling exceptionHandling) in ./source/Octopus.Server/Orchestration/ServerTasks/HealthCheck/Providers/MachineHealthProvider.cs:line 89

Octopus.Server version 2024.2.9274 (2024.2.9274)
Recording health check results 

Pod Logs:

$ kubectl logs octopus-agent-tentacle-5d5b585c76-4qncc -f
Rehashing SSL/TLS certificates
rehash: warning: skipping ca-certificates.crt,it does not contain exactly one certificate or CRL
Checking registration status...
Tentacle is already configured and registered with server.
2024-09-09 14:39:44.3626 | DEBUG | Machine configuration home directory is /etc/octopus
2024-09-09 14:39:44.3755 | DEBUG | Loading configuration from ConfigMap for namespace octopus-agent
Disabling polling proxy
2024-09-09 14:39:45.2912 | DEBUG | Machine configuration home directory is /etc/octopus
2024-09-09 14:39:45.3056 | DEBUG | Loading configuration from ConfigMap for namespace octopus-agent
2024-09-09 14:39:45.7612 |  INFO | Proxy use is disabled
===============================================
Starting Octopus Deploy Kubernetes Tentacle
===============================================
2024-09-09 14:39:46.1646 | DEBUG | Machine configuration home directory is /etc/octopus
2024-09-09 14:39:46.1777 | DEBUG | Loading configuration from ConfigMap for namespace octopus-agent
2024-09-09 14:39:46.6331 | DEBUG | Loading certificate with thumbprint: E7363A5076C2C5761CC708528AB274ED57F4C258
2024-09-09 14:39:46.6458 | DEBUG | Loading certificate from given string
2024-09-09 14:39:46.6971 | DEBUG | Loading certificate with thumbprint: E7363A5076C2C5761CC708528AB274ED57F4C258
2024-09-09 14:39:46.7195 | DEBUG | Certificate was found in store
2024-09-09 14:39:46.7327 | DEBUG | Loading certificate with thumbprint: E7363A5076C2C5761CC708528AB274ED57F4C258
2024-09-09 14:39:46.7507 | DEBUG | Certificate was found in store
2024-09-09 14:39:46.7841 | DEBUG | Loading certificate with thumbprint: E7363A5076C2C5761CC708528AB274ED57F4C258
2024-09-09 14:39:46.8097 | DEBUG | Certificate was found in store
2024-09-09 14:39:46.9654 |  INFO | Agent will trust Octopus Servers with the thumbprint: C23BD667F4B5B9EF5233FE34C5E8D1054EC8FFEE
2024-09-09 14:39:46.9818 |  INFO | Agent will poll Octopus Server at https://octopus.<snipped>:10943/ for subscription poll://l84vsys9r9euqc6h3psb/ expecting thumbprint C23BD667F4B5B9EF5233FE34C5E8D1054EC8FFEE
2024-09-09 14:39:46.9955 |  INFO | Agent will not use a proxy server
2024-09-09 14:39:47.0092 |  INFO | Requested polling connection count: 5
2024-09-09 14:39:47.0208 |  INFO | Starting 5 polling connections
2024-09-09 14:39:47.0372 |  INFO | Agent will not listen on any TCP ports
2024-09-09 14:39:47.0713 |  INFO | WorkspaceCleanerTask.Start(): Starting
2024-09-09 14:39:47.0823 | DEBUG | https://octopus.<snipped>:10943/    8  Opening a new connection to https://octopus.<snipped>:10943/
2024-09-09 14:39:47.0823 | DEBUG | https://octopus.<snipped>:10943/    9  Opening a new connection to https://octopus.<snipped>:10943/
2024-09-09 14:39:47.0823 | DEBUG | https://octopus.<snipped>:10943/   12  Opening a new connection to https://octopus.<snipped>:10943/
2024-09-09 14:39:47.0823 | DEBUG | https://octopus.<snipped>:10943/    6  Opening a new connection to https://octopus.<snipped>:10943/
2024-09-09 14:39:47.0823 | DEBUG | https://octopus.<snipped>:10943/   11  Opening a new connection to https://octopus.<snipped>:10943/
2024-09-09 14:39:47.0862 |  INFO | KubernetesPodMonitorTask.Start(): Starting
2024-09-09 14:39:47.1265 | DEBUG | Cleaning workspaces older than 09/08/2024 14:39
2024-09-09 14:39:47.1827 |  INFO | https://octopus.<snipped>:10943/   12  Secure connection established. Server at [::ffff:10.40.5.6]:10943 identified by thumbprint: C23BD667F4B5B9EF5233FE34C5E8D1054EC8FFEE, using protocol Tls12
2024-09-09 14:39:47.1827 |  INFO | https://octopus.<snipped>:10943/    6  Secure connection established. Server at [::ffff:10.40.5.6]:10943 identified by thumbprint: C23BD667F4B5B9EF5233FE34C5E8D1054EC8FFEE, using protocol Tls12
2024-09-09 14:39:47.1827 |  INFO | https://octopus.<snipped>:10943/   11  Secure connection established. Server at [::ffff:10.40.5.6]:10943 identified by thumbprint: C23BD667F4B5B9EF5233FE34C5E8D1054EC8FFEE, using protocol Tls12
2024-09-09 14:39:47.1827 |  INFO | https://octopus.<snipped>:10943/    8  Secure connection established. Server at [::ffff:10.40.5.6]:10943 identified by thumbprint: C23BD667F4B5B9EF5233FE34C5E8D1054EC8FFEE, using protocol Tls12
2024-09-09 14:39:47.1881 |  INFO | KubernetesOrphanedPodCleanerTask.Start(): Starting
2024-09-09 14:39:47.2112 | DEBUG | No workspaces found that need to be deleted.
2024-09-09 14:39:47.2455 |  INFO | https://octopus.<snipped>:10943/    6  Secure connection established. Server at [::ffff:10.40.5.6]:10943 identified by thumbprint: C23BD667F4B5B9EF5233FE34C5E8D1054EC8FFEE, using protocol Tls12
2024-09-09 14:39:47.2497 | DEBUG | Loading pod statuses
2024-09-09 14:39:47.2925 |  INFO | KubernetesEventMonitorTask.Start(): Starting
2024-09-09 14:39:47.3458 |  INFO | Parsing kubernetes event list for namespace octopus-agent.
2024-09-09 14:39:47.3522 | DEBUG | Loaded 0 pod statuses in 00:00:00.0335342. ResourceVersion: 101091083
2024-09-09 14:39:47.4294 |  INFO | Found 23 events since last logged event 0001-01-01T00:00:00.0000000+00:00
2024-09-09 14:39:47.4464 |  INFO | Added 0 entries to agent metrics.
2024-09-09 14:40:47.2855 | DEBUG | OrphanedPodCleaner: Checking for orphaned pods
2024-09-09 14:40:47.3248 | DEBUG | OrphanedPodCleaner: No orphaned pods found
2024-09-09 14:40:47.3380 | DEBUG | OrphanedPodCleaner: Next check will happen at 2024-09-09T14:50:47.3380634+00:00

More Information

image

^^ This never finishes even when doing this from the UI.

Workaround

No response

APErebus commented 1 month ago

Hi @cpressland

Sorry that this isn't working for you. It looks like the agent is failing to communicate with Octopus Server using our RPC protocol.

Am I correct in assuming you are self-hosting Octopus Server?

If so, is port 10943 open on the Octopus Server machine and accessible from the cluster?

cpressland commented 1 month ago

Hi @APErebus - we got to the bottom of this after a week or so of hacking and trying different things.

  1. We have multiple Octopus Servers in an HA configuration. We were using the Helm chart's agent.serverCommsAddress to point to an Azure Load Balancer; we swapped to agent.serverCommsAddresses and now hit each server directly.
  2. When we switched from serverCommsAddress to serverCommsAddresses, the agent went from intermittent to completely non-functional. It turned out that the agent only configures trust for remote servers on its first launch. If you change this list after deployment, it silently ignores those "new" servers and just sits there doing nothing, forever. We solved this by uninstalling and reinstalling the agent each time we had to make a config change.
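
For anyone landing here with a similar HA setup, the change in point 1 might look roughly like the values fragment below. This is only a sketch: the hostnames are hypothetical placeholders, and it assumes agent.serverCommsAddresses accepts a list of polling addresses, one per Octopus Server node. Check the chart's values for your version before relying on it.

```yaml
# Sketch only: hostnames are hypothetical placeholders, not taken from this issue.
agent:
  serverUrl: https://octopus.example.com/
  # Assumption: a list with one polling address per Octopus Server node,
  # replacing the single load-balanced serverCommsAddress.
  serverCommsAddresses:
    - https://octopus-node1.example.com:10943/
    - https://octopus-node2.example.com:10943/
```

As noted in point 2, changing this list on an already-registered agent was not picked up in our case, so we applied it with a fresh install rather than an upgrade.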

I think I should raise a separate issue to get logging improved for this failure condition; otherwise, I'll go ahead and close this off.