Scalr / agent-helm

Helm chart to install "scalr-agent" for connecting self-hosted runners and VCS to Scalr TACO
https://scalr.github.io/agent-helm/

k8s, Docker runner (agent) not working under M1 Mac Mini #56

Closed · cpilson closed this issue 10 months ago

cpilson commented 10 months ago

Hello.

I have a Terraform plan that runs to what looks like completion, even printing outputs, and then suddenly fails.

The k8s log follows (the agent was installed via helm install --set agent.token=$SCALR_TOKEN --set agent.url=$SCALR_URL scalr-agent-helm/agent-k8s --generate-name). I'm left wondering whether this is an architecture issue, since I'm on an M1 CPU, which is ARM.

2024-01-21 03:48:15 {"logger_name": "tacoagent.worker.task.agent_task", "event": "Failed to perform terraform plan. Unexpected exit code: 1.", "level": "info", "timestamp": "2024-01-21 10:48:15.510780", "account_id": "acc-v0o69ln0se66iXXXX", "run_id": "run-v0o7n4rhcq04gp9q1", "workspace_id": "ws-v0o7mc9446otjXXXX", "container_id": "f858da19", "agent_id": "agent-v0o7n4pkpu8o9g45a", "name": "terraform.plan.v2", "plan_id": "plan-v0o7n4rhdmp7s70jn", "id": "atask-v0o7n4rhqeonnv8t8", "thread_name": "Worker-1"}
2024-01-21 03:48:15 {"logger_name": "tacoagent.worker.task.agent_task", "previous_status": "in_progress", "code": "error", "event": "The agent task is errored.", "level": "info", "timestamp": "2024-01-21 10:48:15.747213", "account_id": "acc-v0o69ln0se66iXXXX", "run_id": "run-v0o7n4rhcq04gp9q1", "workspace_id": "ws-v0o7mc9446otjXXXX", "container_id": "f858da19", "agent_id": "agent-v0o7n4pkpu8o9g45a", "name": "terraform.plan.v2", "plan_id": "plan-v0o7n4rhdmp7s70jn", "id": "atask-v0o7n4rhqeonnv8t8", "thread_name": "Worker-1"}
2024-01-21 03:48:17 {"event": "terraform.plan.v2[atask-v0o7n4rhqeonnv8t8] executed in 83.407s", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 10:48:17.657362", "thread_name": "Worker-1"}
2024-01-21 04:01:01 {"event": "Executing terraform.gc.containers[648b5e69-a08c-4597-ab15-4cc33eedcc53]", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.164559", "thread_name": "Worker-2"}
2024-01-21 04:01:01 {"event": "terraform.gc.containers[648b5e69-a08c-4597-ab15-4cc33eedcc53] executed in 0.000s", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.165837", "thread_name": "Worker-2"}
2024-01-21 04:01:01 {"event": "Executing terraform.gc.plugins[05d84b5e-e051-4530-bc97-ebcae0468502]", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.167133", "thread_name": "Worker-2"}
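
For anyone debugging the ARM angle, one way to check which architecture the agent is actually scheduled on is below; the label selector is an assumption about what the chart sets, so adjust it to match your release:

# Architecture reported by each node (arm64 on Apple-silicon Docker Desktop / kind):
kubectl get nodes -o custom-columns='NAME:.metadata.name,ARCH:.status.nodeInfo.architecture'

# Image the agent pod is actually running, to compare against the platforms
# published for it (label selector is a guess, not confirmed from the chart):
kubectl get pods -l app.kubernetes.io/name=agent-k8s \
  -o jsonpath='{.items[*].spec.containers[*].image}'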

In the UI, here's what I see:

Plan: 75 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + account_id = "38149XXXXXXX"

Plan operation failed

Error: Failed to perform terraform plan. Unexpected exit code: 1
cpilson commented 10 months ago

Update: same result on an EC2 t2.medium runner. It connects, plans, and then errors with the same output as above.

Memory on the 4 GB instance peaked at about 46% utilization, so the container shouldn't have blown past a memory limit.
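
To rule out an OOM kill on the EC2 runner, standard Docker and kernel checks should tell the story; nothing Scalr-specific here, and <container_id> is a placeholder:

# Live memory usage of the agent container:
docker stats --no-stream

# Did the kernel OOM-killer fire? (A kill can look like a sudden failure
# even when average utilization stays low.)
dmesg -T | grep -iE 'oom|killed process'

# If the container already exited, inspect records an OOM kill explicitly:
docker inspect <container_id> --format '{{.State.OOMKilled}} {{.State.ExitCode}}'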

emocharnik commented 10 months ago

Hi, @cpilson. Have you resolved the issue?

cpilson commented 10 months ago

Well, it feels like a scenario where the logging is insufficient, but I haven't checked whether the agent binary has a debug/trace flag for more verbose logging.

The issue resolved once I moved away from using local_file to copy files (which holds their contents in RAM), but I'm still trying to understand how 896K of files could make things collapse on an EC2 runner with 4 GB of RAM (a sketch of the pattern is at the end of this comment).

So, no, not fully. I'm also not sure how much insight the agent binary could even have into the state of a container, whether that monitoring responsibility lies with the agent at all, or whether anything can be done here.
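
For context, the local_file pattern I dropped looked roughly like this; the paths and resource names are illustrative, not my real config:

# local_file keeps each file's full content in the plan/state, so dozens of
# these can inflate plan memory well beyond the files' on-disk size.
resource "local_file" "copies" {
  for_each = fileset("${path.module}/payload", "**")      # illustrative path
  content  = file("${path.module}/payload/${each.value}")
  filename = "${path.module}/out/${each.value}"
}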