Closed: loeffel-io closed this issue 1 year ago
My fault, everything works as expected 👍
Hey @loeffel-io! Glad you got it to work :-) Was there anything we could improve in the docs that would've helped? What was the issue?
@tgolsson
It was just the wrong SSH key ...
Right now I'm struggling with the Docker image - you require jsonnet and base32 as deps, but there is no working buildkite-agent Docker image which provides those deps. The k8s-buildkite-agent image fails with:
{
"textPayload": "/entrypoint.sh: line 15: /usr/local/bin/buildkite-agent: No such file or directory",
"insertId": "5zwu1zjp9oi4q4kz",
"resource": {
"type": "k8s_container",
"labels": {
"namespace_name": "buildkite",
"location": "us-central1",
"cluster_name": "buildkite-gke-production",
"pod_name": "buildkite-agent-845ff66dbc-h9pmm",
"container_name": "agent",
"project_id": "buildkite-374309"
}
},
"timestamp": "2023-01-31T10:20:17.669023321Z",
"severity": "ERROR",
"labels": {
"k8s-pod/release": "buildkite",
"k8s-pod/pod-template-hash": "845ff66dbc",
"k8s-pod/app": "agent",
"compute.googleapis.com/resource_name": "gke-buildkite-gke-pr-buildkite-gke-no-de5b500c-qd9d"
},
"logName": "projects/buildkite-374309/logs/stderr",
"receiveTimestamp": "2023-01-31T10:20:18.868420197Z"
}
So I think the initial hurdle is way too big - I don't want to maintain my own Buildkite agent Docker image.
After some research, I have no clue where this is running:
~~~ Preparing plugins
# Plugin "github.com/EmbarkStudios/k8s-buildkite-plugin" already checked out (0e13cac)
~~~ Preparing working directory
$ cd /buildkite/builds/buildkite-agent-54b457fdd7-7k7rz-1/mindful/global-base
# Host "github.com" already in list of known hosts at "/root/.ssh/known_hosts"
$ git remote set-url origin git@github.com:mindful-hq/global-base.git
$ git clean -ffxdq
$ git fetch -v --prune -- origin 56aba901dfe4973ddad928a3e4910a0df572c814
From github.com:mindful-hq/global-base
* branch 56aba901dfe4973ddad928a3e4910a0df572c814 -> FETCH_HEAD
$ git checkout -f 56aba901dfe4973ddad928a3e4910a0df572c814
HEAD is now at 56aba90 test: buildkite
# Cleaning again to catch any post-checkout changes
$ git clean -ffxdq
# Checking to see if Git data needs to be sent to Buildkite
$ buildkite-agent meta-data exists buildkite:git:commit
~~~ Running plugin k8s command hook
$ /buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin/hooks/command
/buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin/hooks/command: line 16: base32: command not found
--- :kubernetes: Starting Kubernetes Job
/buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin/hooks/command: line 93: jsonnet: command not found
🚨 Error: The command exited with status 127
^^^ +++
^^^ +++
~~~ Running plugin k8s pre-exit hook
$ /buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin/hooks/pre-exit
--- :kubernetes: Cleanup
$ cd /buildkite/builds/buildkite-agent-54b457fdd7-7k7rz-1/mindful/global-base
Is it running in my buildkite-agent?
Is it running in the init image (https://github.com/EmbarkStudios/k8s-buildkite-plugin/blob/master/lib/job.jsonnet#L31)?
Is it running in my step image?
I can't find any information about that
Looks like this belongs to issue https://github.com/EmbarkStudios/k8s-buildkite-plugin/issues/10, and since I am using the Buildkite chart this is running in my buildkite-agent image.
So yeah, there are three phases to the job.
As you found in the other issue: yes, you need a modified Docker image - the base Buildkite one doesn't have jsonnet or the other tools we need. I'm not sure if anything has changed since that issue or if the base image we publish would work now. The ones we run internally have a lot more tools for things that run without the plugin, e.g. C++ compilers. I'd try overriding the image in the chart with the one from here: https://hub.docker.com/r/embarkstudios/k8s-buildkite-agent.
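If it helps, here's a rough sketch of that override via the Helm CLI - the release name, chart reference, and value keys are assumptions based on typical chart conventions, not checked against the Buildkite chart's values.yaml:

```sh
# Sketch only - adjust release name, chart reference, and value paths to your setup.
helm upgrade buildkite buildkite/agent --namespace buildkite \
  --reuse-values \
  --set image.repository=embarkstudios/k8s-buildkite-agent \
  --set image.tag=latest
```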
Thank you @tgolsson 🙏
The job then runs the init container (which is also buildkite agent) to set up the general build workspace as a regular Buildkite agent would. https://buildkite.com/docs/agent/v3/cli-bootstrap
I am pretty sure that this is run by https://hub.docker.com/r/embarkstudios/k8s-buildkite-agent here: https://github.com/EmbarkStudios/k8s-buildkite-plugin/blob/master/lib/job.jsonnet#L31 - isn't it?
I'd try overriding the image in the chart with the one from here: https://hub.docker.com/r/embarkstudios/k8s-buildkite-agent
I already tried that for your point 1, which generates the errors mentioned above: "textPayload": "/entrypoint.sh: line 15: /usr/local/bin/buildkite-agent: No such file or directory"
I'm still wondering which of those steps produces the above error message line 93: jsonnet: command not found?
Thank you very much 🙏
I am pretty sure that this is run by https://hub.docker.com/r/embarkstudios/k8s-buildkite-agent here: https://github.com/EmbarkStudios/k8s-buildkite-plugin/blob/master/lib/job.jsonnet#L31 - isn't it?
Yepp! It runs buildkite-agent inside that container.
I already tried that for your point 1, which generates the errors mentioned above: "textPayload": "/entrypoint.sh: line 15: /usr/local/bin/buildkite-agent: No such file or directory"
This seems like a build bug - the published image is incomplete 😱. If I build it locally it does have buildkite-agent in there. If you run the much older 1.2.0 image that one has the file as well (it might be broken by age now though - API versions etc). Will investigate!
I'm still wondering which of those steps produces the above error message line 93: jsonnet: command not found?
The jsonnet library is used during the first step, when we generate the job.
Amazing, thank you very much for all that information 🙏
@loeffel-io I've pushed a new latest image, feel free to try that. I've validated that it has the correct binaries.
sha256:1d88791315ed6b0b49a64055bc71c5a9a0b1953e387f99d25299ed06ccea5dbd is the SHA for the fixed one.
@tgolsson great, thanks!
I also bumped the k8s init image: https://github.com/EmbarkStudios/k8s-buildkite-plugin/pull/58
Thanks, and new release done!
Great work @tgolsson!
One last thing: shouldn't we maybe bump the versions in the Dockerfile? https://github.com/EmbarkStudios/k8s-buildkite-plugin/blob/master/Dockerfile
The buildkite agent version itself is 2 years old: https://hub.docker.com/layers/buildkite/agent/3.29.0/images/sha256-5c7d788323b084affed6ee2d6a73e8cff9ff2714af327648ae7c8c99aba32487?context=explore
⚠️ The image is not working:
{
"textPayload": "Use \"buildkite-agent <command> --help\" for more information about a command.",
"insertId": "7c7p1tfaixqptibl",
"resource": {
"type": "k8s_container",
"labels": {
"pod_name": "buildkite-agent-845ff66dbc-d86rt",
"container_name": "agent",
"location": "us-central1",
"namespace_name": "buildkite",
"project_id": "buildkite-374309",
"cluster_name": "buildkite-gke-production"
}
},
"timestamp": "2023-02-01T09:52:29.997493522Z",
"severity": "INFO",
"labels": {
"k8s-pod/pod-template-hash": "845ff66dbc",
"compute.googleapis.com/resource_name": "gke-buildkite-gke-pr-buildkite-gke-no-de5b500c-o6ww",
"k8s-pod/release": "buildkite",
"k8s-pod/app": "agent"
},
"logName": "projects/buildkite-374309/logs/stdout",
"receiveTimestamp": "2023-02-01T09:52:32.304224852Z"
}
I'm generally hesitant to bump for the sake of bumping - it leads to churn and potential disruption. But yeah, maybe 2 years old is a bit old... I'm just worried about breaking changes then. If you want to PR a bump (maybe for all tools?) we can see how much has changed.
Hmm, odd. Weird that it doesn't say what it fails to do. Is this during setup, node-boot, ..?
This happens when I start the Buildkite Helm chart with the new image - it was the standard buildkite/agent image before.
Right! So I think that happens because we override the entrypoint in the init-container image, and the helm chart relies on whatever is baked into the buildkite-agent image.
OK; it looks like there's a special entrypoint that needs to run too when bootstrapping the node. I think maybe it'd make sense for this project to publish a base image that could work for the node too. I've got quite a bunch of things to do today, but the current Dockerfile is quite close to what's needed... just need to not build it into an alpine base.
Great - shouldn't it be easy to just add the jsonnet binary to the original buildkite/agent:3.x-alpine-k8s image (which includes kubectl)? Or is there more to do?
That does sound about right. There's a bunch of installs in the base one; some of them may be needed for jsonnet.
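For reference, a minimal sketch of that approach - the base tag, the go-jsonnet build, and the coreutils package for base32 are assumptions, not the project's published Dockerfile:

```Dockerfile
# Sketch: extend the stock agent image with the two tools the plugin's command hook needs.
# Build go-jsonnet in a throwaway stage.
FROM golang:1.20-alpine AS jsonnet
RUN go install github.com/google/go-jsonnet/cmd/jsonnet@latest

# "3.x" stands in for a concrete agent tag - pin a real one in practice.
FROM buildkite/agent:3.x-alpine-k8s
# coreutils provides base32 if the stock image's busybox doesn't ship it
RUN apk add --no-cache coreutils
COPY --from=jsonnet /go/bin/jsonnet /usr/local/bin/jsonnet
```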
I'll give that a try
Update: the current error message when running the self-made image:
RUNTIME ERROR: Field does not exist: BUILDKITE_BUILD_CREATOR_TEAMS
    /buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin/lib/job.jsonnet:117:28-61  object <anonymous>
    Field "build/creator-teams"
    Field "annotations"
    Field "metadata"
    During manifestation
This could belong to /buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin/hooks/command:
/buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin/hooks/command: line 16: base32: command not found
Update: base32 is fixed.
I have no clue how to fix the BUILDKITE_BUILD_CREATOR_TEAMS error - would love to get some help with that.
That should be set when setting up a job.
https://buildkite.com/docs/pipelines/environment-variables#BUILDKITE_BUILD_CREATOR_TEAMS. I believe you should see that in the step environment information. I'm not sure if that might be missing if you have no teams - for my latest build on Buildkite it lists the three teams I'm in there. Can you see it as well? We might need to guard that setting in case the triggering user has no teams.
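In case it's useful while a guard lands in the plugin, a hypothetical workaround sketch - this assumes the jsonnet template reads the value from the step environment, which I haven't verified:

```sh
# Hypothetical workaround, not the plugin's fix: an agent "environment" hook that
# defaults the variable to an empty string so the field always exists.
# hooks/environment
export BUILDKITE_BUILD_CREATOR_TEAMS="${BUILDKITE_BUILD_CREATOR_TEAMS:-}"
```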
Nope, it's not there:
CI="true"
BUILDKITE="true"
BUILDKITE_ORGANIZATION_SLUG="mindful"
BUILDKITE_PIPELINE_SLUG="global-base"
BUILDKITE_PIPELINE_NAME="global-base"
BUILDKITE_PIPELINE_ID="018606c8-d6d0-472c-a946-232e5160058f"
BUILDKITE_PIPELINE_PROVIDER="github"
BUILDKITE_PIPELINE_DEFAULT_BRANCH="master"
BUILDKITE_REPO="git@github.com:mindful-hq/global-base.git"
BUILDKITE_BUILD_ID="01860eae-c4c4-457a-a880-48edaae60705"
BUILDKITE_BUILD_NUMBER="27"
BUILDKITE_BUILD_URL="https://buildkite.com/mindful/global-base/builds/27"
BUILDKITE_BRANCH="main"
BUILDKITE_TAG=""
BUILDKITE_COMMIT="52db6228b519667d1185f581cc14b8e29e164fe9"
BUILDKITE_MESSAGE="test: buildkite"
BUILDKITE_SOURCE="webhook"
BUILDKITE_BUILD_AUTHOR="Lucas Löffel"
BUILDKITE_BUILD_AUTHOR_EMAIL="lucas@loeffel.io"
BUILDKITE_BUILD_CREATOR="Lucas Löffel"
BUILDKITE_BUILD_CREATOR_EMAIL="lucas@loeffel.io"
BUILDKITE_REBUILT_FROM_BUILD_ID=""
BUILDKITE_REBUILT_FROM_BUILD_NUMBER=""
BUILDKITE_PULL_REQUEST="false"
BUILDKITE_PULL_REQUEST_BASE_BRANCH=""
BUILDKITE_PULL_REQUEST_REPO=""
BUILDKITE_TRIGGERED_FROM_BUILD_ID=""
BUILDKITE_TRIGGERED_FROM_BUILD_NUMBER=""
BUILDKITE_TRIGGERED_FROM_BUILD_PIPELINE_SLUG=""
BUILDKITE_JOB_ID="01860eb0-6d03-4617-96bb-444d4a961f87"
BUILDKITE_LABEL="global"
BUILDKITE_COMMAND="bazel test --remote_cache= --google_credentials= //...
bazel build --remote_cache= --google_credentials= //..."
BUILDKITE_ARTIFACT_PATHS=""
BUILDKITE_RETRY_COUNT="0"
BUILDKITE_TIMEOUT="false"
BUILDKITE_STEP_KEY=""
BUILDKITE_STEP_ID="01860eb0-682f-4c03-8bcb-3a8e00bf880e"
BUILDKITE_PROJECT_SLUG="mindful/global-base"
BUILDKITE_PROJECT_PROVIDER="github"
BUILDKITE_SCRIPT_PATH="bazel test --remote_cache= --google_credentials= //...
bazel build --remote_cache= --google_credentials= //..."
BUILDKITE_AGENT_ID="01860ea8-eab9-42f4-9814-acfa820bbf69"
BUILDKITE_AGENT_NAME="buildkite-agent-5cd5ffd9cf-trbgl-1"
BUILDKITE_AGENT_META_DATA_QUEUE="default"
BUILDKITE_AGENT_META_DATA_ROLE="agent"
BUILDKITE_REPO_SSH_HOST="github.com"
BUILDKITE_PLUGINS="[{\"github.com/EmbarkStudios/k8s-buildkite-plugin#v1.2.15\":{\"image\":\"gcr.io/bazel-public/bazel:6.0.0\",\"shell\":[\"sh\",\"-e\",\"-c\"],\"entrypoint\":\"\",\"secret-name\":\"buildkite-agent\",\"git-ssh-secret-key\":\"agent-ssh\",\"service-account-name\":\"global-base-production\",\"agent-token-secret-key\":\"agent-token\"}}]"
There are no teams yet, btw - I really need to get that done. Could it be possible for you to fix that soon? Thank you so much for the information, it helped me a lot to understand the issue. For now I just created a team, and I'll create a bug ticket for this.
Interesting. It shouldn't be too hard to fix, will take a peek tomorrow.
amazing @tgolsson! 🙏
I think I've never had such a bad experience setting up a plugin.
I now get this error message:
/buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin-v1-3-0/hooks/command: line 169: BUILDKITE_PLUGIN_K8S_INIT_IMAGE: unbound variable
Do you have any idea, @tgolsson?
I'm sorry you feel that way - I noticed when I took over (as maintainer) that there hasn't been a full release of the actual plugin since 2021, so a bunch of code-rot has likely happened since then, along with some unpublished changes that break things, as with that error.
I've pushed a guard clause for that to the same branch you used before - new commit 44b05b2ef952c75809f7603e1b8607f57ac194ea.
With 44b05b2 I get this (I think that commit did not help):
# Cleaning again to catch any post-checkout changes
$ git clean -ffxdq
# Checking to see if Git data needs to be sent to Buildkite
$ buildkite-agent meta-data exists buildkite:git:commit
~~~ Running plugin k8s command hook
$ /buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin-44b05b2ef952c75809f7603e1b8607f57ac194ea/hooks/command
--- :kubernetes: Starting Kubernetes Job
job.batch/global-base-38-3kyz24ij created
Timeout: 36000s
--- :kubernetes: Running image: gcr.io/bazel-public/bazel:6.0.0
Pod is running: global-base-38-3kyz24ij-sgxvb
+++ :kubernetes: step container
--- :kubernetes: Job status: Failed
Warning: init container failed with exit code 1, this usually indicates plugin misconfiguration or infrastructure failure
🚨 Error: The command exited with status 1
^^^ +++
^^^ +++
user command error: The plugin k8s command hook exited with status 1
~~~ Running plugin k8s pre-exit hook
$ /buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin-44b05b2ef952c75809f7603e1b8607f57ac194ea/hooks/pre-exit
--- :kubernetes: Cleanup
With init-image: "embarkstudios/k8s-buildkite-agent@sha256:3c010d09915f3b39c2f8324af5f0aaf910a643e7d63607ee8d49653931b8b167" I get the following, and then it gets stuck endlessly on the bootstrap container:
# Cleaning again to catch any post-checkout changes
$ git clean -ffxdq
# Checking to see if Git data needs to be sent to Buildkite
$ buildkite-agent meta-data exists buildkite:git:commit
~~~ Running plugin k8s command hook
$ /buildkite/plugins/github-com-EmbarkStudios-k8s-buildkite-plugin-v1-3-0/hooks/command
--- :kubernetes: Starting Kubernetes Job
job.batch/global-base-39-u2mdyyfq created
Timeout: 36000s
--- :kubernetes: Running image: gcr.io/bazel-public/bazel:6.0.0
Pod is running: global-base-39-u2mdyyfq-ncq86
--- :kubernetes: bootstrap container
So setting init-image looks promising right now, but why does it get stuck at ':kubernetes: bootstrap container'? Maybe because my buildkite-agent image is not running your https://github.com/EmbarkStudios/k8s-buildkite-plugin/blob/1.2.15/entrypoint.sh file?
Update: I checked the GKE logs, and the container gets stuck with these logs:
So that's progress! I'm not sure why the init container would fail if not running an init image, that sounds like a bug and a good case for a self-test. I'll see if I can whip that up after lunch.
The ssh-key thing in the log sounds like a configuration error - I believe that can happen if you have newline issues at the end of the key. Either missing or one too many... (Edit: after some googling it looks like it's most commonly a missing newline at the end, because a lot of tools trim that.)
Important question, I think: does it require the private or the public key at this stage?
That should be the private key to match the public one you've given to GitHub.
Because it's the private key, which works fine one step earlier.
Tried it both trimmed and with a newline.
Maybe important: the private key is a base64-encoded Kubernetes secret.
That should be fine. Can you decode the key and validate that the newline is actually there? I know some tools might strip whitespace while encoding the key, especially if it's passed on the command line.
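Something like this would show it - the secret and key names are placeholders taken from the config later in this thread, not verified against your cluster:

```sh
# Decode the key from the Kubernetes secret and inspect the final byte;
# "0a" means the trailing newline is there.
kubectl -n buildkite get secret buildkite-agent -o jsonpath='{.data.agent-ssh}' \
  | base64 -d | tail -c 1 | od -An -tx1
```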
The key is (value from the gcloud secret):
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
QyNTUxOQAAACAr6Vxxx...
-----END OPENSSH PRIVATE KEY-----
Just some thoughts: because the Buildkite agent version is so old, maybe it wants an RSA PRIVATE KEY or something?
Update: tested it with a new legacy-system key (https://docs.github.com/de/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent#einen-neuen-ssh-schl%C3%BCssel-erzeugen) - same result.
Update: tested it with $ ssh-keygen -t ed25519 -C "your_email@example.com" - same result.
Yeah, that looks good, but does it have \n at the end or not? :P Could also be wrong line endings, e.g. CRLF instead of LF. But yeah, there have been some deprecations in OpenSSH that seem to lead to this. So either an older OpenSSH or a newer key might work.
Also - confusingly - it does work in the main container - right? So I'm going to guess this is related to ssh-agent, which is how it's set up in the entrypoint. Is the main buildkite-agent using the same ssh-key mount?
In the same vein (just to rule out copy-paste errors etc.): are you looking at the right secret? The /secrets/ssh-key can be picked either from the default secrets or from the git-ssh-secret-key config, etc. Might be worth checking the job spec to see what is actually being mounted. I notice you specced it as agent-ssh in the first post, for example - does that have the right ssh-key sub-item?
Yeah, that looks good, but does it have \n at the end or not? :P Could also be wrong line endings, e.g. CRLF instead of LF. But yeah, there have been some deprecations in OpenSSH that seem to lead to this. So either an older OpenSSH or a newer key might work.
how would you check that?
Also - confusingly - it does work in the main container - right? So I'm going to guess this is related to ssh-agent, which is how it's set up in the entrypoint. Is the main buildkite-agent using the same ssh-key mount?
yes
In the same vein (just to rule out copy-paste errors etc.): are you looking at the right secret? The /secrets/ssh-key can be picked either from the default secrets or from the git-ssh-secret-key config, etc. Might be worth checking the job spec to see what is actually being mounted. I notice you specced it as agent-ssh in the first post, for example - does that have the right ssh-key sub-item?
This is my pipeline.yml:
steps:
  - group: "Global"
    key: "global"
    steps:
      - plugins:
          - EmbarkStudios/k8s#v1.3.0:
              image: "gcr.io/bazel-public/bazel:6.0.0"
              entrypoint: ""
              shell: [ "sh", "-e", "-c" ]
              service-account-name: "global-base-production"
              secret-name: "buildkite-agent"
              agent-token-secret-key: "agent-token"
              git-ssh-secret-key: "agent-ssh"
              init-image: "embarkstudios/k8s-buildkite-agent@sha256:3c010d09915f3b39c2f8324af5f0aaf910a643e7d63607ee8d49653931b8b167"
        label: "global"
        command:
          - bazel test --remote_cache=$GOOGLE_BUCKET_PRODUCTION --google_credentials=$GOOGLE_CREDENTIALS_PRODUCTION //...
          - bazel build --remote_cache=$GOOGLE_BUCKET_PRODUCTION --google_credentials=$GOOGLE_CREDENTIALS_PRODUCTION //...
If I change git-ssh-secret-key to "agent-ssh-test", it fails with something like "secret not found".
Running out of energy for this ...
Update: I created my own init image to modify the versions and check the SSH key from the /secrets/ssh-key file - everything looks good, the key is there in plain text, yet ssh-add -k /secrets/ssh-key still throws Error loading key "/secrets/ssh-key": invalid format.
I'm looking at reproing on my branch, and it looks like our variant of this passes - I've fixed the bug with the init-image config there, but can't repro the SSH issue. Can you try cat -e /secrets/ssh-key? And ensure each line, including the last, ends with only $, not ^M$.
All lines of the logs have $ at the end - but this one looks weird!
{
"textPayload": "-----END OPENSSH PRIVATE KEY-----Agent pid 10",
"insertId": "xasuvsciy5nzspe8",
"resource": {
"type": "k8s_container",
"labels": {
"pod_name": "global-base-55-zplhv5zh-nn5hb",
"project_id": "buildkite-374309",
"location": "us-central1",
"container_name": "bootstrap",
"namespace_name": "buildkite",
"cluster_name": "buildkite-gke-production"
}
},
That looks like there's no trailing newline so two lines get merged when catting it.
I've added a newline (?) to my Google Secret Manager secret now. The thing is, this secret gets downloaded in my script through gcloud and is passed to Terraform through an input. Long story short: after adding the newline (?), Terraform does not recognize any changes - so I think it gets trimmed somewhere along the way.
Need to check that after lunch.
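A sketch of what I'll check - shell command substitution strips trailing newlines, so if the script does KEY="$(gcloud ...)" the final newline is gone before Terraform ever sees it (the secret name here is a placeholder):

```sh
# Command substitution drops the trailing newline of the secret value:
KEY="$(gcloud secrets versions access latest --secret=buildkite-agent-ssh)"

# Writing it back out with printf re-appends exactly one "\n":
printf '%s\n' "$KEY" > ssh-key
```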
@loeffel-io FWIW I dug in and found a few bugs/edge-cases in how we create jobs. I get a passing run in our env with
- EmbarkStudios/k8s#6b36fe4f6b770cdb97fd420b50cc94cc1c0bcbce:
as the plugin spec.
This is the full config we have on that branch:
amazing! 🙏
Hello,
I'm trying to configure the Buildkite Helm chart. It creates these two base64 secrets (https://github.com/buildkite/charts/blob/master/stable/agent/templates/secret.yaml#L12):
How do I configure the plugin now? I am pretty confused about all the different options like secret-name and default-secret-name.
This is my current configuration:
Would really like to get some quick help 🙏 ❤️