buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC
https://buildkite.com/docs/quickstart/elastic-ci-stack-aws
MIT License
417 stars 267 forks source link

Certain env vars set in pipeline are getting clobbered #891

Closed toothbrush closed 3 years ago

toothbrush commented 3 years ago

Hey folks, we've just adopted this Elastic CI stack instead of our legacy/hand-built one, and it mostly is working well!

We ran into a surprising issue though, when attempting to do a cross-region deploy. The situation is that we rely on AWS_DEFAULT_REGION to point deploy commands at the right target (that's probably not the best, but bear with me, legacy decisions that are now in many repos all over the place 🤷).

One of our pipelines looks roughly as follows – commands such as aws inside the Makefile rely on AWS_DEFAULT_REGION pointing where we want to deploy things:

steps:
  - name: us-east-1
    command: make deploy-app
    env:
      AWS_DEFAULT_REGION: 'us-east-1'
      KUBERNETES_DEPLOY_TARGET: 'https://kubernetes-production.us-east-1.our.domain'
    agents:
      queue: deploy-prod

  - name: us-west-2
    command: make deploy-app
    env:
      AWS_DEFAULT_REGION: 'us-west-2'
      KUBERNETES_DEPLOY_TARGET: 'https://kubernetes-production.us-west-2.our.domain'
    agents:
      queue: deploy-prod

However, when the steps run i see that in both cases, AWS_DEFAULT_REGION == us-east-1 – that's where our Buildkite agents run.

I suspect this code is running and somehow overriding the env: block of our pipeline.yml, but there must surely be a way to allow overriding some of these variables?

https://github.com/buildkite/elastic-ci-stack-for-aws/blob/368a03a14b51a44a87af733a96cc193c7b10a3eb/templates/aws-stack.yml#L981-L985

I've currently made a workaround for one of our repositories, but this feels like quite a footgun – if i set a value in the pipeline's env: block, i expect it to be respected in the executed steps.

cc: @novatan-rb @jonhiggs

toothbrush commented 3 years ago

I've just double-checked using a small repro:

steps:
  - name: us-east-1
    command: env
    env:
      TEST_FOO: bar
      AWS_DEFAULT_REGION: us-east-1

  - name: us-west-2
    command: env
    env:
      TEST_FOO: baz
      AWS_DEFAULT_REGION: us-west-2

Concentrating on the us-west-2 step, if i look at the "Environment" tab of the step, it looks as expected:

image

But if i look at the output of the steps, env disagrees:

[2021-08-18T02:56:04Z] $ trap 'kill -- $' INT TERM QUIT; env
[2021-08-18T02:56:04Z] BUILDKITE_CONFIG_PATH=/etc/buildkite-agent/buildkite-agent.cfg
[2021-08-18T02:56:04Z] AWS_REGION=us-east-1
[2021-08-18T02:56:04Z] AWS_DEFAULT_REGION=us-east-1
...

That doesn't seem right to me?

keithduncan commented 3 years ago

Hi @toothbrush!

The part you have linked to only sets environment variables for the bk-install-elastic-stack.sh script run as part of the EC2 Instance’s UserData. However, those environment variables do get saved and subsequently sourced by an Agent environment hook so actually will be set in your job environment by the hook and overwrite the one set by your job env 🤔

Does it seem reasonable to tack an AWS_DEFAULT_REGION=foo on the end of your make commands which will be set as an environment variable for make-spawned subcommands?

keithduncan@Keiths-Mac-mini make-test % make target | grep 'AWS_*'
keithduncan@Keiths-Mac-mini make-test % make target AWS_DEFAULT_REGION=us-east-1 | grep 'AWS_*'
declare -x AWS_DEFAULT_REGION="us-east-1"
declare -x MAKEFLAGS="AWS_DEFAULT_REGION=us-east-1"
toothbrush commented 3 years ago

Yeah – that's one way of working around it. I just made a patch to one of our repos to set DEPLOY_REGION=foo and in the Makefile re-export export AWS_DEFAULT_REGION=${DEPLOY_REGION} if set. This way the workaround is obvious.

My concern with an approach like you suggest is that the difference between the following isn't obvious at all:

steps:
  - name: approach 1
    command: make something
    env:
      AWS_DEFAULT_REGION: us-east-1

  - name: approach 2
    command: make something AWS_DEFAULT_REGION=us-west-2

..and someone is liable to come along later and not realise there's a big footgun lurking there.

Wouldn't it be preferable to patch this bit:

https://github.com/buildkite/elastic-ci-stack-for-aws/blob/368a03a14b51a44a87af733a96cc193c7b10a3eb/packer/linux/conf/bin/bk-install-elastic-stack.sh#L46-L47

Into something more like (pseudo-code, but i mean to say, use whatever AWS_DEFAULT_REGION is set, and only set it to the build agent's region if it's not set)?

export AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION:-${REGION}}
toothbrush commented 3 years ago

Actually, it's true that you probably want the agent to have a correct AWS_DEFAULT_REGION corresponding to where they "live", whereas the job should probably respect what's in the pipeline.yml? That makes my simplistic suggestion incorrect.

I still feel like this behaviour is rather unexpected.

lox commented 3 years ago

I recall running back into this problem years back. I'd planned to solve it by scoping those exports in the environment hook to something like BUILDKITE_STACK_AWS_REGION and then ensuring they are set correctly in the embedded plugins.

lox commented 3 years ago

Alternately, perhaps the stack plugins should query the EC2 meta-data service directly.

keithduncan commented 3 years ago

My concern about changing or removing these now is how much work it will generate for teams who rely on the environment variable being set today 🤔

We could make the built-in environment hook smarter, keeping these by default but not clobbering them if already set by the job env.

Putting something like this in the cfn-env instead:

# Escape, don't interpolate this when writing the cfn-env, check the environment process’ environment for this var
if [ -z "\${AWS_REGION:-}" ]
then
  # Do interpolate, pull in the bash process' AWS_REGION when writing the cfn-env
  export AWS_REGION=$AWS_REGION
fi

# Escape, don't interpolate when writing the cfn-env, check the environment process’ environment for this var
if [ -z "\${AWS_DEFAULT_REGION:-}" ]
then
  # Do interpolate, pull in the bash process' AWS_REGION when writing the cfn-env
  export AWS_DEFAULT_REGION="$AWS_REGION"
fi

Does it matter that these variables could shear? Should we instead check that neither are set by the time the environment hook is running and if true, set both to the stack env to mimic the prior behaviour?

toothbrush commented 3 years ago

We could make the built-in environment hook smarter, keeping these by default but not clobbering them if already set by the job env.

Yeah, this was my awkwardly-worded suggestion above indeed. However on Slack @lox suggested some plugins may rely on AWS_DEFAULT_REGION being equal to the region of the agent – i would humbly suggest that may be a bug if that's true?

Does it matter that these variables could shear? Should we instead check that neither are set by the time the environment hook is running and if true, set both to the stack env to mimic the prior behaviour?

I think i personally care less about the exact behaviour in edge-cases and much more about just being given big honking warnings in anything other than the happy path. For example, setting a variable in the pipeline but having it clobbered would ideally say something scary has happened.

keithduncan commented 3 years ago

Yeah, this was my awkwardly-worded suggestion above indeed. However on Slack @lox suggested some plugins may rely on AWS_DEFAULT_REGION being equal to the region of the agent – i would humbly suggest that may be a bug if that's true?

I think that’s okay from my perspective, any breakage will be confined to cases such as yours (sorry! 😅) where you are deliberately overriding these environment variables for the entire job rather than a piecewise aspect of it as with my make suggestion above.

If we adopt this I’m prepared to follow through with support for an explicit region parameter in any Buildkite plug-ins that need it to ensure robust multi-region support throughout. Though I can’t speak for other plug-in providers 😄

What do you think about preparing a branch with this change and if possible testing it on one of your deployed stacks (branches are only built for us-east-1, I hope that works in this case)? If it works for what you’re doing I’d be prepared to merge it :bow:

toothbrush commented 3 years ago

Cool, i appreciate the response. That all sounds very reasonable to me. I should be able to get to that within a reasonable timeframe (some number of weeks hopefully smaller than, say, 5).

We're in luck, all of our Buildkite fleet runs in us-east-1!