The plot thickens even further. What we're trying to achieve is calling terraform with `HTTPS_PROXY=http://127.0.0.1:8080` so HTTPS connections from within terraform go via `tailscaled`, which is acting as an HTTP proxy. Anything for the tailnet goes across it; anything for the internet goes straight out from the container.
The limitations are that we can only start `tailscaled` during the before hooks of a phase, and we need to stop it during the after hooks (or when the shell exits), otherwise Spacelift sits there for 10 minutes waiting for all processes in the container to exit.
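For context, a minimal sketch of what a `spacetail` wrapper along these lines could look like (the actual script isn't reproduced in this issue, and `TAILSCALE_AUTHKEY` is an assumed variable; `--tun=userspace-networking` and `--outbound-http-proxy-listen` are tailscaled's flags for running as a userspace HTTP proxy):

```sh
#!/bin/sh
# Hypothetical `spacetail` wrapper, not the actual script from this repo.
# Userspace networking means the container needs no /dev/net/tun.
case "$1" in
  up)
    tailscaled \
      --tun=userspace-networking \
      --outbound-http-proxy-listen=localhost:8080 \
      --statedir=/tmp/tailscaled &
    sleep 2  # crude wait for the daemon socket to appear
    # TAILSCALE_AUTHKEY is an assumed variable supplied via the context
    tailscale up --authkey="${TAILSCALE_AUTHKEY}" --hostname=spacelift-run
    ;;
  down)
    tailscale logout 2>/dev/null || true
    pkill -x tailscaled || true
    ;;
  *)
    echo "usage: spacetail up|down" >&2
    exit 64
    ;;
esac
```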
For Plan & Perform phases, spacelift runs in the following order:

1. `/mnt/workspace/.env_hooks_before` loaded
2. `before_$phase` hooks invoked
3. `$phase` command invoked (eg, `terraform plan …`)
4. `after_$phase` hooks invoked
5. `/mnt/workspace/.env_hooks_after` written
6. `terraform show` invoked and the list of managed resources uploaded
Steps 1 through 5 run inside the same ash shell (`sh -c "… && …"`); step 6 I guess runs from the `spacelift-worker` binary itself. Haven't confirmed that though. (Setting `set -o xtrace` in the ash shell doesn't have an effect after the shell exits.)
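A rough mental model of that chain (the hook placeholders and exact commands here are assumptions inferred from the observed ordering, not Spacelift's actual invocation):

```sh
# Hypothetical reconstruction of the single ash shell for steps 1-5.
# BEFORE_PHASE_HOOKS / AFTER_PHASE_HOOKS are illustrative placeholders.
sh -c '
  . /mnt/workspace/.env_hooks_before &&         # 1. load persisted env
  eval "$BEFORE_PHASE_HOOKS" &&                 # 2. before_$phase hooks
  terraform plan -out=spacelift.plan &&         # 3. the $phase command
  eval "$AFTER_PHASE_HOOKS" &&                  # 4. after_$phase hooks
  export -p > /mnt/workspace/.env_hooks_after   # 5. persist env for step 6
'
# Because step 5 is part of the && chain, an EXIT trap set in a before
# hook only fires after the environment has already been written out.
```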
In the initial instance (as per README on `main`) we define `before_$phase` on a context as:

```sh
spacetail up
trap 'spacetail down' EXIT
```
(Using `trap` here to stop the `tailscaled` process as the shell exits after `$phase` has been executed, even if the `$phase` command errored. If we use an `after_$phase` hook to stop tailscaled, it doesn't stop when the `$phase` command errored and the container gets "stuck" waiting for processes to exit until it hits a 10 minute timeout.)
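The EXIT-trap behaviour relied on here is standard shell semantics, easy to verify outside Spacelift:

```sh
# The trap fires when the shell exits, even though the failing command
# short-circuited the rest of the && chain.
sh -c '
  trap "echo cleanup: would run spacetail down here" EXIT
  echo phase starting
  false && echo "never reached"
'
# Output:
#   phase starting
#   cleanup: would run spacetail down here
```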
And have the following environment variables set on the context too:

```
HTTP_PROXY=http://127.0.0.1:8080
HTTPS_PROXY=http://127.0.0.1:8080
```
Running a plan now lets terraform talk to HTTPS endpoints over the tailnet, but then Step 6 fails with a proxy error:
```
[01HP7K3N3GNWZ9KQG934MM6A29] Changes are GO
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: RequestError: send request failed
│ caused by: Post "https://sts.amazonaws.com/": proxyconnect tcp: dial tcp 127.0.0.1:8080: connect: connection refused
│
│
╵
[01HP7K3N3GNWZ9KQG934MM6A29] unexpected exit code when running show command: 1
[01HP7K3N3GNWZ9KQG934MM6A29] Uploading the list of managed resources...
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: RequestError: send request failed
│ caused by: Post "https://sts.amazonaws.com/": proxyconnect tcp: dial tcp 127.0.0.1:8080: connect: connection refused
│
│
╵
[01HP7K3N3GNWZ9KQG934MM6A29] Unexpected exit code when listing outputs: 1
```
I think this is bubbling out from https://github.com/golang/go/blob/e17e5308fd5a26da5702d16cc837ee77cdb30ab6/src/net/http/transport.go#L1617, which is why I suspect `spacelift-worker` is trying to talk to S3 itself and Go picks up `HTTP_PROXY`/`HTTPS_PROXY` from the environment it's in.
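That failure mode is easy to reproduce outside of Spacelift with anything that honours the proxy variables (curl here, purely as an illustration):

```sh
# With tailscaled stopped, any client respecting HTTPS_PROXY fails the
# CONNECT dial to the now-dead proxy, just like the Go HTTP transport.
HTTPS_PROXY=http://127.0.0.1:8080 curl -sS https://sts.amazonaws.com/
# curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
```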
Looking at Spacelift's terraform `workflow.yml` it'll either be calling `terraform show -json` or `terraform show -json {{ .PlanFileName }}`, and presumably loading the envariables from the file to make sure terraform can execute properly. In this case we don't want it to use the proxy, because our state isn't over the tailnet. (And because we can't leave `tailscaled` running for this, because it's happening after we can control the runtime.)
**`unset` attempt**

After some thinking and debugging, came up with setting the `HTTP_PROXY` and `HTTPS_PROXY` environment variables in the before phase hooks and unsetting them in the after phase hooks, rather than defining them as environment variables on the context. That way they never get persisted into the env file, and therefore aren't defined at the point the S3 upload happens from `spacelift-worker` either.

(As mentioned in the issue body above, initially attempted using `unset` from the `trap`, but that doesn't work because the environment is persisted to disk before the shell exits and the trap is called.)
So we end up with the context having no environment variables set, and the `before_$phase` hooks set to:

```sh
spacetail up
trap 'spacetail down' EXIT
export HTTP_PROXY=http://127.0.0.1:8080 HTTPS_PROXY=http://127.0.0.1:8080
```
And the corresponding `after_$phase` hooks set to:

```sh
unset HTTP_PROXY HTTPS_PROXY
```
🎉 This works for the Plan phase. 😭 And then fails for the Apply phase. (Also works fine for the Perform phase.)
Turns out, after a bunch of debugging (drop `set -o xtrace` in a before hook to observe what's being run in the shell), there's an ordering difference (bug?) in the Apply phase compared to the Plan and Perform phases. For the Apply phase the list of steps above doesn't hold true: the environment persistence and after hooks are swapped in ordering. So I'm observing the following happening for an Apply phase:
1. `/mnt/workspace/.env_hooks_before` loaded
2. `before_$phase` hooks invoked
3. `$phase` command invoked (eg, `terraform apply …`)
4. `/mnt/workspace/.env_hooks_after` written
5. `after_$phase` hooks invoked

(Steps 4 & 5 have reversed.)
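In the same hypothetical terms as the chain sketched earlier, the Apply phase behaves as if it were:

```sh
# Hypothetical reconstruction of the Apply-phase shell: the environment is
# persisted (step 4) before the after hooks (step 5) get a chance to unset
# the proxy variables, so they survive into the persisted file.
sh -c '
  . /mnt/workspace/.env_hooks_before &&
  eval "$BEFORE_PHASE_HOOKS" &&
  terraform apply spacelift.plan &&
  export -p > /mnt/workspace/.env_hooks_after &&
  eval "$AFTER_PHASE_HOOKS"
'
```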
So now we're back to the `terraform show` / S3 bucket upload erroring out trying to use an `HTTPS_PROXY` that has been shut down by the time Step 6 is invoked.
```
[01HPGVA7CVJEAH3CPHARYR31N7] Changes applied successfully
[01HPGVA7CVJEAH3CPHARYR31N7] Uploading the list of managed resources...
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: RequestError: send request failed
│ caused by: Post "https://sts.amazonaws.com/": proxyconnect tcp: dial tcp 127.0.0.1:8080: connect: connection refused
│
│
╵
[01HPGVA7CVJEAH3CPHARYR31N7] Uploading the list of managed resources failed: unexpected exit code when running show command: 1
```
As a workaround for now, I'm editing the `/mnt/workspace/.env_hooks_after` file on disk in an after hook to remove the `HTTP_PROXY=` and `HTTPS_PROXY=` lines so they aren't loaded when Step 6 is running `terraform show`. So the working hooks for now are:
`before_$phase`:

```sh
spacetail up
trap 'spacetail down' EXIT
export HTTP_PROXY=http://127.0.0.1:8080 HTTPS_PROXY=http://127.0.0.1:8080
```
`after_$phase`:

```sh
unset HTTP_PROXY HTTPS_PROXY
sed -e '/HTTP_PROXY=/d' -e '/HTTPS_PROXY=/d' -i /mnt/workspace/.env_hooks_after || true
```
This ensures the environment variables aren't left in the env after any of the phases run, but it's weird that the Apply phase saves the environment variables and then runs the after hooks.
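If you want to verify the workaround actually took, a quick check could be appended to the same after hook (a suggestion, not part of the working setup above):

```sh
# Prints any surviving proxy lines from the persisted environment file;
# prints nothing when the sed cleanup worked.
grep -E '^HTTPS?_PROXY=' /mnt/workspace/.env_hooks_after || true
```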
To summarise:

- Can't set `HTTPS_PROXY` in the environment, because it breaks Spacelift uploading your planned resources/changes into their own S3 bucket.
- Instead, need to have `export HTTPS_PROXY=http://127.0.0.1:8080` in the `before_*` hooks, and `unset HTTPS_PROXY` in the matching `after_*` hooks. This stops Spacelift from saving this envariable as part of the environment.
- (We can't use the `trap '…' EXIT` logic, because that runs after Spacelift has saved the transient envariables to disk.)