cloudfoundry / guardian

ip: command not found #109

Closed: karlkfi closed this 6 years ago

karlkfi commented 6 years ago

Description

I upgraded to Concourse v3.9.1 on my AWS cluster (Amazon Linux instances), and now the workers fail on startup.

It looks like it might be a PATH error, because ip IS installed and available in the normal path at /sbin/ip.

But the PATH IS being copied: https://github.com/cloudfoundry/guardian/blob/master/kawasaki/iptables/global_chains.go#L246

The ip: command not found error is escaping a bash subshell (because the subshell isn't using errexit): https://github.com/cloudfoundry/guardian/blob/master/kawasaki/iptables/global_chains.go#L105

Then the result isn't being quoted so the iptables command explodes: https://github.com/cloudfoundry/guardian/blob/master/kawasaki/iptables/global_chains.go#L111

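So default_interface ends up empty, the unquoted expansion disappears from the command line, and iptables is invoked as `-i --jump ACCEPT`, which causes the fatal Bad argument `ACCEPT'.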
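The failure chain described above can be reproduced in isolation. The following is a minimal sketch, not the Guardian script itself: `missing_tool` and `count_args` are hypothetical stand-ins for the failing `ip route show` call and for the iptables invocation. It assumes, as the trace suggests, that the lookup runs as a pipeline, whose exit status comes from its last command.

```shell
#!/bin/sh
set -o nounset
set -o errexit

# Hypothetical stand-in for `ip route show` when `ip` is not on PATH:
# it prints nothing and exits 127, like "command not found".
missing_tool() { return 127; }

# The pipeline's exit status is that of its last command (cut), which
# succeeds on empty input, so errexit does not fire and the variable is
# silently left empty.
default_interface=$(missing_tool | grep default | head -1 | cut -d' ' -f5)

# Hypothetical stand-in for iptables: reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted, the empty value vanishes and `-i` swallows `--jump` as its operand.
unquoted=$(count_args -I chain -i $default_interface --jump ACCEPT)
# Quoted, the empty value is still passed as its own (empty) argument.
quoted=$(count_args -I chain -i "$default_interface" --jump ACCEPT)

echo "default_interface='${default_interface}'"
echo "unquoted argument count: ${unquoted}"
echo "quoted argument count: ${quoted}"
```

Quoting alone would not make the command succeed, since the interface name would still be empty, but it would stop `--jump` from being misread as the interface.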

Logging

From the worker log (stdout+stderr):

Exit trace for group:
garden exited with error: Exit trace for group:
garden-runc exited with error: bulk starter: setting up default chains: iptables: setup-global-chains: + set -o nounset
+ set -o errexit
+ shopt -s nullglob
+ filter_input_chain=w--input
+ filter_forward_chain=w--forward
+ filter_default_chain=w--default
+ filter_instance_prefix=w--instance-
+ nat_prerouting_chain=w--prerouting
+ nat_postrouting_chain=w--postrouting
+ nat_instance_prefix=w--instance-
+ iptables_bin=/opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables
+ case "${ACTION}" in
+ setup_filter
+ teardown_filter
+ teardown_deprecated_rules
++ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -S INPUT
+ rules='-P INPUT ACCEPT'
+ echo '-P INPUT ACCEPT'
+ grep ' -j garden-dispatch'
+ sed -e 's/--icmp-type any/--icmp-type 255\/255/'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ xargs --no-run-if-empty --max-lines=1 /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w
++ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -S FORWARD
+ rules='-P FORWARD ACCEPT'
+ echo '-P FORWARD ACCEPT'
+ grep ' -j garden-dispatch'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ xargs --no-run-if-empty --max-lines=1 /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w
+ sed -e 's/--icmp-type any/--icmp-type 255\/255/'
+ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -F garden-dispatch
+ true
+ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -X garden-dispatch
+ true
++ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -S w--forward
+ rules=
+ true
+ echo ''
+ grep '\-g w--instance-'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ xargs --no-run-if-empty --max-lines=1 /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w
+ sed -e 's/--icmp-type any/--icmp-type 255\/255/'
++ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -S
+ rules='-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT'
+ echo '-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT'
+ grep '^-A w--instance-'
+ sed -e 's/--icmp-type any/--icmp-type 255\/255/'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ xargs --no-run-if-empty --max-lines=1 /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w
++ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -S
+ rules='-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT'
+ echo '-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT'
+ grep '^-N w--instance-'
+ sed -e s/-N/-X/ -e 's/\s\+$//'
+ sed -e 's/--icmp-type any/--icmp-type 255\/255/'
+ xargs --no-run-if-empty --max-lines=1 /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w
++ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -S FORWARD
+ rules='-P FORWARD ACCEPT'
+ echo '-P FORWARD ACCEPT'
+ grep ' -j w--forward'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ xargs --no-run-if-empty --max-lines=1 /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w
+ sed -e 's/--icmp-type any/--icmp-type 255\/255/'
+ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -F w--forward
+ true
+ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -F w--default
+ true
++ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -S INPUT
+ rules='-P INPUT ACCEPT'
+ echo '-P INPUT ACCEPT'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ sed -e 's/--icmp-type any/--icmp-type 255\/255/'
+ grep ' -j w--input'
+ xargs --no-run-if-empty --max-lines=1 /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w
+ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -F w--input
+ true
+ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -X w--input
+ true
++ ip route show
bash: line 94: ip: command not found
++ grep default
++ head -1
++ cut '-d ' -f5
+ default_interface=
+ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -N w--input
+ /opt/concourse/worker/3.9.1/assets/iptables/sbin/iptables -w -I w--input -i --jump ACCEPT
Bad argument `ACCEPT'
Try `iptables -h' or 'iptables --help' for more information.

baggageclaim exited with nil
beacon exited with nil

2018/03/01 00:14:43 failed to forward remote connection: dial tcp 127.0.0.1:7777: connect: connection refused

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/155610309

The labels on this github issue will be updated when the story is started.

danail-branekov commented 6 years ago

Hi @karlkfi,

Garden does not explicitly add the location of ip to the PATH; it just expects ip to be there. Could you please make sure that /sbin is on the PATH? Was the Concourse version the only thing you changed?

Thanks!

karlkfi commented 6 years ago

SSHing into the worker box:

$ sudo su
# echo $PATH
/sbin:/bin:/usr/sbin:/usr/bin:/opt/aws/bin
# which ip
/sbin/ip

The ip error seems to only happen the first time, perhaps because I have cron running it on reboot... maybe it's being run before PATH is set up?

If I SSH in after it fails and then try to manually relaunch... now I get a different error:

CONCOURSE_BAGGAGECLAIM_DRIVER=naive \
/usr/local/bin/concourse worker \
--work-dir /opt/concourse/worker \
--tsa-host $ELB_HOST \
--tsa-port 2222 \
--tsa-public-key /home/ec2-user/keys/worker/tsa_host_key.pub \
--tsa-worker-private-key /home/ec2-user/keys/worker/worker_key \
&>/var/log/concourse_worker.log

From the end of the log:

2018/03/01 18:19:26 failed to forward remote connection: dial tcp 127.0.0.1:7777: connect: connection refused
Exit trace for group:
garden exited with error: Exit trace for group:
garden-runc exited with error: bulk starter: iptables: flushing-default-chain: iptables: No chain/target/match by that name.

baggageclaim exited with nil
beacon exited with nil

karlkfi commented 6 years ago

That error comes from here: https://github.com/cloudfoundry/guardian/blob/master/kawasaki/iptables/global_chains.go#L290

karlkfi commented 6 years ago

I was able to work around the corrupted state by adding CONCOURSE_GARDEN_DESTROY_CONTAINERS_ON_STARTUP=true, which forces the SetupScript to run.

That doesn't resolve the PATH issue, but it does confirm that the ip command works if I run it manually, which seems to indicate the PATH issue came from me transitioning the worker to start with cron on reboot.

karlkfi commented 6 years ago

Adding /sbin to the PATH in the crontab command seems to fix the issue.

I have no idea why PATH isn't set up, but that seems to be the case.

So I guess the upgrade was coincidental and the real issue was moving to using crontab to launch the worker.

It would be nice if the bash scripts were more defensive, though: for example, checking for dependencies before trying to run them, and exiting on the first error instead of further corrupting the setup.
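As an illustration of what "more defensive" could look like, here is a rough sketch in POSIX shell. It is not the change that was made in Guardian: `require_cmd` and `default_interface_from` are hypothetical helper names, and the sample routing table is made-up data in the format of `ip route show`.

```shell
#!/bin/sh
set -o nounset
set -o errexit

# Fail early, with a clear message, if any required command is missing.
require_cmd() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      return 1
    fi
  done
}

# Extract the default interface from `ip route show`-style text passed as $1,
# and refuse to return an empty result.
default_interface_from() {
  found=$(printf '%s\n' "$1" | awk '$1 == "default" { print $5; exit }')
  if [ -z "$found" ]; then
    echo "could not determine default interface" >&2
    return 1
  fi
  printf '%s\n' "$found"
}

# Dependency checks: one command that exists, one that should not.
if require_cmd sh awk; then present_check=ok; else present_check=failed; fi
if require_cmd no-such-command-for-this-sketch 2>/dev/null; then
  absent_check=ok
else
  absent_check=failed
fi

# Made-up sample data for illustration only.
sample_routes='default via 10.0.0.1 dev eth0
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.5'
iface=$(default_interface_from "$sample_routes")

# Empty input is rejected instead of yielding an empty interface name.
if default_interface_from "" 2>/dev/null; then
  empty_result=accepted
else
  empty_result=rejected
fi

echo "present=$present_check absent=$absent_check iface=$iface empty=$empty_result"
```

In a real setup script the checks would run before any chain is torn down, so a missing `ip` would abort the run before the existing rules are touched.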

danail-branekov commented 6 years ago

Thanks for all that information! Indeed, cron runs as a daemon, and its environment is not inherited from any other shell. man crontab describes which environment variables are set automatically; anything else the job needs has to be set in the cron script itself.
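The effect of a sparse environment can be simulated without cron. The sketch below uses `env -i` to start from an empty environment and a temporary directory as a hypothetical stand-in for /sbin; `stub-ip` is a made-up tool name. The sparse PATH value is an assumption, since cron's default PATH varies by implementation.

```shell
#!/bin/sh
set -o nounset
set -o errexit

# Hypothetical stand-in for /sbin: a temp directory holding a stub tool.
tool_dir=$(mktemp -d)
trap 'rm -rf "$tool_dir"' EXIT
printf '#!/bin/sh\necho stub-ok\n' > "$tool_dir/stub-ip"
chmod +x "$tool_dir/stub-ip"

# Assumed sparse PATH that omits tool_dir, similar in spirit to what a
# cron job may start with.
sparse_path=/usr/bin:/bin

# With the sparse PATH the tool cannot be found.
without_dir=$(env -i PATH="$sparse_path" sh -c 'command -v stub-ip || echo "not found"')

# With the directory added explicitly, the same lookup succeeds.
with_dir=$(env -i PATH="$tool_dir:$sparse_path" sh -c 'stub-ip')

echo "sparse PATH:   $without_dir"
echo "extended PATH: $with_dir"
```

For the worker in this issue, the equivalent fix is to put /sbin on the PATH for the cron job, either in the command itself or with a PATH= line at the top of the crontab.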

As for script resiliency, I have raised a chore on the Garden public tracker to address that.

I am closing this issue because its root cause is now clear. Feel free to reopen it if you disagree.

Thanks!