eddiewebb / circleci-dmz-orb

Allows CircleCI builds to access private network services over an intermediate jump host using SSH port forwarding.
https://circleci.com

Tunnel creation is not stable without ExitOnForwardFailure and extra sleep #4

Open roman-finix opened 3 years ago

roman-finix commented 3 years ago

First of all, I hope https://circleci.com/developer/orbs/orb/eddiewebb/dmz#orb-source points to this repo.

We set up this orb and it has been stable most of the time. But there are a bunch of cases where tunnel creation finishes successfully and our next step then fails to connect to the tunnel. The log below is from that next step, not from this orb:

2020/07/13 23:30:41 Waiting for: tcp://localhost:****
2020/07/13 23:30:41 Problem with dial: dial tcp 127.0.0.1:****: connect: connection refused. Sleeping 1s
2020/07/13 23:30:42 Problem with dial: dial tcp 127.0.0.1:****: connect: connection refused. Sleeping 1s
2020/07/13 23:30:43 

Restarting the whole CircleCI job helps. After a bunch of experiments we found a solution: patching the orb with one extra argument for ssh and one extra command that gives the tunnel time to come up (in our case, sleep).

Example of the modification:

      - run:
          # MODIFICATION: -o ExitOnForwardFailure=yes
          # MODIFICATION: sleep 5
          command: |
            ssh -o ExitOnForwardFailure=yes -4 -L <<parameters.local_port>>:<<parameters.target_host>>:<<parameters.target_port>> -Nf <<parameters.bastion_user>>@<<parameters.bastion_host>>
            sleep 5
          name: Open Local Port Forwarding on <<parameters.local_port>> to <<parameters.target_host>>:<<parameters.target_port>>
            via <<parameters.bastion_host>>

After more than 6 months with no problems, I think we can discuss how to contribute this back to the original repo.


I clearly understand that "sleep 5" is not an ideal solution to keep. What I can suggest is to let the user run any command they want, e.g. eval << parameters.post_ssh_command >>, so they can verify the tunnel in whatever way they like or simply sleep X seconds, depending on the specifics of their network.
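
A rough sketch of what such a verification command could look like, in place of a fixed sleep. The helper name retry_until is hypothetical and not part of the orb, and the nc call in the usage comment assumes netcat is installed in the build image. It just re-runs whatever check the user passes until it succeeds or the attempts run out:

```shell
# Hypothetical helper (not part of the orb): run a check command until it
# succeeds, waiting 1 second between attempts, for at most N attempts.
# Plain POSIX sh, so the variables below are global rather than local.
retry_until() {
  attempts="$1"
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    # "$@" is the user-supplied check, e.g. a port probe.
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  echo "check still failing after ${attempts} attempts: $*" >&2
  return 1
}

# Possible use right after the ssh ... -Nf line (assumes nc is available;
# 5432 is only an example port):
#   retry_until 30 nc -z localhost 5432
```

Unlike sleep 5, this returns as soon as the check passes and fails the step loudly if the tunnel never comes up. Note that a successful probe of the local port only shows that something is listening locally; it does not prove the target behind the bastion is reachable.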

roman-finix commented 3 years ago

from ssh manual:

-f
    Requests ssh to go to background just before command execution.
    This is useful if ssh is going to ask for passwords or passphrases,
    but the user wants it in the background. This implies -n.
    The recommended way to start X11 programs at a remote site is
    with something like ssh -f host xterm.

    If the ExitOnForwardFailure configuration option is set to “yes”,
    then a client started with -f will wait for all remote port forwards
    to be successfully established before placing itself in the
    background.

After investigation I found the ssh option “-o ExitOnForwardFailure=yes”, which might contribute to the “… will wait …” behavior, so using it is beneficial. After a lot of testing with this option I still get failures, a bit less frequently (subjective impression), but there are still failures.

I added code like the following to the orb (but I never reached the state where echo "retry tunel ..." runs, and I never had connection problems: 2x executions of the job without a failure):

#            netstat -tna | grep 'LISTEN\>'
#            while ! netstat -tna | grep 'LISTEN\>' | grep -q '.5432'; do   sleep 1; done;
#            if ! dockerize -wait tcp://localhost:5432 -timeout 1m; then
#              echo "retry tunel ..."
#              touch retry.txt
#              ls -la
#              ssh -o ExitOnForwardFailure=yes -4 -L <<parameters.local_port>>:<<parameters.target_host>>:<<parameters.target_port>> -Nf <<parameters.bastion_user>>@<<parameters.bastion_host>>
#              dockerize -wait tcp://localhost:5432 -timeout 1m
#            fi  

      - run:
          name: check retry
          command: |
            ls -la
            if [ -f retry.txt ]; then echo "there WAS retry"; false; else echo "there was no retry"; true; fi

This gave me the idea that the problem is that CircleCI's execution of ssh commands might be too quick, and the wrapper shell environment kills or damages the ssh process before it can initialize properly. I replaced the code above with a simple “sleep 5” and it is just as stable (at least I have been lucky enough to see no failures).