cloud-barista / cb-ladybug

Cloud-Barista Multi-Cloud Application Runtime Framework : Support Multi-Cloud Kubernetes Service
Apache License 2.0

Error occurs while running `./cluster-create.sh` #75

Closed jihoon-seo closed 3 years ago

jihoon-seo commented 3 years ago

What happened :

❯ ./cluster-create.sh cb sjh1

[INFO]
- Namespace                  is 'cb'
- Cluster name               is 'sjh1'

------------------------------------------------------------------------------

(Wait for it...)

{
  "message": "copy scripts error (server=13.113.190.119:22, cause=dial tcp 13.113.190.119:22: connect: connection refused)"
}
INFO[5834] Not found data (status=404, method=GET, url=/ns/cb/mcis/sjh1)
INFO[5834] start create vpc (name=aws-ap-northeast-1-vpc)
INFO[5834] Not found data (status=404, method=GET, url=/ns/cb/resources/vNet/aws-ap-northeast-1-vpc)
INFO[5836] create vpc OK.. (name=aws-ap-northeast-1-vpc)
INFO[5836] start create firewall (name=aws-ap-northeast-1-sg)
INFO[5836] Not found data (status=404, method=GET, url=/ns/cb/resources/securityGroup/aws-ap-northeast-1-sg)
INFO[5837] create firewall OK.. (name=aws-ap-northeast-1-sg)
INFO[5837] start create ssh key (name=aws-ap-northeast-1-sshkey)
INFO[5837] Not found data (status=404, method=GET, url=/ns/cb/resources/sshKey/aws-ap-northeast-1-sshkey)
INFO[5837] create ssh key OK.. (name=aws-ap-northeast-1-sshkey)
INFO[5837] AMI find OK (ami='ami-02b658ac34935766f', region='ap-northeast-1')
INFO[5837] start create image (name=aws-ap-northeast-1-ubuntu1804)
INFO[5837] Not found data (status=404, method=GET, url=/ns/cb/resources/image/aws-ap-northeast-1-ubuntu1804)
INFO[5837] create image OK.. (name=aws-ap-northeast-1-ubuntu1804)
INFO[5837] start create spec (name=aws-ap-northeast-1-t2-medium-spec)
INFO[5837] Not found data (status=404, method=GET, url=/ns/cb/resources/spec/aws-ap-northeast-1-t2-medium-spec)
INFO[5837] create spec OK.. (name=aws-ap-northeast-1-t2-medium-spec)
INFO[5837] start create vpc (name=gcp-asia-northeast3-vpc)
INFO[5837] Not found data (status=404, method=GET, url=/ns/cb/resources/vNet/gcp-asia-northeast3-vpc)
INFO[5866] create vpc OK.. (name=gcp-asia-northeast3-vpc)
INFO[5866] start create firewall (name=gcp-asia-northeast3-sg)
INFO[5866] Not found data (status=404, method=GET, url=/ns/cb/resources/securityGroup/gcp-asia-northeast3-sg)
INFO[5871] create firewall OK.. (name=gcp-asia-northeast3-sg)
INFO[5871] start create ssh key (name=gcp-asia-northeast3-sshkey)
INFO[5871] Not found data (status=404, method=GET, url=/ns/cb/resources/sshKey/gcp-asia-northeast3-sshkey)
INFO[5872] create ssh key OK.. (name=gcp-asia-northeast3-sshkey)
INFO[5872] start create image (name=gcp-asia-northeast3-ubuntu1804)
INFO[5872] Not found data (status=404, method=GET, url=/ns/cb/resources/image/gcp-asia-northeast3-ubuntu1804)
INFO[5872] create image OK.. (name=gcp-asia-northeast3-ubuntu1804)
INFO[5872] start create spec (name=gcp-asia-northeast3-n1-standard-2-spec)
INFO[5872] Not found data (status=404, method=GET, url=/ns/cb/resources/spec/gcp-asia-northeast3-n1-standard-2-spec)
INFO[5872] create spec OK.. (name=gcp-asia-northeast3-n1-standard-2-spec)
INFO[5872] start create MCIS (name=sjh1)
INFO[5905] create MCIS OK.. (name=sjh1)
INFO[5905] start k8s bootstrap
WARN[5908] connection test error (server=13.113.190.119:22, cause=dial tcp 13.113.190.119:22: connect: connection refused)
WARN[5908] connection test error (server=13.113.190.119:22, cause=dial tcp 13.113.190.119:22: connect: connection refused)
INFO[5908] start script file copy (vm=sjh1-c-1-50wcy, src=/home/jhseo/go/src/github.com/cloud-barista/cb-ladybug/src/scripts, dest=/tmp)
ERRO[5908] copy scripts error (server=13.113.190.119:22, cause=dial tcp 13.113.190.119:22: connect: connection refused)
WARN[5922] connection test error (server=34.64.173.210:22, cause=dial tcp 34.64.173.210:22: connect: connection refused)
WARN[5922] connection test error (server=34.64.173.210:22, cause=dial tcp 34.64.173.210:22: connect: connection refused)
INFO[5922] start script file copy (vm=sjh1-w-2-3nwzm, src=/home/jhseo/go/src/github.com/cloud-barista/cb-ladybug/src/scripts, dest=/tmp)
WARN[5922] connection test error (server=34.64.155.57:22, cause=ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain)
WARN[5922] connection test error (server=34.64.155.57:22, cause=ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain)
INFO[5922] start script file copy (vm=sjh1-w-1-y7kr6, src=/home/jhseo/go/src/github.com/cloud-barista/cb-ladybug/src/scripts, dest=/tmp)

For 13.113.190.119, WARN[5908] connection test error was logged twice. At INFO[5908] start script file copy, CB-Ladybug appears to have concluded that SSH access had become available, yet ERRO[5908] copy scripts error (server=13.113.190.119:22 followed.

As a result, the test script terminated with an error.

What you expected to happen :

How to reproduce it (as minimally and precisely as possible) : ❯ ./cluster-create.sh cb sjh1

Anything else we need to know? :

Environment

Proposed solution :

Any other context :

jihoon-seo commented 3 years ago

The second attempt succeeded. (The MCIR were reused rather than deleted.)

INFO[7365] create MCIS OK.. (name=sjh1)
INFO[7365] start k8s bootstrap
INFO[7369] start script file copy (vm=sjh1-w-2-lj1o0, src=/home/jhseo/go/src/github.com/cloud-barista/cb-ladybug/src/scripts, dest=/tmp)
INFO[7369] start script file copy (vm=sjh1-w-1-hxxrm, src=/home/jhseo/go/src/github.com/cloud-barista/cb-ladybug/src/scripts, dest=/tmp)
INFO[7372] end script file copy (vm=sjh1-w-1-hxxrm, server=34.64.215.181:22)
INFO[7373] end script file copy (vm=sjh1-w-2-lj1o0, server=34.64.173.210:22)
Created symlink /etc/systemd/system/kubelet.service.wants/ladybug-bootstrap.service → /lib/systemd/system/ladybug-bootstrap.service.
Created symlink /etc/systemd/system/kubelet.service.wants/ladybug-bootstrap.service → /lib/systemd/system/ladybug-bootstrap.service.
INFO[7384] start script file copy (vm=sjh1-c-1-9k16i, src=/home/jhseo/go/src/github.com/cloud-barista/cb-ladybug/src/scripts, dest=/tmp)
debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Warning: apt-key output should not be parsed (stdout is not a terminal)
Warning: apt-key output should not be parsed (stdout is not a terminal)

...

INFO[7676] install networkCNI
INFO[7678] end k8s init
INFO[7678] start k8s join
INFO[7678] worker join (vm=sjh1-w-1-hxxrm)
W0708 10:27:02.911064   14105 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
INFO[7699] worker join (vm=sjh1-w-2-lj1o0)
W0708 10:27:23.498776   13779 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
INFO[7708] end k8s join
INFO[7708] duration := 6m30.624747945s
jihoon-seo commented 3 years ago

(by @powerkimhub)

I had only SP, TB, and LB running; CB-Dragonfly was not running.

Had CB-Dragonfly been running, CB-Tumblebug would have verified that SSH access works while creating the VMs and installing the CB-Dragonfly monitoring agent, so CB-Ladybug would not have failed to connect over SSH.

powerkimhub commented 3 years ago

@sykim-etri

seokho-son commented 3 years ago

@jihoon-seo @powerkimhub

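Had CB-Dragonfly been running,
CB-Tumblebug would have verified that SSH access works
while creating the VMs and installing the CB-Dragonfly monitoring agent,
so CB-Ladybug would not have failed to connect over SSH.

I am commenting because this statement could be misleading.

When creating an MCIS (VM), CB-TB by default checks SSH connectivity and keeps retrying the Dragonfly agent installation. However, this guarantees neither that the agent can be installed nor that SSH stays available afterwards.

Therefore, rather than relying on Spider or TB, I think it would be better for CB-Ladybug to handle this in its own logic, keeping in mind that SSH may be unavailable. (I am sure CB-Ladybug will be doing so anyway. ^^)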

powerkimhub commented 3 years ago

@sykim-etri

powerkimhub commented 3 years ago

@sykim-etri

@seokho-son

sykim-etri commented 3 years ago

@seokho-son @powerkimhub

powerkimhub commented 3 years ago

@sykim-etri @seokho-son @jihoon-seo

[Resolution]


sykim-etri commented 3 years ago

@seokho-son Where can I check how CB-TB currently verifies SSH connectivity?

seokho-son commented 3 years ago

@sykim-etri https://cloud-barista.github.io/cb-tumblebug-api-web/?url=https://raw.githubusercontent.com/cloud-barista/cb-tumblebug/main/src/api/rest/docs/swagger.yaml#/%5BMCIS%5D%20Remote%20command/post_ns__nsId__cmd_mcis__mcisId_

There is no separate API for checking SSH connectivity.

You can run an arbitrary command through the remote command feature instead. (TB runs hostname or ls.)

sykim-etri commented 3 years ago

@seokho-son I asked the wrong question. Where can I find the code in CB-TB that attempts the SSH connection for installing Dragonfly? CB-LB currently tries to connect only twice, so I would like it to retry at the same level as CB-TB.

seokho-son commented 3 years ago

@sykim-etri

https://github.com/cloud-barista/cb-tumblebug/blob/be7aeb8720c25be657a92afc3557f8d71176bb00/src/core/mcis/control.go#L314 verifySSHUserName is called here, and this is where SSH availability gets confirmed.

1) Check the VM status (stop if it is not running)
2) Check the VM IP (stop if no IP is assigned)
3) Check whether the SSH username has already been verified by CB-TB
4) If it has not been verified, re-check VM connectivity by testing whether the TCP SSH port is open, up to 10 times (timeout: 10 seconds, 2-second sleep between tries): https://github.com/cloud-barista/cb-tumblebug/blob/be7aeb8720c25be657a92afc3557f8d71176bb00/src/core/mcis/sshrun.go#L189
5) Run ssh (ls) to discover the username and check that a normal response comes back

The code that actually installs the agent is at https://github.com/cloud-barista/cb-tumblebug/blob/be7aeb8720c25be657a92afc3557f8d71176bb00/src/core/mcis/control.go#L2000.

After VM creation it waits a forced 20 seconds, then runs the verifySSHUserName check described above before installing the agent.

That is how CB-TB works for now, as a first draft. (Defensive code that checks many times over quite a long period.)

It would be good if CB-LB took a better approach. (I am not much of a programmer...)
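
For illustration only, the TCP port re-check in step 4 could be sketched in shell roughly as follows. This is not the CB-TB code (which is Go); the function names probe_tcp and wait_for_port are hypothetical, the defaults simply reuse the 10 tries, 10-second timeout and 2-second sleep mentioned above, and the probe assumes bash's /dev/tcp feature plus the coreutils timeout command are available.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: retry a TCP port probe a fixed number of times.

# probe_tcp HOST PORT TIMEOUT_SECONDS
# Returns 0 if a TCP connection to HOST:PORT opens within the timeout.
probe_tcp() {
  local host="$1" port="$2" timeout_s="$3"
  # /dev/tcp is a bash feature, so the probe is run through bash explicitly.
  timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# wait_for_port HOST PORT [ATTEMPTS] [TIMEOUT_SECONDS] [SLEEP_SECONDS]
# Returns 0 as soon as one probe succeeds, 1 if every attempt fails.
wait_for_port() {
  local host="$1" port="$2"
  local attempts="${3:-10}" timeout_s="${4:-10}" sleep_s="${5:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if probe_tcp "$host" "$port" "$timeout_s"; then
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$sleep_s"
    fi
    i=$((i + 1))
  done
  return 1
}
```

A caller would run something like `wait_for_port "$server" 22 || exit 1` before starting the script copy. An open port only shows that sshd is listening; authentication can still fail, which is why step 5 runs a real command afterwards.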

sykim-etri commented 3 years ago

@seokho-son Thank you for the detailed explanation. It gave me an indirect look at the effort CB-TB puts into stability. ^^

sykim-etri commented 3 years ago

On some CSPs (OpenStack), sshd startup on a newly created VM can be delayed considerably by cloud-init script execution. We need a solution that takes this into account. https://cloud-barista.slack.com/archives/CLFCLNFTJ/p1625817016229400?thread_ts=1625801251.224700&cid=CLFCLNFTJ
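
One possible way to tolerate a slow sshd start is to wait against an overall deadline instead of a fixed number of tries. The following is only a sketch under that assumption, not what CB-Ladybug implements; wait_until and its parameters are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: retry a command until it succeeds or a deadline passes.

# wait_until DEADLINE_SECONDS MAX_SLEEP_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND, doubling the pause between attempts (1, 2, 4, ...)
# up to MAX_SLEEP_SECONDS. Gives up when the next pause would overrun
# DEADLINE_SECONDS measured from the first attempt.
wait_until() {
  local deadline_s="$1" max_sleep_s="$2"
  shift 2
  local start now pause=1
  start=$(date +%s)
  while :; do
    if "$@"; then
      return 0
    fi
    now=$(date +%s)
    if [ $((now - start + pause)) -gt "$deadline_s" ]; then
      return 1
    fi
    sleep "$pause"
    pause=$((pause * 2))
    if [ "$pause" -gt "$max_sleep_s" ]; then
      pause="$max_sleep_s"
    fi
  done
}
```

For example, `wait_until 600 30 ssh -q -o BatchMode=yes -o ConnectTimeout=5 "$server" 'exit 0'` would keep trying for up to about ten minutes; the 600 and 30 are illustrative values, not measured ones.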

sykim-etri commented 3 years ago

Reference: https://www.golinuxcloud.com/test-ssh-connection/

# cat /tmp/check_connectivity.sh
#!/bin/bash

server=10.10.10.10      # server IP
port=22                 # port
connect_timeout=5       # Connection timeout

# Pass the port explicitly so the $port variable is actually used
ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout="$connect_timeout" -p "$port" "$server" 'exit 0'
if [ $? -eq 0 ]; then
   echo "SSH Connection to $server over port $port is possible"
else
   echo "SSH connection to $server over port $port is not possible"
fi
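
Since a cluster has several nodes, the same check could be applied to every VM before the script copy starts. A minimal sketch, assuming the ssh options from the script above; check_ssh and unreachable_hosts are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: report which hosts do not yet accept SSH.

# check_ssh HOST [PORT] [CONNECT_TIMEOUT_SECONDS]
# Returns 0 if a non-interactive SSH login to HOST succeeds.
check_ssh() {
  ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no \
      -o ConnectTimeout="${3:-5}" -p "${2:-22}" "$1" 'exit 0'
}

# unreachable_hosts HOST...
# Prints each host for which check_ssh fails, one per line.
# Returns 1 if at least one host failed, 0 otherwise.
unreachable_hosts() {
  local host failed=0
  for host in "$@"; do
    if ! check_ssh "$host"; then
      echo "$host"
      failed=1
    fi
  done
  return "$failed"
}
```

Usage might look like `unreachable_hosts 192.0.2.10 192.0.2.11 192.0.2.12 || echo "some nodes are not ready"`, where the addresses are placeholders.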