tleyden opened this issue 9 years ago
Correct, I'm running on GKE, and I was testing from scratch.
@tleyden Have you successfully launched a cluster since merging this commit?
Nope
Won't this issue be resolved once you move app-etcd into a service? It looks like you got an answer to your questions on https://groups.google.com/forum/#!msg/google-containers/rFIFD6Y0_Ew/PlYh0z7weLEJ?
Won't this issue be resolved once you move app-etcd into a service?
I don't think so, because I think this is basically expecting an env variable to be set for the couchbase server service, not the app-etcd service. Either that env variable is not being set, or it's not being resolved.
Is it just the env variable part of that commit that broke things? What if you just revert that line of the PR?
I don't think that aspect was working in an automated fashion anyway -- the README basically told you to go change the value of etcd.pod.ip in your local copy of the file.
@bcbroussard's approach feels like the right one; we just need to figure out why it's not working on GKE. By the way, he and I are scheduled to do a pair coding session this Friday (or if not him, someone on the Kismatic team).
Ok, sounds good. I just wish I hadn't killed my working cluster, because I'm stuck now. Let me know how your coding session turns out. I'm hoping to get this up and as stable as possible in the next few days. Thanks for always being so responsive.
@derekperkins the $COUCHBASE_SERVER_SERVICE_HOST
environment variable is automatically set by Kubernetes when the pod starts (as long as the couchbase-server
service has been created before the pod is started). Are you using a different name?
I was using couchbase-server
as my cluster name. Are you saying that the service has to completely boot before creating the controllers? I was running those pretty close to each other if I remember correctly.
@derekperkins when using environment variables, yes, the service has to be created before the pod. An alternative approach is to use the internal dns name, which should be something like couchbase-server.default.svc.cluster.local
or couchbase-server.default.svc.kubernetes.local
if you are using the default skydns setup.
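To sanity-check the DNS route, something like the following from inside one of the containers should resolve the service (this assumes the default namespace, and nslookup/getent are only available if the image ships them):
$ nslookup couchbase-server.default.svc.cluster.local
$ getent hosts couchbase-server.default.svc.cluster.local   # alternative if nslookup is missing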
You can check a pod's environment variables by ssh-ing into a node with the running pod and running
docker ps
docker inspect CONTAINER_ID
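A slightly more targeted variant of the same check, filtering the inspect output down to the environment (the container id is whatever docker ps reports; prefix with sudo if you hit a permission error on the node):
$ docker inspect --format '{{ .Config.Env }}' CONTAINER_ID
$ docker inspect CONTAINER_ID | grep COUCHBASE_SERVER_SERVICE_HOST   # or just grep for the variable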
GKE does have skydns, but I think ideally we wanted to avoid/minimize using skydns since it's not guaranteed to be on every kubernetes cluster.
I tried switching to the skydns address that was in the yaml file, but that didn't work for me either.
@derekperkins what version of Kubernetes are you running?
I think GKE is on 0.18.2
I'm just following the instructions exactly as laid out in the readme file for this repo.
Gonna assume https://github.com/couchbase/kubernetes/pull/12 fixes this, so I'm closing it. Re-open if not.
It's still not booting for me. My sidekicks are barely even logging anything. This is the sum total of their logs. It actually took me a while to realize that it was the log output and not a kubectl logs error message. :)
$ kubectl logs couchbase-controller-ybzp5 couchbase-sidekick
2015/06/23 00:09:19 No target given in args
@derekperkins what does kubectl get endpoints
show? Are there endpoints for every service?
Sorry for the delay. My machine borked today and I ended up having to get a new one.
I tried from scratch again and still got nothing, but I did get a different error from the couchbase-sidekick. It looks like the etcd constants aren't getting set properly. I just pulled and ran this, so my repo is up to date.
$ kubectl logs couchbase-controller-hh1by couchbase-sidekick
2015/06/24 08:05:52 Update-Wrapper: updating to latest code
github.com/tleyden/couchbase-cluster-go (download)
github.com/coreos/fleet (download)
github.com/coreos/go-systemd (download)
github.com/tleyden/go-etcd (download)
github.com/docopt/docopt-go (download)
2015/06/24 08:05:55 target == SKIP
2015/06/24 08:05:55 args: [couchbase-cluster start-couchbase-sidekick --discover-local-ip --etcd-servers http://:], target: couchbase-cluster
2015/06/24 08:05:55 remainingargs: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://:], target: couchbase-cluster
2015/06/24 08:05:55 Invoking target: couchbase-cluster with args: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://:]
2015/06/24 08:05:55 ExtractEtcdServerList returning: http://:
2015/06/24 08:05:55 args: map[--etcd-servers:http://: start-couchbase-sidekick:true --discover-local-ip:true --k8s-service-name:<nil> remove-and-rebalance:false get-live-node-ip:false --help:false wait-until-running:false --local-ip:<nil>]
2015/06/24 08:05:55 need to discover local ip..
2015/06/24 08:05:55 Discovered local ip: 10.248.0.4
2015/06/24 08:05:55 Connect to explicit etcd servers: [http://:]
2015/06/24 08:05:55 Error getting key: /couchbase.com/userpass. Err: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]. Retrying in 10 secs
Here is what I have for kubectl get endpoints
NAME ENDPOINTS
couchbase-service 10.248.0.4:18092,10.248.1.4:18092,10.248.0.4:11207 + 11 more...
kube-dns 10.248.1.3:53,10.248.1.3:53
kubernetes 130.211.166.201:443
After reverting the couchbase-server.yaml
file to a hardcoded IP address and recreating the pods, the server then came online and the logs show it functioning inside the event loop. Despite having added the proper tags and firewall rules, I'm not able to browse to the Admin screen, as my connection is refused. I don't know what could be causing that.
After reverting the couchbase-server.yaml file to a hardcoded IP address and recreating the pods
Do you mean for the --etcd-servers
argument?
Correct. --etcd-servers http://10.248.0.3:2379
@derekperkins thanks for your feedback, now I know what was going on. From your output I can see 2 things:

1. The app-etcd service does not exist.
2. /couchbase.com/userpass wasn't set in etcd.

The main README is missing a step to create the app-etcd service, even though the file exists in the repo. @tleyden can that step be added in? Then the couchbase-server.yaml file will have the proper environment variables populated.
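For anyone following along, the missing step is just a kubectl create against the etcd service definition in the repo; the filename below is a guess, so substitute whatever the repo actually calls it:
$ kubectl create -f app-etcd-service.yaml
$ kubectl get services   # app-etcd-service should now be listed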
I got a little further, but still can't access the admin console. I added the app-etcd service, deleted the couchbase-server controllers, and recreated them using the default yaml file from the repo.
$ kubectl get endpoints
NAME ENDPOINTS
app-etcd-service 10.248.0.3:2380,10.248.0.3:2379
couchbase-service 10.248.0.8:18092,10.248.1.6:18092,10.248.0.8:11207 + 11 more...
kube-dns 10.248.1.3:53,10.248.1.3:53
kubernetes 130.211.166.201:443
sync-gateway-service 10.248.0.6:4984,10.248.0.7:4984
$ kubectl get pod
POD IP CONTAINER(S) IMAGE(S) HOST LABELS STATUS CREATED MESSAGE
app-etcd 10.248.0.3 gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252 name=app-etcd Running 16 hours
app-etcd tleyden5iwx/etcd-discovery Running 16 hours
couchbase-controller-64643 10.248.1.6 gke-couchbase-server-5ad717cc-node-q8mu/10.240.137.32 name=couchbase-server Running 9 minutes
couchbase-sidekick tleyden5iwx/couchbase-cluster-go:latest Running 9 minutes
couchbase-server couchbase/server:enterprise-4.0.0-beta Running 9 minutes
couchbase-controller-iq7b6 10.248.0.8 gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252 name=couchbase-server Running 9 minutes
couchbase-sidekick tleyden5iwx/couchbase-cluster-go:latest Running 9 minutes
couchbase-server couchbase/server:enterprise-4.0.0-beta Running 9 minutes
fluentd-cloud-logging-gke-couchbase-server-5ad717cc-node-2tr4 10.248.0.2 gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252 <none> Running 29 hours
fluentd-cloud-logging gcr.io/google_containers/fluentd-gcp:1.6 Running 29 hours
fluentd-cloud-logging-gke-couchbase-server-5ad717cc-node-q8mu 10.248.1.2 gke-couchbase-server-5ad717cc-node-q8mu/10.240.137.32 <none> Running 29 hours
fluentd-cloud-logging gcr.io/google_containers/fluentd-gcp:1.6 Running 29 hours
kube-dns-v3-u4s0f 10.248.1.3 gke-couchbase-server-5ad717cc-node-q8mu/10.240.137.32 k8s-app=kube-dns,kubernetes.io/cluster-service=true,version=v3 Running 29 hours
skydns gcr.io/google_containers/skydns:2015-03-11-001 Running 29 hours
kube2sky gcr.io/google_containers/kube2sky:1.9 Running 29 hours
etcd gcr.io/google_containers/etcd:2.0.9 Running 29 hours
sync-gateway-74t4v 10.248.0.7 gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252 name=sync-gateway Running 16 hours
sync-gateway couchbase/sync-gateway Running 37 seconds last termination: exit code 1
sync-gateway-cjoxc 10.248.0.6 gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252 name=sync-gateway Running 16 hours
sync-gateway couchbase/sync-gateway Running 37 seconds last termination: exit code 1
$ kubectl logs couchbase-controller-iq7b6 couchbase-sidekick
2015/06/25 00:12:43 Update-Wrapper: updating to latest code
github.com/tleyden/couchbase-cluster-go (download)
github.com/coreos/fleet (download)
github.com/coreos/go-systemd (download)
github.com/tleyden/go-etcd (download)
github.com/docopt/docopt-go (download)
2015/06/25 00:12:47 target == SKIP
2015/06/25 00:12:47 args: [couchbase-cluster start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.242.85:2379], target: couchbase-cluster
2015/06/25 00:12:47 remainingargs: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.242.85:2379], target: couchbase-cluster
2015/06/25 00:12:47 Invoking target: couchbase-cluster with args: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.242.85:2379]
2015/06/25 00:12:47 ExtractEtcdServerList returning: http://10.251.242.85:2379
2015/06/25 00:12:47 args: map[--discover-local-ip:true --k8s-service-name:<nil> get-live-node-ip:false --local-ip:<nil> --etcd-servers:http://10.251.242.85:2379 start-couchbase-sidekick:true remove-and-rebalance:false --help:false wait-until-running:false]
2015/06/25 00:12:47 need to discover local ip..
2015/06/25 00:12:47 Discovered local ip: 10.248.0.8
2015/06/25 00:12:47 Connect to explicit etcd servers: [http://10.251.242.85:2379]
2015/06/25 00:12:47 BecomeFirstClusterNode()
2015/06/25 00:12:47 Created key: /couchbase.com/couchbase-node-state
2015/06/25 00:12:47 Got error Get http://10.248.0.8:8091/pools: dial tcp 10.248.0.8:8091: connection refused trying to fetch details. Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 00:12:57 Version: 4.0.0-2213-enterprise
2015/06/25 00:12:57 We became first cluster node, init cluster and bucket
2015/06/25 00:12:57 ClusterInit()
2015/06/25 00:12:57 IsClusterPasswordSet()
2015/06/25 00:12:57 ClusterSetPassword()
2015/06/25 00:12:57 Using default username/password
2015/06/25 00:12:57 POST to http://10.248.0.8:8091/settings/web
2015/06/25 00:12:57 Total RAM (MB) on machine: 3711
2015/06/25 00:12:57 Attempting to set cluster ram to: 2783 MB
2015/06/25 00:12:57 Using username/password pulled from etcd
2015/06/25 00:12:57 POST to http://10.248.0.8:8091/pools/default
2015/06/25 00:12:57 CreateBucketWithRetries(): {Name:default RamQuotaMB:128 AuthType:none ReplicaNumber:0}
2015/06/25 00:12:57 Using username/password pulled from etcd
2015/06/25 00:12:57 POST to http://10.248.0.8:8091/pools/default/buckets
2015/06/25 00:12:57 CreateBucket error: Failed to POST to http://10.248.0.8:8091/pools/default/buckets. Status code: 400. Body: {"errors":{"proxyPort":"port is already in use"},"summaries":{"ramSummary":{"total":2918187008,"otherBuckets":0,"nodesCount":1,"perNodeMegs":128,"thisAlloc":134217728,"thisUsed":0,"free":2783969280},"hddSummary":{"total":105553100800,"otherData":6333186048,"otherBuckets":0,"thisUsed":0,"free":99219914752}}}
2015/06/25 00:12:57 Sleeping 0 seconds
2015/06/25 00:12:57 Using username/password pulled from etcd
2015/06/25 00:12:57 POST to http://10.248.0.8:8091/pools/default/buckets
2015/06/25 00:12:57 CreateBucket succeeded
2015/06/25 00:12:57 EventLoop()
@derekperkins can you run kubectl describe services couchbase-service
You should see your external load balancer IP listed under
LoadBalancer Ingress: 104.154.88.223
Then you should see the admin at http://104.154.88.223:8091
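Equivalently, to pull out just that line from the describe output:
$ kubectl describe services couchbase-service | grep "LoadBalancer Ingress"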
I will in 5-10 minutes. I nuked the cluster and am starting from scratch.
Getting closer! After getting the LoadBalancer Ingress
IP, I visited it and was greeted with the username / password prompt. After logging in, it then redirected me to the standard setup page, not the actual console.
Here's the output of kubectl describe services couchbase-service
$ kubectl describe services couchbase-service
W0624 19:19:43.121213 68804 request.go:291] field selector: v1beta3 - events - involvedObject.name - couchbase-service: need to check if this is versioned correctly.
W0624 19:19:43.121341 68804 request.go:291] field selector: v1beta3 - events - involvedObject.namespace - default: need to check if this is versioned correctly.
W0624 19:19:43.121348 68804 request.go:291] field selector: v1beta3 - events - involvedObject.kind - Service: need to check if this is versioned correctly.
W0624 19:19:43.121355 68804 request.go:291] field selector: v1beta3 - events - involvedObject.uid - 51c43579-1ad7-11e5-a755-42010af0d2c1: need to check if this is versioned correctly.
Name: couchbase-service
Labels: <none>
Selector: name=couchbase-server
Type: LoadBalancer
IP: 10.251.241.7
LoadBalancer Ingress: 104.154.57.215
Port: webadministrationport 8091/TCP
NodePort: webadministrationport 30661/TCP
Endpoints: 10.248.0.4:8091,10.248.1.4:8091
Port: couchbaseapiport 8092/TCP
NodePort: couchbaseapiport 30892/TCP
Endpoints: 10.248.0.4:8092,10.248.1.4:8092
Port: internalexternalbucketssl 11207/TCP
NodePort: internalexternalbucketssl 31701/TCP
Endpoints: 10.248.0.4:11207,10.248.1.4:11207
Port: internalexternalbucket 11210/TCP
NodePort: internalexternalbucket 32703/TCP
Endpoints: 10.248.0.4:11210,10.248.1.4:11210
Port: clientinterfaceproxy 11211/TCP
NodePort: clientinterfaceproxy 31503/TCP
Endpoints: 10.248.0.4:11211,10.248.1.4:11211
Port: internalresthttps 18091/TCP
NodePort: internalresthttps 32345/TCP
Endpoints: 10.248.0.4:18091,10.248.1.4:18091
Port: internalcapihttps 18092/TCP
NodePort: internalcapihttps 30695/TCP
Endpoints: 10.248.0.4:18092,10.248.1.4:18092
Session Affinity: None
No events.
Ok, I think I found the problem...
kubectl logs couchbase-controller-8r631 couchbase-sidekick
2015/06/25 01:14:30 Update-Wrapper: updating to latest code
github.com/tleyden/couchbase-cluster-go (download)
github.com/coreos/fleet (download)
github.com/coreos/go-systemd (download)
github.com/tleyden/go-etcd (download)
github.com/docopt/docopt-go (download)
2015/06/25 01:14:33 target == SKIP
2015/06/25 01:14:33 args: [couchbase-cluster start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379], target: couchbase-cluster
2015/06/25 01:14:33 remainingargs: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379], target: couchbase-cluster
2015/06/25 01:14:33 Invoking target: couchbase-cluster with args: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379]
2015/06/25 01:14:33 ExtractEtcdServerList returning: http://10.251.253.222:2379
2015/06/25 01:14:33 args: map[--help:false wait-until-running:false start-couchbase-sidekick:true --discover-local-ip:true remove-and-rebalance:false get-live-node-ip:false --etcd-servers:http://10.251.253.222:2379 --local-ip:<nil> --k8s-service-name:<nil>]
2015/06/25 01:14:33 need to discover local ip..
2015/06/25 01:14:33 Discovered local ip: 10.248.0.4
2015/06/25 01:14:33 Connect to explicit etcd servers: [http://10.251.253.222:2379]
2015/06/25 01:14:33 BecomeFirstClusterNode()
2015/06/25 01:14:33 Created key: /couchbase.com/couchbase-node-state
2015/06/25 01:14:33 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details. Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:14:43 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details. Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:14:53 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details. Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:15:03 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details. Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:15:13 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details. Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:15:23 Version: 4.0.0-2213-enterprise
2015/06/25 01:15:23 We became first cluster node, init cluster and bucket
2015/06/25 01:15:23 ClusterInit()
2015/06/25 01:15:23 IsClusterPasswordSet()
2015/06/25 01:15:23 ClusterSetPassword()
2015/06/25 01:15:23 Using default username/password
2015/06/25 01:15:23 POST to http://10.248.0.4:8091/settings/web
2015/06/25 01:15:23 Total RAM (MB) on machine: 3711
2015/06/25 01:15:23 Attempting to set cluster ram to: 2783 MB
2015/06/25 01:15:23 Using username/password pulled from etcd
2015/06/25 01:15:23 POST to http://10.248.0.4:8091/pools/default
2015/06/25 01:15:23 CreateBucketWithRetries(): {Name:default RamQuotaMB:128 AuthType:none ReplicaNumber:0}
2015/06/25 01:15:23 Using username/password pulled from etcd
2015/06/25 01:15:23 POST to http://10.248.0.4:8091/pools/default/buckets
2015/06/25 01:15:23 CreateBucket error: Failed to POST to http://10.248.0.4:8091/pools/default/buckets. Status code: 400. Body: {"errors":{"proxyPort":"port is already in use"},"summaries":{"ramSummary":{"total":2918187008,"otherBuckets":0,"nodesCount":1,"perNodeMegs":128,"thisAlloc":134217728,"thisUsed":0,"free":2783969280},"hddSummary":{"total":105553100800,"otherData":4222124032,"otherBuckets":0,"thisUsed":0,"free":101330976768}}}
2015/06/25 01:15:23 Sleeping 0 seconds
2015/06/25 01:15:23 Using username/password pulled from etcd
2015/06/25 01:15:23 POST to http://10.248.0.4:8091/pools/default/buckets
2015/06/25 01:15:23 CreateBucket succeeded
2015/06/25 01:15:23 EventLoop()
When both controllers looked good, I went to log in (I did this multiple times after deleting and recreating the controllers) and got the login screen. Once, I even got as far as the admin console for about 15 seconds, then it booted me back out into the setup screen. After getting kicked out, this is what the sidekick log looks like
kubectl logs couchbase-controller-v7flt couchbase-sidekick
2015/06/25 01:27:31 Update-Wrapper: updating to latest code
github.com/tleyden/couchbase-cluster-go (download)
github.com/coreos/fleet (download)
github.com/coreos/go-systemd (download)
github.com/tleyden/go-etcd (download)
github.com/docopt/docopt-go (download)
2015/06/25 01:27:34 target == SKIP
2015/06/25 01:27:34 args: [couchbase-cluster start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379], target: couchbase-cluster
2015/06/25 01:27:34 remainingargs: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379], target: couchbase-cluster
2015/06/25 01:27:34 Invoking target: couchbase-cluster with args: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379]
2015/06/25 01:27:34 ExtractEtcdServerList returning: http://10.251.253.222:2379
2015/06/25 01:27:34 args: map[--help:false --etcd-servers:http://10.251.253.222:2379 --discover-local-ip:true --k8s-service-name:<nil> remove-and-rebalance:false wait-until-running:false start-couchbase-sidekick:true --local-ip:<nil> get-live-node-ip:false]
2015/06/25 01:27:34 need to discover local ip..
2015/06/25 01:27:34 Discovered local ip: 10.248.1.6
2015/06/25 01:27:34 Connect to explicit etcd servers: [http://10.251.253.222:2379]
2015/06/25 01:27:34 BecomeFirstClusterNode()
2015/06/25 01:27:34 Key /couchbase.com/couchbase-node-state already exists
2015/06/25 01:27:34 Got error Get http://10.248.1.6:8091/pools: dial tcp 10.248.1.6:8091: connection refused trying to fetch details. Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:27:44 Version: 4.0.0-2213-enterprise
2015/06/25 01:27:44 JoinExistingCluster() called
2015/06/25 01:27:44 Calling FindLiveNode()
2015/06/25 01:27:44 Couchbase node ip: 10.248.0.4
2015/06/25 01:27:44 Verifying REST service at http://10.248.0.4:8091/ to be up
2015/06/25 01:27:44 liveNodeIp: 10.248.0.4
2015/06/25 01:27:44 JoinLiveNode() called with 10.248.0.4
2015/06/25 01:27:44 GetClusterNodes() called with: 10.248.0.4
2015/06/25 01:27:44 No cluster node found for 10.248.1.6. Not retrying
2015/06/25 01:27:44 WaitUntilInClusterAndHealthy() returned error: Unable to find node with hostname 10.248.1.6 in [map[services:[kv] otpCookie:hzmbyaycqepsfmxj ports:map[sslProxy:11214 httpsMgmt:18091 httpsCAPI:18092 proxy:11211 direct:11210] couchApiBase:http://10.248.0.4:8092/ couchApiBaseHTTPS:https://10.248.0.4:18092/ otpNode:ns_1@127.0.0.1 thisNode:true clusterCompatibility:262144 version:4.0.0-2213-enterprise interestingStats:map[curr_items:0 curr_items_tot:0 ep_bg_fetched:0 ops:0 vb_replica_curr_items:0 cmd_get:0 couch_docs_actual_disk_size:4.257466e+06 couch_docs_data_size:4.241408e+06 couch_views_actual_disk_size:0 couch_views_data_size:0 get_hits:0 mem_used:3.8265264e+07] uptime:747 clusterMembership:active recoveryType:none hostname:10.248.0.4:8091 os:x86_64-unknown-linux-gnu systemStats:map[mem_free:3.047079936e+09 cpu_utilization_rate:3.092783505154639 swap_total:0 swap_used:0 mem_total:3.892027392e+09] memoryTotal:3.892027392e+09 mcdMemoryAllocated:2969 status:healthy memoryFree:3.047079936e+09 mcdMemoryReserved:2969]]. Call AddNodeRetry()
2015/06/25 01:27:44 AddNode()
2015/06/25 01:27:44 AddNode posting to http://10.248.0.4:8091/controller/addNode with data: hostname=10.248.1.6&password=7vmRpf9wZLQwkvbc&user=cbadmin
2015/06/25 01:27:44 Using username/password pulled from etcd
2015/06/25 01:27:44 POST to http://10.248.0.4:8091/controller/addNode
2015/06/25 01:27:49 AddNode failed with err: Failed to POST to http://10.248.0.4:8091/controller/addNode. Status code: 400. Body: ["Failed to reach erlang port mapper. Timeout connecting to \"10.248.1.6\" on port \"4369\". This could be due to an incorrect host/port combination or a firewall in place between the servers."]. Will retry in 10 secs
2015/06/25 01:27:59 AddNode()
2015/06/25 01:27:59 AddNode posting to http://10.248.0.4:8091/controller/addNode with data: hostname=10.248.1.6&password=7vmRpf9wZLQwkvbc&user=cbadmin
2015/06/25 01:27:59 Using username/password pulled from etcd
2015/06/25 01:27:59 POST to http://10.248.0.4:8091/controller/addNode
2015/06/25 01:28:04 AddNode failed with err: Failed to POST to http://10.248.0.4:8091/controller/addNode. Status code: 400. Body: ["Failed to reach erlang port mapper. Timeout connecting to \"10.248.1.6\" on port \"4369\". This could be due to an incorrect host/port combination or a firewall in place between the servers."]. Will retry in 20 secs
2015/06/25 01:28:24 AddNode()
2015/06/25 01:28:24 AddNode posting to http://10.248.0.4:8091/controller/addNode with data: hostname=10.248.1.6&password=7vmRpf9wZLQwkvbc&user=cbadmin
2015/06/25 01:28:24 Using username/password pulled from etcd
2015/06/25 01:28:24 POST to http://10.248.0.4:8091/controller/addNode
2015/06/25 01:28:29 AddNode failed with err: Failed to POST to http://10.248.0.4:8091/controller/addNode. Status code: 400. Body: ["Failed to reach erlang port mapper. Timeout connecting to \"10.248.1.6\" on port \"4369\". This could be due to an incorrect host/port combination or a firewall in place between the servers."]. Will retry in 30 secs
$ kubectl get endpoints
NAME ENDPOINTS
app-etcd-service 10.248.1.3:2380,10.248.1.3:2379
couchbase-service 10.248.0.4:18092,10.248.1.6:18092,10.248.0.4:11207 + 11 more...
kube-dns 10.248.0.3:53,10.248.0.3:53
kubernetes 130.211.147.104:443
So 10.248.1.6 is trying to join itself to existing cluster member 10.248.0.4, but when 10.248.0.4 tries to call back to 10.248.1.6 on port 4369, it fails.
I would suggest "docker exec'ing" into the couchbase container running on 10.248.0.4, and then ping 10.248.1.6 to make sure there is a route, and then try running telnet 10.248.1.6 4369
.
Another thing you could try is "docker exec'ing" into the couchbase container running on 10.248.1.6, and try running telnet localhost 4369
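Spelled out, those checks would look roughly like this (the node name and container id are illustrative, and telnet may need to be installed inside the container first):
$ gcloud compute ssh <node hosting the 10.248.0.4 pod>
$ sudo docker ps | grep couchbase/server        # note the couchbase server container id
$ sudo docker exec -ti <container_id> bash
# inside the container:
yum install -y telnet       # only if telnet is missing
ping 10.248.1.6
telnet 10.248.1.6 4369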
I'm not able to get into either of those containers for some weird reason. It's telling me that the host doesn't exist when running kubectl exec
, and it's throwing a TLS error on docker exec
from inside the VM.
$ gcloud compute instances list
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP STATUS
gke-couchbase-server-96b2a542-node-6sep us-central1-b n1-standard-1 10.240.13.115 RUNNING
gke-couchbase-server-96b2a542-node-c278 us-central1-b n1-standard-1 10.240.70.131 RUNNING
$ kubectl exec couchbase-controller-8r631 bash
error: Unable to upgrade connection: {
"kind": "Status",
"apiVersion": "v1beta3",
"metadata": {},
"status": "Failure",
"message": "dial tcp: lookup gke-couchbase-server-96b2a542-node-6sep: no such host",
"code": 500
}
kubectl exec couchbase-controller-v7flt bash
error: Unable to upgrade connection: {
"kind": "Status",
"apiVersion": "v1beta3",
"metadata": {},
"status": "Failure",
"message": "dial tcp: lookup gke-couchbase-server-96b2a542-node-c278: no such host",
"code": 500
}
$ gcloud compute ssh gke-couchbase-server-96b2a542-node-c278
...
$ docker exec 10.248.0.4 bash
FATA[0000] Post http:///var/run/docker.sock/v1.18/containers/10.248.0.4/exec: dial unix /var/run/docker.sock: permission denied. Are you trying to connect to a TLS-enabled daemon without TLS?
What about sudo docker exec 10.248.0.4 bash?
The command worked this time, but I logged into both VMs and neither one was able to connect.
$ sudo docker exec 10.248.0.4 bash
FATA[0000] Error response from daemon: no such id: 10.248.0.4
@tleyden - I noticed that port 4369 is not defined in couchbase-service.yaml, and it's not exposed in the couchbase-server.yaml file either. Unless you can think of something else, it seems like that might be the culprit.
I thought the couchbase containers would be communicating directly between themselves, and not through a service? (And therefore couchbase-service.yaml shouldn't need port 4369 to be defined.)

For the containerPort parameter in couchbase-server.yaml, is that saying that only that port will be visible to other containers? If so, we'll need to fix that, since Couchbase Server has a bunch of ports.
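For reference, Couchbase nodes also talk to each other over 4369 (the Erlang port mapper) and the Erlang node data ports (21100-21299 by default), on top of the 8091/8092/11207/11210/11211/18091/18092 ports already listed in couchbase-service.yaml. A rough sketch of what extra containerPort entries in couchbase-server.yaml might look like is below; the surrounding spec and indentation are assumptions, and as far as I understand containerPort is mostly informational in Kubernetes rather than a firewall, so it documents ports but does not by itself open or close them:
ports:
  - containerPort: 8091
  - containerPort: 4369
  - containerPort: 21100   # no range syntax exists, so each node data port would need its own entry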
@derekperkins oh, with docker exec you need to pass the container id. If you first run docker ps
, you can get the container id. Then run sudo docker exec -ti container_id bash
@tleyden That got me into the container, and when I ran ping 10.248.1.6
, it found it. I did get an error: bash: telnet: command not found
, so I couldn't try out the second part.
You can just install telnet via yum install -y telnet
Ok, that connected.
[root@couchbase-controller-v7flt /]# telnet 10.248.1.6 4369
Trying 10.248.1.6...
Connected to 10.248.1.6.
I logged into the other one and it worked too, though not on ipv6.
[root@couchbase-controller-8r631 /]# telnet localhost 4369
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
I tried again from scratch, wondering if I had messed something up logging into the servers, and it's still the same story. I get all the way through a successful login, it flashes the admin panel, and then it kicks me out.
Is the problem that it can't create the default bucket?
2015/06/25 21:01:45 POST to http://10.212.0.4:8091/pools/default/buckets
2015/06/25 21:01:45 CreateBucket error: Failed to POST to http://10.212.0.4:8091/pools/default/buckets. Status code: 400. Body: {"errors":{"proxyPort":"port is already in use"},"summaries":{"ramSummary":{"total":1340080128,"otherBuckets":0,"nodesCount":1,"perNodeMegs":128,"thisAlloc":134217728,"thisUsed":0,"free":1205862400},"hddSummary":{"total":105553100800,"otherData":3166593024,"otherBuckets":0,"thisUsed":0,"free":102386507776}}}
@derekperkins the logout issue is probably caused by the service alternating requests between the 2 pods. In that case, you can try scaling down to 1 pod, making config changes, and then scaling back up.
kubectl scale --replicas=1 rc couchbase-controller
#access web ui
kubectl scale --replicas=2 rc couchbase-controller
@tleyden yes, if a port needs to be external then it should go into the service - like the Web Administration Port. If ports are internal only, then they just need to be defined in the container (with expose).
I can think of 2 ways to fix the service load balancing issue.

1. In a couchbase-server-master.yaml, include a label master and add a selector to the service. There will only be 1 master. Then create a couchbase-server-slave.yaml which can be scaled up or down.

2. Access a pod directly:

$ kubectl get pods
# Find the name of couchbase pod #1 and replace it in the link below.
# Open this link in your browser.
https://<K8s master>/api/v1beta3/proxy/namespaces/default/pods/couchbase-controller-mnke4:8091/
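For option 1, a rough sketch of the label/selector pairing (the file names and the role label are assumptions, not something already in the repo):
# pod template in couchbase-server-master.yaml:
labels:
  name: couchbase-server
  role: master
# service used for the admin UI, selecting only the master pod:
selector:
  name: couchbase-server
  role: master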
When I scale back up, the brief glimpses I get of the console now show 3 servers being down, even though there were only 2 to begin with, and at least 1 or 2 must be running, since something is booting me out.

@bcbroussard At the beginning of this process, I was able to log in to the web UI directly via any of the boxes' external IPs, but I can't now.

Without the web UI, how can I check the server status? I have no way of knowing how many servers it thinks are online or whether rebalancing works properly after scaling the system back up.
I'm not trying to be a pest, just wondering what the timeline looks like for getting this issue fixed.
@derekperkins np, thanks for checking in. I can't make any timeline promises, but I will try to take a look at this as soon as I can.
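In the meantime, one way to check cluster status without the web UI is Couchbase's REST API; a minimal sketch using values that appear earlier in this thread (any node IP plus the cbadmin credentials the sidekick stored in etcd):
$ curl -s -u cbadmin:<password> http://10.248.0.4:8091/pools/default | python -mjson.tool | grep -E '"(hostname|status|clusterMembership)"'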
Thanks for that. To get something up and running in the short term, should I just look at throwing CB 4.0 onto Compute Engine while GKE and this project mature? Is your article on it (http://tleyden.github.io/blog/2014/06/22/running-couchbase-server-on-gce/) still valid a year later?
The last PR merge broke things badly.
From @derekperkins:
I never set the $COUCHBASE_SERVER_SERVICE_HOST environment variable, so it failing wasn't a surprise -- I'm just not sure where I'm supposed to be pointing it.
@bcbroussard any ideas? @derekperkins is on GKE as far as I know.