couchbase / kubernetes

Deprecated. Please use the Couchbase Autonomous Operator
https://www.couchbase.com/products/cloud/kubernetes

Unable to connect to server: Get http://$COUCHBASE_SERVER_SERVICE_HOST:8091 #11

Open tleyden opened 9 years ago

tleyden commented 9 years ago

The last PR merge broke things badly.

From @derekperkins

I just tried recreating the sync gateway controller, but I'm not sure what to set the IP as

Opening Couchbase database default on <http://$COUCHBASE_SERVER_SERVICE_HOST:8091>
19:15:26.744388 FATAL: Error opening database: 502 Unable to connect to server: Get http://$COUCHBASE_SERVER_SERVICE_HOST:8091/pools: dial tcp: lookup $COUCHBASE_SERVER_SERVICE_HOST: no such host -- rest.RunServer() at config.go:415

I never set the $COUCHBASE_SERVER_SERVICE_HOST environment variable, so it failing wasn't a surprise -- I'm just not sure where I'm supposed to be pointing it.

@bcbroussard any ideas? @derekperkins is on GKE as far as I know.

derekperkins commented 9 years ago

Correct, I'm running on GKE, and I was testing from scratch.

derekperkins commented 9 years ago

@tleyden Have you successfully launched a cluster since merging this commit?

tleyden commented 9 years ago

Nope

derekperkins commented 9 years ago

Won't this issue be resolved once you move app-etcd into a service? It looks like you got an answer to your questions at https://groups.google.com/forum/#!msg/google-containers/rFIFD6Y0_Ew/PlYh0z7weLEJ

tleyden commented 9 years ago

> Won't this issue be resolved once you move app-etcd into a service?

I don't think so, because I think this is basically expecting an env variable to be set for the couchbase server service, not the app-etcd service. Either that env variable is not being set, or it's not being resolved.

derekperkins commented 9 years ago

Is it just the env variable part of that commit that broke things? What if you just revert that line of the PR?

tleyden commented 9 years ago

I don't think that aspect was working in an automated fashion anyway -- the README basically instructed you to go change the value of etcd.pod.ip in your local copy of the file.

@bcbroussard's approach feels like the right one; we just need to figure out why it's not working on GKE. Btw, he and I are scheduled to do a pair coding session this Friday (or if not him, someone on the Kismatic team).

derekperkins commented 9 years ago

Ok, sounds good. I just wish I hadn't killed my working cluster, because I'm stuck now. Let me know how your coding session turns out. I'm hoping to get this up and as stable as possible in the next few days. Thanks for always being so responsive.

bcbroussard commented 9 years ago

@derekperkins the $COUCHBASE_SERVER_SERVICE_HOST environment variable is automatically set by Kubernetes when the pod starts (as long as the couchbase-server service has been created before the pod is started). Are you using a different name?
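For reference, a quick way to confirm (a sketch -- Kubernetes derives these variable names from the service name, uppercased with dashes turned into underscores, so the service couchbase-server yields COUCHBASE_SERVER_SERVICE_HOST and COUCHBASE_SERVER_SERVICE_PORT):

# from a shell inside any pod created after the service:
env | grep COUCHBASE_SERVER_SERVICE
# expected output (values illustrative):
# COUCHBASE_SERVER_SERVICE_HOST=10.251.x.x
# COUCHBASE_SERVER_SERVICE_PORT=8091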

derekperkins commented 9 years ago

I was using couchbase-server as my cluster name. Are you saying that the service has to completely boot before creating the controllers? I was running those pretty close to each other if I remember correctly.

bcbroussard commented 9 years ago

@derekperkins when using environment variables, yes, the service has to be created before the pod. An alternative approach is to use the internal dns name, which should be something like couchbase-server.default.svc.cluster.local or couchbase-server.default.svc.kubernetes.local if you are using the default skydns setup.
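For example, to sanity-check the DNS route from inside a pod (a sketch; assumes the container image has curl and that your cluster domain is cluster.local):

# should resolve via skydns and return the Couchbase REST root:
curl http://couchbase-server.default.svc.cluster.local:8091/pools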

You can check a pod's environment variables by ssh-ing into a node with the running pod and running:

docker ps
docker inspect CONTAINER_ID
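If the inspect output is too noisy, something like this narrows it to just the environment (assumes a docker version with --format template support):

docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' CONTAINER_ID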
tleyden commented 9 years ago

GKE does have skydns, but I think ideally we wanted to avoid/minimize using skydns since it's not guaranteed to be on every kubernetes cluster.

derekperkins commented 9 years ago

I tried switching to the skydns address that was in the yaml file, but that didn't work for me either.

bcbroussard commented 9 years ago

@derekperkins what version of Kubernetes are you running?

derekperkins commented 9 years ago

I think GKE is on 0.18.2

derekperkins commented 9 years ago

I'm just following the instructions exactly as laid out in the readme file for this repo.

tleyden commented 9 years ago

Gonna assume https://github.com/couchbase/kubernetes/pull/12 fixes this, so closing this. Re-open if not.

derekperkins commented 9 years ago

It's still not booting for me. My sidekicks are barely even logging anything. This is the sum total of their logs. It actually took me a while to realize that that was the log and not a kubectl logs error message. :)

$ kubectl logs couchbase-controller-ybzp5 couchbase-sidekick
2015/06/23 00:09:19 No target given in args
bcbroussard commented 9 years ago

@derekperkins what does kubectl get endpoints show? Are there endpoints for every service?

derekperkins commented 9 years ago

Sorry for the delay. My machine borked today and I ended up having to get a new one.

I tried from scratch again and still got nothing, but I did get a different error from the couchbase-sidekick. It looks like the etcd constants aren't getting set properly. I pulled and ran this just now, so my repo is up to date.

$ kubectl logs couchbase-controller-hh1by couchbase-sidekick
2015/06/24 08:05:52 Update-Wrapper: updating to latest code
github.com/tleyden/couchbase-cluster-go (download)
github.com/coreos/fleet (download)
github.com/coreos/go-systemd (download)
github.com/tleyden/go-etcd (download)
github.com/docopt/docopt-go (download)
2015/06/24 08:05:55 target == SKIP
2015/06/24 08:05:55 args: [couchbase-cluster start-couchbase-sidekick --discover-local-ip --etcd-servers http://:], target: couchbase-cluster
2015/06/24 08:05:55 remainingargs: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://:], target: couchbase-cluster
2015/06/24 08:05:55 Invoking target: couchbase-cluster with args: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://:]
2015/06/24 08:05:55 ExtractEtcdServerList returning: http://:
2015/06/24 08:05:55 args: map[--etcd-servers:http://: start-couchbase-sidekick:true --discover-local-ip:true --k8s-service-name:<nil> remove-and-rebalance:false get-live-node-ip:false --help:false wait-until-running:false --local-ip:<nil>]
2015/06/24 08:05:55 need to discover local ip..
2015/06/24 08:05:55 Discovered local ip: 10.248.0.4
2015/06/24 08:05:55 Connect to explicit etcd servers: [http://:]
2015/06/24 08:05:55 Error getting key: /couchbase.com/userpass.  Err: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0].  Retrying in 10 secs

Here is what I have for kubectl get endpoints

NAME                ENDPOINTS
couchbase-service   10.248.0.4:18092,10.248.1.4:18092,10.248.0.4:11207 + 11 more...
kube-dns            10.248.1.3:53,10.248.1.3:53
kubernetes          130.211.166.201:443
derekperkins commented 9 years ago

After reverting the couchbase-server.yaml file to a hardcoded IP address and recreating the pods, the server then came online and the logs show it functioning inside the event loop. Despite having added the proper tags and firewall rules, I'm not able to browse to the Admin screen, as my connection is refused. I don't know what could be causing that.

tleyden commented 9 years ago

> After reverting the couchbase-server.yaml file to a hardcoded IP address and recreating the pods

Do you mean for the --etcd-servers argument?

derekperkins commented 9 years ago

Correct. --etcd-servers http://10.248.0.3:2379

bcbroussard commented 9 years ago

@derekperkins thanks for your feedback -- now I know what was going on. From your output I can see two things:

  1. the app-etcd service does not exist
  2. /couchbase.com/userpass wasn't set in etcd

The main README is missing a step to create the app-etcd service, even though the file exists in the repo. @tleyden can that step be added in? Then the couchbase-server.yaml file will have the proper environment variables populated.
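In other words, the missing step would be roughly this (treating the file name as an assumption about the repo layout):

# must run before creating the couchbase-server pods/controllers:
kubectl create -f app-etcd-service.yaml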

tleyden commented 9 years ago

@bcbroussard done -- https://github.com/couchbase/kubernetes/commit/6c1ea5cbe62a22725ed6fed3383aafb622ff8b0f

derekperkins commented 9 years ago

I got a little farther, but still can't access the admin. I added the app-etcd service, deleted the couchbase-server controllers and recreated them using the default yaml file from the repo.

$ kubectl get endpoints
NAME                   ENDPOINTS
app-etcd-service       10.248.0.3:2380,10.248.0.3:2379
couchbase-service      10.248.0.8:18092,10.248.1.6:18092,10.248.0.8:11207 + 11 more...
kube-dns               10.248.1.3:53,10.248.1.3:53
kubernetes             130.211.166.201:443
sync-gateway-service   10.248.0.6:4984,10.248.0.7:4984
$ kubectl get pod
POD                                                             IP           CONTAINER(S)            IMAGE(S)                                         HOST                                                     LABELS                                                           STATUS    CREATED      MESSAGE
app-etcd                                                        10.248.0.3                                                                            gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252   name=app-etcd                                                    Running   16 hours
                                                                             app-etcd                tleyden5iwx/etcd-discovery                                                                                                                                                 Running   16 hours
couchbase-controller-64643                                      10.248.1.6                                                                            gke-couchbase-server-5ad717cc-node-q8mu/10.240.137.32    name=couchbase-server                                            Running   9 minutes
                                                                             couchbase-sidekick      tleyden5iwx/couchbase-cluster-go:latest                                                                                                                                    Running   9 minutes
                                                                             couchbase-server        couchbase/server:enterprise-4.0.0-beta                                                                                                                                     Running   9 minutes
couchbase-controller-iq7b6                                      10.248.0.8                                                                            gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252   name=couchbase-server                                            Running   9 minutes
                                                                             couchbase-sidekick      tleyden5iwx/couchbase-cluster-go:latest                                                                                                                                    Running   9 minutes
                                                                             couchbase-server        couchbase/server:enterprise-4.0.0-beta                                                                                                                                     Running   9 minutes
fluentd-cloud-logging-gke-couchbase-server-5ad717cc-node-2tr4   10.248.0.2                                                                            gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252   <none>                                                           Running   29 hours
                                                                             fluentd-cloud-logging   gcr.io/google_containers/fluentd-gcp:1.6                                                                                                                                   Running   29 hours
fluentd-cloud-logging-gke-couchbase-server-5ad717cc-node-q8mu   10.248.1.2                                                                            gke-couchbase-server-5ad717cc-node-q8mu/10.240.137.32    <none>                                                           Running   29 hours
                                                                             fluentd-cloud-logging   gcr.io/google_containers/fluentd-gcp:1.6                                                                                                                                   Running   29 hours
kube-dns-v3-u4s0f                                               10.248.1.3                                                                            gke-couchbase-server-5ad717cc-node-q8mu/10.240.137.32    k8s-app=kube-dns,kubernetes.io/cluster-service=true,version=v3   Running   29 hours
                                                                             skydns                  gcr.io/google_containers/skydns:2015-03-11-001                                                                                                                             Running   29 hours
                                                                             kube2sky                gcr.io/google_containers/kube2sky:1.9                                                                                                                                      Running   29 hours
                                                                             etcd                    gcr.io/google_containers/etcd:2.0.9                                                                                                                                        Running   29 hours
sync-gateway-74t4v                                              10.248.0.7                                                                            gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252   name=sync-gateway                                                Running   16 hours
                                                                             sync-gateway            couchbase/sync-gateway                                                                                                                                                     Running   37 seconds   last termination: exit code 1
sync-gateway-cjoxc                                              10.248.0.6                                                                            gke-couchbase-server-5ad717cc-node-2tr4/10.240.197.252   name=sync-gateway                                                Running   16 hours
                                                                             sync-gateway            couchbase/sync-gateway                                                                                                                                                     Running   37 seconds   last termination: exit code 1
$ kubectl logs couchbase-controller-iq7b6 couchbase-sidekick
2015/06/25 00:12:43 Update-Wrapper: updating to latest code
github.com/tleyden/couchbase-cluster-go (download)
github.com/coreos/fleet (download)
github.com/coreos/go-systemd (download)
github.com/tleyden/go-etcd (download)
github.com/docopt/docopt-go (download)
2015/06/25 00:12:47 target == SKIP
2015/06/25 00:12:47 args: [couchbase-cluster start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.242.85:2379], target: couchbase-cluster
2015/06/25 00:12:47 remainingargs: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.242.85:2379], target: couchbase-cluster
2015/06/25 00:12:47 Invoking target: couchbase-cluster with args: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.242.85:2379]
2015/06/25 00:12:47 ExtractEtcdServerList returning: http://10.251.242.85:2379
2015/06/25 00:12:47 args: map[--discover-local-ip:true --k8s-service-name:<nil> get-live-node-ip:false --local-ip:<nil> --etcd-servers:http://10.251.242.85:2379 start-couchbase-sidekick:true remove-and-rebalance:false --help:false wait-until-running:false]
2015/06/25 00:12:47 need to discover local ip..
2015/06/25 00:12:47 Discovered local ip: 10.248.0.8
2015/06/25 00:12:47 Connect to explicit etcd servers: [http://10.251.242.85:2379]
2015/06/25 00:12:47 BecomeFirstClusterNode()
2015/06/25 00:12:47 Created key: /couchbase.com/couchbase-node-state
2015/06/25 00:12:47 Got error Get http://10.248.0.8:8091/pools: dial tcp 10.248.0.8:8091: connection refused trying to fetch details.  Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 00:12:57 Version: 4.0.0-2213-enterprise
2015/06/25 00:12:57 We became first cluster node, init cluster and bucket
2015/06/25 00:12:57 ClusterInit()
2015/06/25 00:12:57 IsClusterPasswordSet()
2015/06/25 00:12:57 ClusterSetPassword()
2015/06/25 00:12:57 Using default username/password
2015/06/25 00:12:57 POST to http://10.248.0.8:8091/settings/web
2015/06/25 00:12:57 Total RAM (MB) on machine: 3711
2015/06/25 00:12:57 Attempting to set cluster ram to: 2783 MB
2015/06/25 00:12:57 Using username/password pulled from etcd
2015/06/25 00:12:57 POST to http://10.248.0.8:8091/pools/default
2015/06/25 00:12:57 CreateBucketWithRetries(): {Name:default RamQuotaMB:128 AuthType:none ReplicaNumber:0}
2015/06/25 00:12:57 Using username/password pulled from etcd
2015/06/25 00:12:57 POST to http://10.248.0.8:8091/pools/default/buckets
2015/06/25 00:12:57 CreateBucket error: Failed to POST to http://10.248.0.8:8091/pools/default/buckets.  Status code: 400.  Body: {"errors":{"proxyPort":"port is already in use"},"summaries":{"ramSummary":{"total":2918187008,"otherBuckets":0,"nodesCount":1,"perNodeMegs":128,"thisAlloc":134217728,"thisUsed":0,"free":2783969280},"hddSummary":{"total":105553100800,"otherData":6333186048,"otherBuckets":0,"thisUsed":0,"free":99219914752}}}
2015/06/25 00:12:57 Sleeping 0 seconds
2015/06/25 00:12:57 Using username/password pulled from etcd
2015/06/25 00:12:57 POST to http://10.248.0.8:8091/pools/default/buckets
2015/06/25 00:12:57 CreateBucket succeeded
2015/06/25 00:12:57 EventLoop()
bcbroussard commented 9 years ago

@derekperkins can you run kubectl describe services couchbase-service? You should see your external load balancer IP listed under

LoadBalancer Ingress:   104.154.88.223

Then you should see the admin at http://104.154.88.223:8091

derekperkins commented 9 years ago

I will in 5-10 minutes. I nuked the cluster and am starting from scratch.

derekperkins commented 9 years ago

Getting closer! After getting the LoadBalancer Ingress IP, I visited it and was greeted with the username / password prompt. After logging in, it then redirected me to the standard setup page, not the actual console.

derekperkins commented 9 years ago

Here's the output of kubectl describe services couchbase-service

$ kubectl describe services couchbase-service
W0624 19:19:43.121213   68804 request.go:291] field selector: v1beta3 - events - involvedObject.name - couchbase-service: need to check if this is versioned correctly.
W0624 19:19:43.121341   68804 request.go:291] field selector: v1beta3 - events - involvedObject.namespace - default: need to check if this is versioned correctly.
W0624 19:19:43.121348   68804 request.go:291] field selector: v1beta3 - events - involvedObject.kind - Service: need to check if this is versioned correctly.
W0624 19:19:43.121355   68804 request.go:291] field selector: v1beta3 - events - involvedObject.uid - 51c43579-1ad7-11e5-a755-42010af0d2c1: need to check if this is versioned correctly.
Name:           couchbase-service
Labels:         <none>
Selector:       name=couchbase-server
Type:           LoadBalancer
IP:         10.251.241.7
LoadBalancer Ingress:   104.154.57.215
Port:           webadministrationport   8091/TCP
NodePort:       webadministrationport   30661/TCP
Endpoints:      10.248.0.4:8091,10.248.1.4:8091
Port:           couchbaseapiport    8092/TCP
NodePort:       couchbaseapiport    30892/TCP
Endpoints:      10.248.0.4:8092,10.248.1.4:8092
Port:           internalexternalbucketssl   11207/TCP
NodePort:       internalexternalbucketssl   31701/TCP
Endpoints:      10.248.0.4:11207,10.248.1.4:11207
Port:           internalexternalbucket  11210/TCP
NodePort:       internalexternalbucket  32703/TCP
Endpoints:      10.248.0.4:11210,10.248.1.4:11210
Port:           clientinterfaceproxy    11211/TCP
NodePort:       clientinterfaceproxy    31503/TCP
Endpoints:      10.248.0.4:11211,10.248.1.4:11211
Port:           internalresthttps   18091/TCP
NodePort:       internalresthttps   32345/TCP
Endpoints:      10.248.0.4:18091,10.248.1.4:18091
Port:           internalcapihttps   18092/TCP
NodePort:       internalcapihttps   30695/TCP
Endpoints:      10.248.0.4:18092,10.248.1.4:18092
Session Affinity:   None
No events.
derekperkins commented 9 years ago

Ok, I think I found the problem...

The first node goes up fine

kubectl logs couchbase-controller-8r631 couchbase-sidekick
2015/06/25 01:14:30 Update-Wrapper: updating to latest code
github.com/tleyden/couchbase-cluster-go (download)
github.com/coreos/fleet (download)
github.com/coreos/go-systemd (download)
github.com/tleyden/go-etcd (download)
github.com/docopt/docopt-go (download)
2015/06/25 01:14:33 target == SKIP
2015/06/25 01:14:33 args: [couchbase-cluster start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379], target: couchbase-cluster
2015/06/25 01:14:33 remainingargs: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379], target: couchbase-cluster
2015/06/25 01:14:33 Invoking target: couchbase-cluster with args: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379]
2015/06/25 01:14:33 ExtractEtcdServerList returning: http://10.251.253.222:2379
2015/06/25 01:14:33 args: map[--help:false wait-until-running:false start-couchbase-sidekick:true --discover-local-ip:true remove-and-rebalance:false get-live-node-ip:false --etcd-servers:http://10.251.253.222:2379 --local-ip:<nil> --k8s-service-name:<nil>]
2015/06/25 01:14:33 need to discover local ip..
2015/06/25 01:14:33 Discovered local ip: 10.248.0.4
2015/06/25 01:14:33 Connect to explicit etcd servers: [http://10.251.253.222:2379]
2015/06/25 01:14:33 BecomeFirstClusterNode()
2015/06/25 01:14:33 Created key: /couchbase.com/couchbase-node-state
2015/06/25 01:14:33 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details.  Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:14:43 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details.  Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:14:53 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details.  Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:15:03 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details.  Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:15:13 Got error Get http://10.248.0.4:8091/pools: dial tcp 10.248.0.4:8091: connection refused trying to fetch details.  Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:15:23 Version: 4.0.0-2213-enterprise
2015/06/25 01:15:23 We became first cluster node, init cluster and bucket
2015/06/25 01:15:23 ClusterInit()
2015/06/25 01:15:23 IsClusterPasswordSet()
2015/06/25 01:15:23 ClusterSetPassword()
2015/06/25 01:15:23 Using default username/password
2015/06/25 01:15:23 POST to http://10.248.0.4:8091/settings/web
2015/06/25 01:15:23 Total RAM (MB) on machine: 3711
2015/06/25 01:15:23 Attempting to set cluster ram to: 2783 MB
2015/06/25 01:15:23 Using username/password pulled from etcd
2015/06/25 01:15:23 POST to http://10.248.0.4:8091/pools/default
2015/06/25 01:15:23 CreateBucketWithRetries(): {Name:default RamQuotaMB:128 AuthType:none ReplicaNumber:0}
2015/06/25 01:15:23 Using username/password pulled from etcd
2015/06/25 01:15:23 POST to http://10.248.0.4:8091/pools/default/buckets
2015/06/25 01:15:23 CreateBucket error: Failed to POST to http://10.248.0.4:8091/pools/default/buckets.  Status code: 400.  Body: {"errors":{"proxyPort":"port is already in use"},"summaries":{"ramSummary":{"total":2918187008,"otherBuckets":0,"nodesCount":1,"perNodeMegs":128,"thisAlloc":134217728,"thisUsed":0,"free":2783969280},"hddSummary":{"total":105553100800,"otherData":4222124032,"otherBuckets":0,"thisUsed":0,"free":101330976768}}}
2015/06/25 01:15:23 Sleeping 0 seconds
2015/06/25 01:15:23 Using username/password pulled from etcd
2015/06/25 01:15:23 POST to http://10.248.0.4:8091/pools/default/buckets
2015/06/25 01:15:23 CreateBucket succeeded
2015/06/25 01:15:23 EventLoop()

The second node goes up, then crashes on login

When both controllers looked good, I went to log in (I did this multiple times after deleting and recreating the controllers) and got the login screen. Once, I even got as far as the admin console for about 15 seconds, then it booted me back out into the setup screen. After getting kicked out, this is what the sidekick log looks like:

Failed to reach erlang port mapper. Timeout connecting to \"10.248.1.6\" on port \"4369\". This could be due to an incorrect host/port combination or a firewall in place between the servers.

kubectl logs couchbase-controller-v7flt couchbase-sidekick
2015/06/25 01:27:31 Update-Wrapper: updating to latest code
github.com/tleyden/couchbase-cluster-go (download)
github.com/coreos/fleet (download)
github.com/coreos/go-systemd (download)
github.com/tleyden/go-etcd (download)
github.com/docopt/docopt-go (download)
2015/06/25 01:27:34 target == SKIP
2015/06/25 01:27:34 args: [couchbase-cluster start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379], target: couchbase-cluster
2015/06/25 01:27:34 remainingargs: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379], target: couchbase-cluster
2015/06/25 01:27:34 Invoking target: couchbase-cluster with args: [start-couchbase-sidekick --discover-local-ip --etcd-servers http://10.251.253.222:2379]
2015/06/25 01:27:34 ExtractEtcdServerList returning: http://10.251.253.222:2379
2015/06/25 01:27:34 args: map[--help:false --etcd-servers:http://10.251.253.222:2379 --discover-local-ip:true --k8s-service-name:<nil> remove-and-rebalance:false wait-until-running:false start-couchbase-sidekick:true --local-ip:<nil> get-live-node-ip:false]
2015/06/25 01:27:34 need to discover local ip..
2015/06/25 01:27:34 Discovered local ip: 10.248.1.6
2015/06/25 01:27:34 Connect to explicit etcd servers: [http://10.251.253.222:2379]
2015/06/25 01:27:34 BecomeFirstClusterNode()
2015/06/25 01:27:34 Key /couchbase.com/couchbase-node-state already exists
2015/06/25 01:27:34 Got error Get http://10.248.1.6:8091/pools: dial tcp 10.248.1.6:8091: connection refused trying to fetch details.  Assume that the cluster is not up yet, sleeping and will retry
2015/06/25 01:27:44 Version: 4.0.0-2213-enterprise
2015/06/25 01:27:44 JoinExistingCluster() called
2015/06/25 01:27:44 Calling FindLiveNode()
2015/06/25 01:27:44 Couchbase node ip: 10.248.0.4
2015/06/25 01:27:44 Verifying REST service at http://10.248.0.4:8091/ to be up
2015/06/25 01:27:44 liveNodeIp: 10.248.0.4
2015/06/25 01:27:44 JoinLiveNode() called with 10.248.0.4
2015/06/25 01:27:44 GetClusterNodes() called with: 10.248.0.4
2015/06/25 01:27:44 No cluster node found for 10.248.1.6.  Not retrying
2015/06/25 01:27:44 WaitUntilInClusterAndHealthy() returned error: Unable to find node with hostname 10.248.1.6 in [map[services:[kv] otpCookie:hzmbyaycqepsfmxj ports:map[sslProxy:11214 httpsMgmt:18091 httpsCAPI:18092 proxy:11211 direct:11210] couchApiBase:http://10.248.0.4:8092/ couchApiBaseHTTPS:https://10.248.0.4:18092/ otpNode:ns_1@127.0.0.1 thisNode:true clusterCompatibility:262144 version:4.0.0-2213-enterprise interestingStats:map[curr_items:0 curr_items_tot:0 ep_bg_fetched:0 ops:0 vb_replica_curr_items:0 cmd_get:0 couch_docs_actual_disk_size:4.257466e+06 couch_docs_data_size:4.241408e+06 couch_views_actual_disk_size:0 couch_views_data_size:0 get_hits:0 mem_used:3.8265264e+07] uptime:747 clusterMembership:active recoveryType:none hostname:10.248.0.4:8091 os:x86_64-unknown-linux-gnu systemStats:map[mem_free:3.047079936e+09 cpu_utilization_rate:3.092783505154639 swap_total:0 swap_used:0 mem_total:3.892027392e+09] memoryTotal:3.892027392e+09 mcdMemoryAllocated:2969 status:healthy memoryFree:3.047079936e+09 mcdMemoryReserved:2969]].  Call AddNodeRetry()
2015/06/25 01:27:44 AddNode()
2015/06/25 01:27:44 AddNode posting to http://10.248.0.4:8091/controller/addNode with data: hostname=10.248.1.6&password=7vmRpf9wZLQwkvbc&user=cbadmin
2015/06/25 01:27:44 Using username/password pulled from etcd
2015/06/25 01:27:44 POST to http://10.248.0.4:8091/controller/addNode
2015/06/25 01:27:49 AddNode failed with err: Failed to POST to http://10.248.0.4:8091/controller/addNode.  Status code: 400.  Body: ["Failed to reach erlang port mapper. Timeout connecting to \"10.248.1.6\" on port \"4369\".  This could be due to an incorrect host/port combination or a firewall in place between the servers."].  Will retry in 10 secs
2015/06/25 01:27:59 AddNode()
2015/06/25 01:27:59 AddNode posting to http://10.248.0.4:8091/controller/addNode with data: hostname=10.248.1.6&password=7vmRpf9wZLQwkvbc&user=cbadmin
2015/06/25 01:27:59 Using username/password pulled from etcd
2015/06/25 01:27:59 POST to http://10.248.0.4:8091/controller/addNode
2015/06/25 01:28:04 AddNode failed with err: Failed to POST to http://10.248.0.4:8091/controller/addNode.  Status code: 400.  Body: ["Failed to reach erlang port mapper. Timeout connecting to \"10.248.1.6\" on port \"4369\".  This could be due to an incorrect host/port combination or a firewall in place between the servers."].  Will retry in 20 secs
2015/06/25 01:28:24 AddNode()
2015/06/25 01:28:24 AddNode posting to http://10.248.0.4:8091/controller/addNode with data: hostname=10.248.1.6&password=7vmRpf9wZLQwkvbc&user=cbadmin
2015/06/25 01:28:24 Using username/password pulled from etcd
2015/06/25 01:28:24 POST to http://10.248.0.4:8091/controller/addNode
2015/06/25 01:28:29 AddNode failed with err: Failed to POST to http://10.248.0.4:8091/controller/addNode.  Status code: 400.  Body: ["Failed to reach erlang port mapper. Timeout connecting to \"10.248.1.6\" on port \"4369\".  This could be due to an incorrect host/port combination or a firewall in place between the servers."].  Will retry in 30 secs
$ kubectl get endpoints
NAME                ENDPOINTS
app-etcd-service    10.248.1.3:2380,10.248.1.3:2379
couchbase-service   10.248.0.4:18092,10.248.1.6:18092,10.248.0.4:11207 + 11 more...
kube-dns            10.248.0.3:53,10.248.0.3:53
kubernetes          130.211.147.104:443
tleyden commented 9 years ago

So 10.248.1.6 is trying to join itself to existing cluster member 10.248.0.4, but when 10.248.0.4 tries to callback to 10.248.1.6 on port 4369 it fails.

I would suggest "docker exec'ing" into the couchbase container running on 10.248.0.4, pinging 10.248.1.6 to make sure there is a route, and then trying telnet 10.248.1.6 4369.

Another thing you could try is "docker exec'ing" into the couchbase container running on 10.248.1.6 and running telnet localhost 4369.
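Spelled out, the checks would look roughly like this (node and container names are placeholders):

gcloud compute ssh <node-running-the-pod>
docker ps                           # note the couchbase/server container ID
docker exec -ti CONTAINER_ID bash
ping 10.248.1.6                     # is there a route to the other pod?
telnet 10.248.1.6 4369              # is the erlang port mapper reachable?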

derekperkins commented 9 years ago

I'm not able to get into either of those containers for some weird reason. It's telling me that the host doesn't exist when running kubectl exec, and it's throwing a TLS error on docker exec from inside the VM.

Instances are up and running

$ gcloud compute instances list
NAME                                    ZONE          MACHINE_TYPE  PREEMPTIBLE INTERNAL_IP         STATUS
gke-couchbase-server-96b2a542-node-6sep us-central1-b n1-standard-1             10.240.13.115    RUNNING
gke-couchbase-server-96b2a542-node-c278 us-central1-b n1-standard-1             10.240.70.131    RUNNING

Logging into 10.248.0.4

$ kubectl exec couchbase-controller-8r631 bash
error: Unable to upgrade connection: {
  "kind": "Status",
  "apiVersion": "v1beta3",
  "metadata": {},
  "status": "Failure",
  "message": "dial tcp: lookup gke-couchbase-server-96b2a542-node-6sep: no such host",
  "code": 500
}

Logging into 10.248.1.6

kubectl exec couchbase-controller-v7flt bash
error: Unable to upgrade connection: {
  "kind": "Status",
  "apiVersion": "v1beta3",
  "metadata": {},
  "status": "Failure",
  "message": "dial tcp: lookup gke-couchbase-server-96b2a542-node-c278: no such host",
  "code": 500
}

After gcloud compute ssh

$ gcloud compute ssh gke-couchbase-server-96b2a542-node-c278
...
$ docker exec 10.248.0.4 bash
FATA[0000] Post http:///var/run/docker.sock/v1.18/containers/10.248.0.4/exec: dial unix /var/run/docker.sock: permission denied. Are you trying to connect to a TLS-enabled daemon without TLS?
tleyden commented 9 years ago

What about sudo docker exec 10.248.0.4 bash?

derekperkins commented 9 years ago

The command worked this time, but I logged into both VMs and neither one was able to connect.

$ sudo docker exec 10.248.0.4 bash
FATA[0000] Error response from daemon: no such id: 10.248.0.4
bcbroussard commented 9 years ago

@tleyden - I noticed that port 4369 is not defined in couchbase-service.yaml, and it's not exposed in the couchbase-server.yaml file either. Unless you can think of something else, it seems like that might be the culprit.

tleyden commented 9 years ago

I thought the couchbase containers would be communicating directly between themselves, not through a service? (And therefore couchbase-service.yaml shouldn't need port 4369 to be defined.)

For the containerPort parameter in couchbase-server.yaml, is that saying that only that port will be visible to other containers? If so, we'll need to fix that, since Couchbase Server has a bunch of ports.
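One way to check what the container actually exposes, run on the node hosting the pod (a sketch; the {{json}} helper may need a newer docker, and the port list is from the standard Couchbase Server network requirements):

sudo docker inspect --format '{{json .Config.ExposedPorts}}' CONTAINER_ID
# Couchbase Server needs, among others: 8091, 8092, 11207, 11209-11211,
# 4369 (epmd), and 21100-21299 (erlang inter-node communication)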

tleyden commented 9 years ago

@derekperkins oh, with docker exec you need to pass the container id. If you first run docker ps, you can get the container id. Then run sudo docker exec -ti container_id bash
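For example (the grep is just a convenience; CONTAINER_ID is a placeholder for the ID that docker ps prints):

sudo docker ps | grep couchbase/server   # first column is the container ID
sudo docker exec -ti CONTAINER_ID bash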

derekperkins commented 9 years ago

@tleyden That got me into the container, and when I ran ping 10.248.1.6, it found it. I did get an error: bash: telnet: command not found, so I couldn't try out the second part.

tleyden commented 9 years ago

You can just install telnet via yum install -y telnet

derekperkins commented 9 years ago

Ok, that connected.

[root@couchbase-controller-v7flt /]# telnet 10.248.1.6 4369
Trying 10.248.1.6...
Connected to 10.248.1.6.
derekperkins commented 9 years ago

I logged into the other one and it worked too, though not over IPv6.

[root@couchbase-controller-8r631 /]# telnet localhost 4369
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
derekperkins commented 9 years ago

I tried again from scratch, wondering if I had messed something up logging into the servers and it's still the same story. I get all the way through a successful login, it flashes the admin panel and then kicks me out.

Is the problem that it can't create the default bucket?

2015/06/25 21:01:45 POST to http://10.212.0.4:8091/pools/default/buckets
2015/06/25 21:01:45 CreateBucket error: Failed to POST to http://10.212.0.4:8091/pools/default/buckets.  Status code: 400.  Body: {"errors":{"proxyPort":"port is already in use"},"summaries":{"ramSummary":{"total":1340080128,"otherBuckets":0,"nodesCount":1,"perNodeMegs":128,"thisAlloc":134217728,"thisUsed":0,"free":1205862400},"hddSummary":{"total":105553100800,"otherData":3166593024,"otherBuckets":0,"thisUsed":0,"free":102386507776}}}
bcbroussard commented 9 years ago

@derekperkins the logout issue is probably caused by the service alternating requests between the 2 pods. In that case, you can try scaling down to 1 pod, making config changes, and then scaling back up.

kubectl scale --replicas=1 rc couchbase-controller
# access the web ui and make config changes
kubectl scale --replicas=2 rc couchbase-controller
bcbroussard commented 9 years ago

@tleyden yes, if the port needs to be external then it should go into the service - like the Web Administration Port. If they are internal only, then they just need to be defined in the container (with expose).

I can think of 2 ways to fix the service load balancing issue.

  1. Create a couchbase-server-master.yaml that includes a master label, and add a matching selector to the service; there will only ever be one master. Then create a couchbase-server-slave.yaml which can be scaled up or down (see the sketch after the snippet below).
  2. We could also access each pod directly using the proxy and configure it that way without the need of a service, but the web UI doesn't seem to work unless it's served from a root URL. This might require some changes to the admin UI.
# Accessing a pod directly
$ kubectl get pods
# Find the name of couchbase pod #1 and substitute it into the link below,
# then open the link in your browser.
https://<K8s master>/api/v1beta3/proxy/namespaces/default/pods/couchbase-controller-mnke4:8091/
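A rough sketch of option 1, expressed as the commands involved (file names and labels are hypothetical; none of this exists in the repo yet):

# one controller with replicas=1 and labels name=couchbase-server,role=master;
# couchbase-service.yaml would select role=master so the UI always hits one node
kubectl create -f couchbase-server-master.yaml
# a second controller with labels name=couchbase-server,role=slave, scaled freely
kubectl create -f couchbase-server-slave.yaml
kubectl scale --replicas=3 rc couchbase-controller-slave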
derekperkins commented 9 years ago

When I scale back up, the brief glimpses I get of the console now show 3 servers down, even though there were only 2 to begin with, and at least 1 or 2 must be running, because it keeps booting me out.

derekperkins commented 9 years ago

@bcbroussard At the beginning of this process I was able to log in to the web UI directly via any of the boxes' external IPs, but I can't now.

Without the web UI, how can I check the server status? I have no way of knowing how many servers it thinks are online, or whether rebalancing works properly after scaling the system back up.

derekperkins commented 9 years ago

I'm not trying to be a pest, just wondering what the timeline looks like for getting this issue fixed.

tleyden commented 9 years ago

@derekperkins np, thanks for checking in. I can't make any timeline promises, but I will try to take a look at this as soon as I can.

derekperkins commented 9 years ago

Thanks for that. To get something up and running in the short term, should I just look at throwing CB 4.0 onto Compute Engine while GKE and this project mature? Is your article on that still valid a year later? http://tleyden.github.io/blog/2014/06/22/running-couchbase-server-on-gce/