apache / couchdb-docker

Semi-official Apache CouchDB Docker images
https://github.com/apache/couchdb-docker
Apache License 2.0

Cluster working with IP address but not FQDN #214

Closed rrag closed 2 years ago

rrag commented 2 years ago

Expected Behavior

When DNS is properly configured for the container using --dns=x.x.x.x, cluster setup should succeed with an FQDN just as it does with an IP address.

Even when running docker with the --dns option, NODENAME does not work when it is set to an FQDN.

Current Behavior

Using an IP address in NODENAME leads to successful cluster setup, but with an FQDN the cluster setup times out:

{"error":"setup_error","reason":"Cluster setup timed out waiting for nodes to connect"}

Possible Solution

I have SSHed into the docker container and I am able to resolve the names with nslookup.

The documentation says:

Tricks with /etc/hosts and libresolv don’t work with Erlang. Either properly set up DNS and use fully-qualified domain names, or use IP addresses. DNS and FQDNs are preferred.

I have a properly configured bind and there are no tricks there.

Steps to Reproduce (for bugs)

With this command the cluster setup completes:

docker run \
  --rm \
  --name prodb_couch \
  --dns=10.139.200.202 \
  --mount type=bind,source=${HOME}/prodb/data/couchdb,target=/opt/couchdb/data \
  --mount type=bind,source=${HOME}/prodb/log/couchdb,target=/opt/couchdb/var/log \
  --mount type=bind,source=${HOME}/prodb/config/couchdb/local.d,target=/opt/couchdb/etc/local.d \
  -e ERL_FLAGS="-setcookie brumbrum" \
  -e NODENAME=10.139.200.208 \
  -p 5984:5984 \
  -p 4369:4369 \
  -p 9100:9100 \
  apache/couchdb:3.2.1

But with this command I am unable to complete the cluster setup and I get the error

{"error":"setup_error","reason":"Cluster setup timed out waiting for nodes to connect"}

docker run \
  --rm \
  --name prodb_couch \
  --dns=10.139.200.202 \
  --mount type=bind,source=${HOME}/prodb/data/couchdb,target=/opt/couchdb/data \
  --mount type=bind,source=${HOME}/prodb/log/couchdb,target=/opt/couchdb/var/log \
  --mount type=bind,source=${HOME}/prodb/config/couchdb/local.d,target=/opt/couchdb/etc/local.d \
  -e ERL_FLAGS="-setcookie brumbrum" \
  -e NODENAME=api01.prod.REDACTED \
  -p 5984:5984 \
  -p 4369:4369 \
  -p 9100:9100 \
  apache/couchdb:3.2.1

Context

I have a private DNS server set up and all my instances are configured to use it as the nameserver. It works well with all apps, and even within the docker container I can resolve these names correctly.

Your Environment

kocolosk commented 2 years ago

Interesting. One thing you can try is to skip setting NODENAME altogether and instead just set the name in ERL_FLAGS:

  -e ERL_FLAGS="-setcookie brumbrum -name couchdb" \

This will cause the Erlang VM to try to determine the FQDN of the container when it starts up and use that for the nodename. If it can't determine the FQDN the VM should crash on startup. This is how we start CouchDB when it's installed in Kubernetes using the couchdb-helm chart, and it definitely does use FQDNs correctly there with this same container image.

rrag commented 2 years ago

Yes it does crash when I use just -name couchdb

$ docker run \
  --rm \
  --name prodb_couch \
  --dns=10.139.200.202 \
  --mount type=bind,source=${HOME}/prodb/data/couchdb,target=/opt/couchdb/data \
  --mount type=bind,source=${HOME}/prodb/config/couchdb/local.d,target=/opt/couchdb/etc/local.d \
  -e ERL_FLAGS="-setcookie brumbrum -name couchdb" \
  -p 5984:5984 \
  -p 4369:4369 \
  -p 9100:9100 \
  apache/couchdb:3.2.1

2022-01-13 16:21:37 Can't set long node name!
Please check your configuration
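
The "Can't set long node name!" message above appears when -name is given without a host part and the Erlang VM cannot work out a fully qualified hostname by itself. A rough sanity check, sketched below, is to look at what the container reports as its FQDN before starting CouchDB. The helper name is_long_host is made up for illustration, and the "contains a dot" rule is an assumption about what long names need, not something taken from the CouchDB docs.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the string looks usable as the host part
# of an Erlang long name (-name), i.e. it contains at least one dot.
# Assumption: this only approximates the VM's own long-name check.
is_long_host() {
  case "$1" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Run inside the container: ask the OS for the FQDN, fall back to the short name.
fqdn="$(hostname -f 2>/dev/null || hostname)"
if is_long_host "$fqdn"; then
  echo "ok: '$fqdn' looks usable with -name"
else
  echo "warning: '$fqdn' has no domain part; -name couchdb will likely fail"
fi
```

A container started without an explicit hostname usually reports a short ID with no domain, which would match the crash above. Passing --hostname with the FQDN to docker run may help, though that was not tried in this thread.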

To debug this I also ran the erlang docker container on the 3 nodes:

docker run -it \
  --dns=10.139.200.202 \
  -p 4369:4369 \
  -p 9100:9100 \
  --rm erlang /bin/sh

Then on node 1

erl -name bus@api01.prod.blr1.gc -setcookie 'brumbrum' -kernel inet_dist_listen_min 9100 -kernel inet_dist_listen_max 9100

on node 2

erl -name car@api02.prod.blr1.gc -setcookie 'brumbrum' -kernel inet_dist_listen_min 9100 -kernel inet_dist_listen_max 9100

on node 3

erl -name van@api03.prod.blr1.gc -setcookie 'brumbrum' -kernel inet_dist_listen_min 9100 -kernel inet_dist_listen_max 9100

and from node 1

# net_kernel:connect_node('car@api02.prod.blr1.gc').
true

so connectivity using erl works

Would it be helpful if I record a video of these steps?

rrag commented 2 years ago

When I tried to start the container with

docker run -it \
  --rm \
  --name prodb_couch \
  --mount type=bind,source=${HOME}/prodb/data/couchdb,target=/opt/couchdb/data \
  --mount type=bind,source=${HOME}/prodb/config/couchdb/local.d,target=/opt/couchdb/etc/local.d \
  -e ERL_FLAGS="-setcookie brumbrum -name ts@10.139.200.203" \
  -p 5984:5984 \
  -p 4369:4369 \
  -p 9100:9100 \
  apache/couchdb:3.2.1

Notice -name ts@ipaddress instead of -name couchdb@ipaddress.

Even then the cluster creation fails with

{"error":"setup_error","reason":"Cluster setup timed out waiting for nodes to connect"}

but when I use -name couchdb@ipaddress the cluster is created successfully.

So here is what actually works:

couchdb@ipaddress

These names lead to "Cluster setup timed out waiting for nodes to connect":

notcouchdb@ipaddress
couchdb@f.q.d.n
notcouchdb@f.q.d.n
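
The four cases above vary two independent things: the part before the @ and the part after it. A minimal sketch for pulling a node name apart, so each half can be checked separately, might look like this. The function names node_name and node_host are illustrative only.

```shell
#!/bin/sh
# Split an Erlang node name of the form name@host with POSIX parameter expansion.
node_name() { printf '%s\n' "${1%%@*}"; }   # everything before the first @
node_host() { printf '%s\n' "${1#*@}"; }    # everything after the first @

for node in couchdb@10.139.200.203 notcouchdb@f.q.d.n; do
  echo "name=$(node_name "$node") host=$(node_host "$node")"
done
```

The host half has to match what the other nodes are told during cluster setup, and the name half matters as well, as the resolution further down this thread shows.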

Here are the contents of local.d:

user.ini
-------
[admins]
admin = -pbkdf2-REDACTED
couch_user = -pbkdf2-REDACTED

[couchdb]
uuid = e697b5ff329cea4b410b4ee62980fc6d

[chttpd_auth]
secret = supersecret

rrag commented 2 years ago

OK, I have made some more progress and found a mistake in my steps. I am able to set up the cluster with FQDNs, but the node name always has to be couchdb@...; with anything else the cluster setup will not finish.

Here are the steps for anyone else facing this problem.

I have 3 droplets in DigitalOcean.

Here are their IP addresses and the FQDNs from a private DNS server I have configured:

COUCH_NODE1=10.139.200.203 api01.prodb.blr1.gc
COUCH_NODE2=10.139.200.208 api02.prodb.blr1.gc
COUCH_NODE3=10.139.200.209 api03.prodb.blr1.gc

Now I run these docker commands:

# on node 1
docker run \
  --rm \
  --dns=10.139.200.202 \
  --name prodb_couch \
  --mount type=bind,source=${HOME}/prodb/data/couchdb,target=/opt/couchdb/data \
  --mount type=bind,source=${HOME}/prodb/config/couchdb/local.d,target=/opt/couchdb/etc/local.d \
  -e ERL_FLAGS="-setcookie brumbrum -name couchdb@api01.prod.blr1.gc" \
  -p 5984:5984 \
  -p 4369:4369 \
  -p 9100:9100 \
  apache/couchdb:3.2.1

# on node 2
docker run \
  --rm \
  --dns=10.139.200.202 \
  --name prodb_couch \
  --mount type=bind,source=${HOME}/prodb/data/couchdb,target=/opt/couchdb/data \
  --mount type=bind,source=${HOME}/prodb/config/couchdb/local.d,target=/opt/couchdb/etc/local.d \
  -e ERL_FLAGS="-setcookie brumbrum -name couchdb@api02.prod.blr1.gc" \
  -p 5984:5984 \
  -p 4369:4369 \
  -p 9100:9100 \
  apache/couchdb:3.2.1 

# on node 3
docker run \
  --rm \
  --dns=10.139.200.202 \
  --name prodb_couch \
  --mount type=bind,source=${HOME}/prodb/data/couchdb,target=/opt/couchdb/data \
  --mount type=bind,source=${HOME}/prodb/config/couchdb/local.d,target=/opt/couchdb/etc/local.d \
  -e ERL_FLAGS="-setcookie brumbrum -name couchdb@api03.prod.blr1.gc" \
  -p 5984:5984 \
  -p 4369:4369 \
  -p 9100:9100 \
  apache/couchdb:3.2.1

The --dns=10.139.200.202 option is important; FQDNs from the private DNS server do not resolve without it.

My local.d folder has only one file:

user.ini
-------
[admins]
admin = -pbkdf2-REDACTED
couch_user = -pbkdf2-REDACTED

[couchdb]
uuid = e697b5ff329cea4b410b4ee62980fc6d

[chttpd_auth]
secret = supersecret

Now once the 3 nodes have started I run these commands

curl -X POST -H "Content-Type: application/json" \
  http://admin:admin_password@api01.prodb.blr1.gc:5984/_cluster_setup \
  -d '{"action": "enable_cluster", "bind_address":"0.0.0.0", "username": "admin", "password":"admin_password", "node_count":"3"}'
{"error":"bad_request","reason":"Cluster is already enabled"}

^ This always returns an error saying the cluster is already enabled. For the longest time I was not sure what to do; then I just ignored this step and proceeded to the next.

curl -X POST -H "Content-Type: application/json" \
  http://admin:admin_password@api01.prodb.blr1.gc:5984/_cluster_setup \
  -d '{"action": "enable_cluster", "bind_address":"0.0.0.0", "username": "admin", "password":"admin_password", "port": 5984, "node_count": "3", "remote_node": "api02.prod.blr1.gc", "remote_current_user": "admin", "remote_current_password": "admin_password" }'

{"ok":true}

curl -X POST -H "Content-Type: application/json" \
  http://admin:admin_password@api01.prodb.blr1.gc:5984/_cluster_setup \
  -d '{"action": "add_node", "host":"api02.prod.blr1.gc", "port": 5984, "username": "admin", "password":"admin_password"}'

{"ok":true}

curl -X POST -H "Content-Type: application/json" \
  http://admin:admin_password@api01.prodb.blr1.gc:5984/_cluster_setup \
  -d '{"action": "enable_cluster", "bind_address":"0.0.0.0", "username": "admin", "password":"admin_password", "port": 5984, "node_count": "3", "remote_node": "api03.prod.blr1.gc", "remote_current_user": "admin", "remote_current_password": "admin_password" }'

{"ok":true}

curl -X POST -H "Content-Type: application/json" \
  http://admin:admin_password@api01.prodb.blr1.gc:5984/_cluster_setup \
  -d '{"action": "add_node", "host":"api03.prod.blr1.gc", "port": 5984, "username": "admin", "password":"admin_password"}'

{"ok":true}

sleep 4

curl -X POST -H "Content-Type: application/json" \
  http://admin:admin_password@api01.prodb.blr1.gc:5984/_cluster_setup \
  -d '{"action": "finish_cluster"}'

{"ok":true}

The problem I had was that in the /_cluster_setup requests I was always passing the remote IP in remote_node and host. Now I pass the FQDN in both remote_node and host.
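
The order of requests above can be sketched as a dry run that only prints what would be sent, which makes it easy to confirm that the same FQDN ends up in both remote_node and host. The function name setup_plan is hypothetical and nothing here talks to CouchDB.

```shell
#!/bin/sh
# Dry run: print the order of _cluster_setup actions for a coordinator node
# and its remote nodes. No request is sent.
setup_plan() {
  coordinator="$1"; shift
  for remote in "$@"; do
    echo "POST http://$coordinator:5984/_cluster_setup enable_cluster remote_node=$remote"
    echo "POST http://$coordinator:5984/_cluster_setup add_node host=$remote"
  done
  echo "POST http://$coordinator:5984/_cluster_setup finish_cluster"
}

setup_plan api01.prod.blr1.gc api02.prod.blr1.gc api03.prod.blr1.gc
```

Each remote node gets one enable_cluster and one add_node, and finish_cluster is sent once at the end.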

now finally

curl http://admin:admin_password@api02.prodb.blr1.gc:5984/_membership | jq

{
  "all_nodes": [
    "couchdb@api01.prod.blr1.gc",
    "couchdb@api02.prod.blr1.gc",
    "couchdb@api03.prod.blr1.gc"
  ],
  "cluster_nodes": [
    "couchdb@api01.prod.blr1.gc",
    "couchdb@api02.prod.blr1.gc",
    "couchdb@api03.prod.blr1.gc"
  ]
}

I still do not know how to use a name other than couchdb@, because if I change it to anything else I get this error during the last step of the _cluster_setup:

{"error":"setup_error","reason":"Cluster setup timed out waiting for nodes to connect"}

Could anyone suggest what I am missing?

rrag commented 2 years ago

With help from the nice folks in the CouchDB Slack channel I got this resolved. The solution is to add a "name" field (for example "name": "notcouchdb") to the "add_node" request.

e.g.

curl -X POST -H "Content-Type: application/json" \
  http://admin:admin_password@api01.prodb.blr1.gc:5984/_cluster_setup \
  -d '{"action": "add_node", "host":"api02.prod.blr1.gc", "name": "couch01", "port": 5984, "username": "admin", "password":"admin_password"}'

{"ok":true}

curl -X POST -H "Content-Type: application/json" \
  http://admin:admin_password@api01.prodb.blr1.gc:5984/_cluster_setup \
  -d '{"action": "add_node", "host":"api03.prod.blr1.gc", "name": "couch01", "port": 5984, "username": "admin", "password":"admin_password"}'

{"ok":true}

Once this additional "name" is added the cluster gets created properly
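
To avoid hand-editing the JSON for every node, the add_node body can be generated from the host and the name part. This is only a sketch: add_node_body is a hypothetical helper, the credentials are the placeholders used in this thread, and it assumes "name" should be whatever comes before the @ in the node's -name value.

```shell
#!/bin/sh
# Hypothetical helper: build the add_node request body.
#   $1 = FQDN of the node being added
#   $2 = name part of its Erlang node name (the text before the @)
add_node_body() {
  printf '{"action": "add_node", "host": "%s", "name": "%s", "port": 5984, "username": "admin", "password": "admin_password"}\n' "$1" "$2"
}

add_node_body api02.prod.blr1.gc couch01
```

The printed body can then be passed to curl with -d against the /_cluster_setup endpoint on the coordinating node.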

kocolosk commented 2 years ago

Thanks for following up and posting the final resolution!