autopilotpattern / mongodb

A robust and highly-scalable implementation of MongoDB in Docker using the Autopilot Pattern
Mozilla Public License 2.0

Replicaset on Joyent - replicas not connecting #12

Closed sberryman closed 7 years ago

sberryman commented 7 years ago

So I have MongoDB replica sets running locally on Docker for Mac without using network: bridge; however, when running on Joyent I can't seem to get the replica to connect and sync with the primary.

  1. Pulled down the repo
  2. Updated docker-compose.yml (see gist)
  3. Started it with docker-compose -p mdb_test -f docker-compose.yml up -d
  4. Waited for consul and mongodb to start and then ran docker logs mdbtest_mongodb_1 (see gist)
  5. It took a little while, but it managed to create the mongodb-primary key and lock session in consul
  6. Then I ran rs.status() on the only running mongodb node (see gist)
  7. Now I scaled up by 1 node to have 2 running mongodb nodes. docker-compose -p mdb_test -f docker-compose.yml scale mongodb=2
  8. Waited for mongodb2 to show up (see gist)
  9. Ran rs.status() on the primary again. (see gist)

docker ps:

CONTAINER ID        IMAGE                      COMMAND                  CREATED             STATUS              PORTS                                                                                NAMES
56cedfc2fdbf        autopilotpattern/mongodb   "containerpilot mo..."   5 minutes ago       Up 5 minutes        0.0.0.0:27017->27017/tcp                                                             mdbtest_mongodb_2
2a3303edc766        autopilotpattern/mongodb   "containerpilot mo..."   15 minutes ago      Up 14 minutes       0.0.0.0:27017->27017/tcp                                                             mdbtest_mongodb_1
e5a1f80f8cb1        consul:0.7.5               "docker-entrypoint..."   17 minutes ago      Up 17 minutes       8300-8302/tcp, 8400/tcp, 8301-8302/udp, 8600/tcp, 8600/udp, 0.0.0.0:8500->8500/tcp   mdbtest_consul_1

I've tried with and without exposing any ports publicly, and with and without using the consul agent. The only thing I can think of at this point is something around DNS. I see the redis autopilot pattern is using a very different method to get the IP address of the container.

MongoDB: https://github.com/autopilotpattern/mongodb/blob/master/bin/manage.py#L420-L430
Redis: https://github.com/autopilotpattern/redis/blob/master/bin/manage.sh#L311-L313
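
For illustration only (neither snippet below is copied from those files), the two general approaches differ in where the address comes from: one resolves the container's own hostname, the other asks the routing stack which source address would be used, and the two can disagree on a multi-interface Triton instance:

    import socket

    # Approach 1: resolve the container's own hostname (hosts-file / DNS based)
    ip_from_hostname = socket.gethostbyname(socket.gethostname())

    # Approach 2: ask the kernel which source address would be used for outbound
    # traffic (roughly what interface-based shell lookups approximate); a UDP
    # connect() sends no packets
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(('8.8.8.8', 53))
    ip_from_route = s.getsockname()[0]
    s.close()

    print(ip_from_hostname, ip_from_route)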

I figure I have to be doing something wrong as nobody else has mentioned a problem...

tgross commented 7 years ago

Yeah, if you dig into your logs there's a DNS lookup for consul-mdb.svc.{{ACCOUNT_ID}}.us-sw-1.cns.joyent.com happening, which looks to be mismatched with this line of your Compose file.

This means none of your mongo instances are finding Consul, so they can't do service discovery.
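
A quick sanity check (a sketch, not part of the project's code) is to confirm from inside a mongodb container that whatever CONSUL is set to actually resolves, since it has to match the CNS name the Compose file gives the consul service:

    import os
    import socket

    # CONSUL is the env var the image uses to find Consul; the name must resolve
    # from inside the container and point at the consul service from the Compose file
    consul_host = os.environ.get('CONSUL', 'consul')
    print(consul_host, '->', socket.gethostbyname(consul_host))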

sberryman commented 7 years ago

I just masked out the account ID. When I actually ran the compose file I used the FQDN. If you look at the logs you can see it actually picked up and registered with the consul server. It is creating the mongodb-primary key and lock session just fine.

tgross commented 7 years ago

It seems to be having some trouble reaching Consul here https://gist.github.com/sberryman/0c61b319e57de014fc43e8af414a14ee#file-mongodb_1-log-L158, but it looks like it recovers later on. We keep trying to create the session, though, and keep getting errors later down the road. This doesn't seem like the correct behavior, but I'll admit it's been a while since I've looked at this code.

It also looks like there's a stack trace being dropped in the log for the on_change handler:

2017/04/04 17:42:48 2017-04-04 17:42:48,315 INFO manage.py Function failed on_change
2017/04/04 17:42:48     msg = self.format(record)
2017/04/04 17:42:48   File "/usr/lib/python2.7/logging/__init__.py", line 732, in format
2017/04/04 17:42:48     return fmt.format(record)
2017/04/04 17:42:48   File "/usr/lib/python2.7/logging/__init__.py", line 471, in format
2017/04/04 17:42:48     record.message = record.getMessage()
2017/04/04 17:42:48   File "/usr/lib/python2.7/logging/__init__.py", line 335, in getMessage
2017/04/04 17:42:48     msg = msg % self.args
2017/04/04 17:42:48 TypeError: not all arguments converted during string formatting
2017/04/04 17:42:48 Logged from file manage.py, line 222

Which is here https://github.com/autopilotpattern/mongodb/blob/master/bin/manage.py#L222

    try:
        repl_status = local_mongo.admin.command('replSetGetStatus')
        is_mongo_primary = repl_status['myState'] == 1
        # ref https://docs.mongodb.com/manual/reference/replica-states/
    except Exception as e:
        log.error(e, 'unable to get primary status')
        return False

That's not valid Python for the log bit, so we need to fix that, but I don't think that's the problem either.
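
A sketch of the corrected fragment (assuming log is a standard logging.Logger): the message string goes first and the exception is passed as a formatting argument:

    try:
        repl_status = local_mongo.admin.command('replSetGetStatus')
        is_mongo_primary = repl_status['myState'] == 1
        # ref https://docs.mongodb.com/manual/reference/replica-states/
    except Exception as e:
        # logging does the interpolation itself, so no TypeError on format
        log.error('unable to get primary status: %s', e)
        return False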

I'm not sure where this error message is coming from, as that log string doesn't appear in the code as far as I can tell. What version are you using?

sberryman commented 7 years ago

Haha I noticed that error as well and couldn't find it in the project. "DEBUG manage.py no replset config has been received"

I tried using autopilotpattern/mongodb:latest as well as pulling down the repo and building from master. Then I tested upgrading ContainerPilot to 2.7.2, since I saw you pushed that today; the ContainerPilot version didn't seem to change anything. I'm using the global _/consul:0.7.5, but I've also tried with autopilotpattern/consul:0.7.2-r0.8.

So I just tried to modify the manage.py file and change the location of consul. I noticed that other projects have the concept of CONSUL_AGENT=1, which doesn't exist in this example. Then I realized that the agent coprocess has been added, but manage.py doesn't point at the Consul agent on localhost:8500; it points at whatever is specified in the CONSUL env var. Changing it to localhost:8500 doesn't fix it.
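
For reference, a rough sketch of what pointing the client at the local agent could look like with python-consul (the CONSUL_AGENT handling here is hypothetical, borrowed from the other projects mentioned):

    import os
    from consul import Consul

    # if a local Consul agent coprocess is running, talk to it on localhost;
    # otherwise fall back to whatever the CONSUL env var names
    if os.environ.get('CONSUL_AGENT'):
        consul = Consul(host='localhost', port=8500)
    else:
        consul = Consul(host=os.environ.get('CONSUL', 'consul'), port=8500)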

Everything seems to be syncing to consul just fine though:

[two screenshots taken 2017-04-04 at 12:35 and 12:36 PM showing the registered services in Consul]

sberryman commented 7 years ago

After some Google searching, it turns out the no replset config has been received error is coming from MongoDB itself. Most of the errors look like they are related to hostname/port combinations.
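
One way to check those combinations (a sketch, assuming a plain PyMongo connection to the primary) is to dump what each member advertises as its host:port, since every member has to be able to resolve and reach every other member's entry:

    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    # replSetGetConfig returns the replica set config, including each member's host:port
    config = client.admin.command('replSetGetConfig')
    for member in config['config']['members']:
        print(member['_id'], member['host'])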

sberryman commented 7 years ago

I just tried upgrading the python modules with no luck either:

PyMongo: 3.2.2 -> 3.4.0
python-consul: 0.4.7 -> 0.7.0

It seems to be failing at repl_status = local_mongo.admin.command('replSetGetStatus') https://github.com/autopilotpattern/mongodb/blob/master/bin/manage.py#L110
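
That lines up with the MongoDB error above: on a member that hasn't received a replica set config yet, replSetGetStatus fails, and PyMongo surfaces that as an OperationFailure. A minimal reproduction sketch (not the manage.py code itself):

    from pymongo import MongoClient
    from pymongo.errors import OperationFailure

    local_mongo = MongoClient('localhost', 27017)
    try:
        repl_status = local_mongo.admin.command('replSetGetStatus')
        print(repl_status['myState'])
    except OperationFailure as exc:
        # an uninitialized member answers with "no replset config has been received"
        print('replSetGetStatus failed:', exc)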

Update: I've updated the gist to add two new files.

mdbtest_mongodb_1.log mdbtest_mongodb_2.log

sberryman commented 7 years ago

At this point I believe you are correct and the logging error isn't causing any issues.

It seems like GET /v1/agent/services isn't returning the correct services. https://github.com/autopilotpattern/mongodb/blob/master/bin/manage.py#L281-L283

docker exec -it mdbtest_mongodb_1 bash

curl http://localhost:8500/v1/agent/services

{
    "mongodb-replicaset-8921a0c3da47": {
        "ID": "mongodb-replicaset-8921a0c3da47",
        "Service": "mongodb-replicaset",
        "Tags": null,
        "Address": "192.168.128.68",
        "Port": 27017,
        "EnableTagOverride": false,
        "CreateIndex": 0,
        "ModifyIndex": 0
    }
}
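
The python-consul equivalent of that endpoint (a sketch, assuming a client pointed at the local agent) shows the limitation: agent.services() only lists services registered with that one agent, which would explain why the primary never sees the new replica:

    from consul import Consul

    consul = Consul(host='localhost', port=8500)
    # GET /v1/agent/services: only services registered with THIS local agent
    print(consul.agent.services())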

You mention using consul.agent.health() instead; when testing against the local Consul agent on the primary node I get:

curl http://localhost:8500/v1/health/service/mongodb-replicaset?passing=true

[{
    "Node": {
        "ID": "",
        "Node": "588d353877e6",
        "Address": "192.168.128.69",
        "TaggedAddresses": {
            "lan": "192.168.128.69",
            "wan": "192.168.128.69"
        },
        "Meta": null,
        "CreateIndex": 119,
        "ModifyIndex": 218
    },
    "Service": {
        "ID": "mongodb-replicaset-588d353877e6",
        "Service": "mongodb-replicaset",
        "Tags": null,
        "Address": "192.168.128.69",
        "Port": 27017,
        "EnableTagOverride": false,
        "CreateIndex": 124,
        "ModifyIndex": 127
    },
    "Checks": [{
        "Node": "588d353877e6",
        "CheckID": "mongodb-replicaset-588d353877e6",
        "Name": "mongodb-replicaset-588d353877e6",
        "Status": "passing",
        "Notes": "TTL for mongodb-replicaset set by containerpilot",
        "Output": "ok",
        "ServiceID": "mongodb-replicaset-588d353877e6",
        "ServiceName": "mongodb-replicaset",
        "CreateIndex": 126,
        "ModifyIndex": 127
    }, {
        "Node": "588d353877e6",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "CreateIndex": 119,
        "ModifyIndex": 119
    }]
}, {
    "Node": {
        "ID": "",
        "Node": "8921a0c3da47",
        "Address": "192.168.128.68",
        "TaggedAddresses": {
            "lan": "192.168.128.68",
            "wan": "192.168.128.68"
        },
        "Meta": null,
        "CreateIndex": 73,
        "ModifyIndex": 222
    },
    "Service": {
        "ID": "mongodb-replicaset-8921a0c3da47",
        "Service": "mongodb-replicaset",
        "Tags": null,
        "Address": "192.168.128.68",
        "Port": 27017,
        "EnableTagOverride": false,
        "CreateIndex": 77,
        "ModifyIndex": 80
    },
    "Checks": [{
        "Node": "8921a0c3da47",
        "CheckID": "mongodb-replicaset-8921a0c3da47",
        "Name": "mongodb-replicaset-8921a0c3da47",
        "Status": "passing",
        "Notes": "TTL for mongodb-replicaset set by containerpilot",
        "Output": "ok",
        "ServiceID": "mongodb-replicaset-8921a0c3da47",
        "ServiceName": "mongodb-replicaset",
        "CreateIndex": 79,
        "ModifyIndex": 80
    }, {
        "Node": "8921a0c3da47",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "CreateIndex": 73,
        "ModifyIndex": 73
    }]
}]

Which clearly gives me BOTH nodes. This has been like searching for a needle in a haystack, but it's been a good opportunity to see how this all works. I'm going to take a break and then try to tackle the change over to using health.

Not sure how anyone is using this on Joyent as is right now though...
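
For the change over to health, a rough sketch of what the lookup could become with python-consul (the service name and peer formatting here are just illustrative):

    from consul import Consul

    consul = Consul(host='localhost', port=8500)
    # GET /v1/health/service/<name>?passing=true: a cluster-wide view, filtered to healthy instances
    _index, entries = consul.health.service('mongodb-replicaset', passing=True)
    peers = ['{}:{}'.format(e['Service']['Address'], e['Service']['Port']) for e in entries]
    print(peers)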

sberryman commented 7 years ago

@tgross do you need me to make any changes to PR #13?