alphagov / govuk-docker

GOV.UK development environment using Docker 🐳
MIT License

Router API cannot connect to Mongo 2.6 #533

Open huwd opened 2 years ago

huwd commented 2 years ago

We've encountered a chaining problem while looking into how publishing-api puts messages onto RabbitMQ. We traced it through this chain: Publishing API -> Content Store -> Router API -> Router

The problem seems to be that router-api cannot find a primary MongoDB server.

To replicate:

➜  router-api git:(main) govuk-docker-run bundle exec rails c
docker-compose -f [...] run router-api-lite bundle exec rails c
Creating govuk-docker_router-api-lite_run ... done
Loading development environment (Rails 6.0.3.7)
irb(main):001:0> Route.count
Traceback (most recent call last):
        1: from (irb):1
Mongo::Error::NoServerAvailable (No primary server is available in cluster: #<Cluster topology=Unknown[mongo-2.6:27017] servers=[#<Server address=mongo-2.6:27017 GHOST>]> with timeout=30, LT=0.015)

The mongo container does run, and you can watch its logs, though it is stuck in a loop of opening and closing connections, punctuated by the following fairly suspect message:

2021-12-03T16:06:57.809+0000 [rsStart] warning: getaddrinfo("48703775aaf0") failed: Name or service not known
2021-12-03T16:06:57.846+0000 [rsStart] getaddrinfo("48703775aaf0") failed: Name or service not known
2021-12-03T16:06:57.846+0000 [rsStart] replSet info Couldn't load config yet. Sleeping 20sec and will try again.

@kevindew spotted that if we comment out this line things start working again.

That line seems to have been introduced during work to resolve differences in how rs.status responds between Mongo v2.6 (which router runs in prod) and more modern versions.

https://github.com/alphagov/govuk-docker/pull/499

This may have been an attempt to resolve this issue: https://github.com/alphagov/router/issues/210

Questions to answer: what was L46 trying to resolve? Does it still serve that purpose? Can we replace it with something that doesn't block local dev, or remove it altogether?

karlbaker02 commented 2 years ago

L46 is necessary as we have been running MongoDB as a replica set since around April 2021, in order to enable the app to be replatformed. Previously, Router API knew about all running Router instances and would, upon a request to update a route, update said route and then call the /reload endpoint on each and every Router instance in order to ensure each instance's routes were up-to-date.
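
As a rough illustration of the old fan-out behaviour described above, here is a minimal Ruby sketch. It is not the actual Router API code: the host list, the port, and the `reload_all` helper are all hypothetical, and the HTTP call is injected so the example runs without a network.

```ruby
require "uri"

# Hypothetical, hardcoded list of Router instances (names and port are made up).
ROUTER_HOSTS = %w[router-1.example:3055 router-2.example:3055].freeze

# Build the /reload URL for each known Router instance.
def reload_urls(hosts)
  hosts.map { |host| URI("http://#{host}/reload") }
end

# After a route update, call /reload on every instance.
# `post` is any callable taking a URI; it is injected for illustration.
def reload_all(hosts, post:)
  reload_urls(hosts).map { |uri| post.call(uri) }
end

# Record the calls instead of making real HTTP requests.
called = []
statuses = reload_all(ROUTER_HOSTS, post: ->(uri) { called << uri.to_s; :ok })
puts called
```

In a real setup the injected callable could be something like `->(uri) { Net::HTTP.post(uri, "") }`. The point of the sketch is that the caller has to know every instance up front, which is what did not carry over to Kubernetes.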

Replatforming changed this behaviour. Router API no longer needs to know about individual Router instances (these were hardcoded, which did not translate into the Kubernetes world into which we're now moving). Instead, Router instances poll MongoDB for new changes every few seconds. We enabled this through the use of a replica set and the db.stats() method: an instance determines whether it has an up-to-date copy of the current routes by comparing the current optime to its cached optime, reloading if changes have occurred.
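
The polling idea can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not Router's implementation: `RoutePoller`, `fetch_optime` and `reload` are hypothetical names, and the optime is faked with plain integers rather than read from MongoDB.

```ruby
# Hypothetical sketch: reload routes only when the observed optime changes.
class RoutePoller
  attr_reader :reload_count

  # fetch_optime: callable returning the current optime (any comparable value)
  # reload:       callable invoked when the routes need reloading
  def initialize(fetch_optime:, reload:)
    @fetch_optime = fetch_optime
    @reload = reload
    @cached_optime = nil
    @reload_count = 0
  end

  # One polling step. Returns true if a reload happened.
  def tick
    current = @fetch_optime.call
    return false if current == @cached_optime

    @reload.call
    @cached_optime = current
    @reload_count += 1
    true
  end
end

# Fake optime source: unchanged on the second poll, advanced on the third.
optimes = [100, 100, 101].each
poller = RoutePoller.new(fetch_optime: -> { optimes.next }, reload: -> {})
results = 3.times.map { poller.tick }
p results # => [true, false, true]
```

A real poller would call `tick` on a timer every few seconds. Without a working replica set there is no optime to compare, which is why the replica set configuration matters even in local development.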