dunglas / mercure

🪽 An open, easy, fast, reliable and battery-efficient solution for real-time communications
https://mercure.rocks

Unable to upgrade an API-Platform/FrankenPHP/Mercure Docker Swarm service without downtime #898

Open toby-griffiths opened 5 months ago

toby-griffiths commented 5 months ago

I think that this is a Mercure issue, but please correct me if I'm wrong…

We have just deployed an API Platform-based project to a Docker Swarm and it's working nicely. However, when we attempt to update the services, the first update attempt always seems to fail, with the following error appearing in the logs:

Error: loading initial config: loading new config: loading http app module: provision http: server srv0: setting up route handlers: route 0: loading handler modules: position 0: loading module 'subroute': provision http.handlers.subroute: setting up subroutes: route 0: loading handler modules: position 4: loading module 'mercure': provision http.handlers.mercure: "bolt:///data/mercure.db?subscriptions=1": invalid transport: timeout

If we re-run the same docker stack update command, the existing service appears to stop, the API goes offline briefly while the new service starts up, and then everything works again.

Is this caused by some form of locking on the Mercure data store? Is there a way around this?

I've briefly looked at the High Availability docs today, and at how you can build a custom transport, but I'm not very familiar with Go, so I wouldn't know where to start with this. Any pointers that would help resolve this issue would be very much appreciated.

Thanks for all your great work on this project.

toby-griffiths commented 4 months ago

Is anyone able to give me any pointers on this one? We're now approaching a production launch, and I'd prefer that we didn't have to do all our deploys out of hours, when we can afford a brief outage for the update.

Any pointers/thoughts/ideas are very welcome. Thank you.

dunglas commented 4 months ago

I guess that Docker starts the new container before stopping the existing one. This is an issue when using the Bolt transport, because BoltDB relies on an exclusive file lock: the first container must release the lock before the second one can take it.
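
To illustrate, here is a minimal sketch of that locking behavior using go.etcd.io/bbolt, the library behind the Bolt transport. The file name and timeout are illustrative, and for brevity both opens happen in one process here; the same thing happens across two containers sharing a volume. A second Open on a locked database file gives up after its Timeout and returns a "timeout" error, which is what surfaces as "invalid transport: timeout" in the Caddy logs above:

```go
package main

import (
	"fmt"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// The "old container": opens the database and takes an exclusive
	// file lock on it.
	db, err := bolt.Open("mercure.db", 0o600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// The "new container": tries to open the same file while the lock
	// is still held. With a Timeout set, Open gives up and returns an
	// error instead of blocking forever.
	_, err = bolt.Open("mercure.db", 0o600, &bolt.Options{Timeout: time.Second})
	fmt.Println(err) // timeout
}
```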

An option is to upgrade to the (paid) on-premise version, which doesn't have this issue because, unlike Bolt, Redis supports concurrent connections.
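
For comparison, a rough sketch of why concurrent access sidesteps the problem, using the github.com/redis/go-redis/v9 client. The address and channel name are made up for illustration and say nothing about how the on-premise transport actually uses Redis:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Two independent clients (think "old container" and "new container"
	// during a rolling update) can talk to the same Redis server at the
	// same time; there is no exclusive file lock to hand over.
	oldHub := redis.NewClient(&redis.Options{Addr: "redis:6379"})
	newHub := redis.NewClient(&redis.Options{Addr: "redis:6379"})

	sub := newHub.Subscribe(ctx, "updates")
	defer sub.Close()
	// Wait for the subscription to be confirmed before publishing.
	if _, err := sub.Receive(ctx); err != nil {
		panic(err)
	}

	if err := oldHub.Publish(ctx, "updates", "hello").Err(); err != nil {
		panic(err)
	}

	msg, err := sub.ReceiveMessage(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println(msg.Payload) // hello
}
```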

Another option would be to patch the Bolt transport: check whether Docker sends a signal to the existing container before starting the new one, catch that signal in the Bolt transport, and close the connection to the Bolt DB immediately (that will release the lock).
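
A rough sketch of that idea, assuming the standard behavior that Docker sends SIGTERM to a container before stopping it. This is hypothetical code, not the actual Mercure transport:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("mercure.db", 0o600, &bolt.Options{Timeout: 10 * time.Second})
	if err != nil {
		log.Fatal(err)
	}

	// Docker sends SIGTERM to the old container before killing it.
	// Closing the Bolt DB as soon as the signal arrives releases the
	// file lock, so the replacement container's Open can succeed.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
	go func() {
		<-sigs
		if err := db.Close(); err != nil {
			log.Printf("closing bolt db: %v", err)
		}
		os.Exit(0)
	}()

	// ... the hub would keep serving requests here ...
	select {}
}
```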

dunglas commented 4 months ago

This issue seems to confirm this theory: https://github.com/influxdata/influxdb/issues/24320