contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/
Other
5.78k stars 230 forks source link

Faktory crashing when trying to be redeployed #487

Open ibrahima opened 2 months ago

ibrahima commented 2 months ago

I'm experiencing an issue where the Faktory service is failing to start when trying to redeploy it. This was triggered by an Amazon ECS platform update, so it happened without us explicitly trying to redeploy it. We have it set up to deploy the new instance of a service before decommissioning the old one, but the new one never boots successfully so the old one is never decommissioned. The service tries to mount a shared volume to /var/lib/faktory so I do wonder if there's some contention happening there. I am wondering if shutting down the existing service first will allow a new service to come up successfully but I don't have a way to test that because this issue isn't occurring in our staging environment. Would appreciate any pointers to resolving this, thanks!

logs from Amazon ECS


timestamp message
1725410530251 Faktory 1.9.0
1725410530251 Copyright © 2024 Contributed Systems LLC
1725410530251 Licensed under the GNU Affero Public License 3.0
1725410530251 I 2024-09-04T00:42:10.251Z Initializing redis storage at /var/lib/faktory/db, socket /var/lib/faktory/db/redis.sock
1725410535910 E 2024-09-04T00:42:15.909Z Unable to create Faktory server: context deadline exceeded

Are you using an old version? No Have you checked the changelogs to see if your issue has been fixed in a later version? Yes.

mperham commented 2 months ago

🤷🏻‍♂️ try it and see. The datafile certainly can’t be shared by multiple Redises so that would make sense.

ibrahima commented 2 months ago

yea I think I will... I'm slightly nervous because I don't know what will happen if the service doesn't come back up but right now the service is degraded anyway, so it won't make things much worse.

ibrahima commented 2 months ago

Hmm.. seems to still be dying on start. Do you have an idea where that message might be coming from or any way to get more details? My limited understanding is that something is timing out but I'm not sure what it could be.

mperham commented 2 months ago

What does starting with “-l debug” print out?

ibrahima commented 2 months ago

oh hmm, after a few failed starts it seems like it might have succeeded! so it was probably the shared EFS volume thing, and maybe it took a while after the old service shutting down for the volume to become fully accessible. maybe it was hanging around in some zombie state or something lol... (should not have been the case, but IDK... gremlins...)