contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/
Other
5.76k stars 229 forks source link

ReplyError: ERR Unknown worker 22dded38 #343

Closed Laksh47 closed 3 years ago

Laksh47 commented 3 years ago

What's the expected/proper way to handle, when the faktory-server sends back ReplyError: ERR Unknown worker 22dded38?

commands.go#L223

faktory_worker_node library does not handle this signal sent from the faktory-server, any guidance on how to handle this error would be greatly appreciated.

Currently, below exception happens in faktory_worker_node when this signal is received and there's a PR currently open to fix this

(node:43609) UnhandledPromiseRejectionWarning: ReplyError: ERR Unknown worker 22dded38
    at parseError (/Users/laksh/Desktop/repos/kaiju/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/Users/laksh/Desktop/repos/kaiju/node_modules/redis-parser/lib/parser.js:302:14)
node_modules/source-map-support/source-map-support.js:495
(node:43609) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 2)

Should the worker re-establish the connection with the server?

Does it not matter when the faktory-server sends this error since any unacknowledged jobs will be re-queued for execution?

Please advise @mperham

mperham commented 3 years ago

Faktory keeps an live list of all known worker processes. Each process must send a heartbeat every N seconds to prove to Faktory that it is still alive. "Unknown worker" generally means that the process initially sent a heartbeat with that worker ID but didn't send BEAT every N seconds so its registration expired within Faktory after 60 seconds. All worker processes must BEAT at least every 60 seconds, 15 seconds is recommended.

It sounds like you got a reply from the f_w_n maintainer and it might be due to event loop blockage.

Laksh47 commented 3 years ago

Thanks @mperham

Laksh47 commented 3 years ago
2021-03-31T19:39:01.286Z faktory-worker:client:heart BEAT 5f705a3d
2021-03-31T19:39:01.293Z faktory-worker:connection SEND: BEAT {"wid":"5f705a3d"}
2021-03-31T19:39:01.355Z faktory-worker:connection OK
2021-03-31T19:39:01.355Z faktory-worker:connection client='BEAT {"wid":"5f705a3d"}', server='OK'
2021-03-31T19:39:01.358Z faktory-worker:client:heart GOT BEAT RESPONSE OK
2021-03-31T19:39:05.026Z faktory-worker:connection timeout
2021-03-31T19:39:05.069Z faktory-worker:connection timeout
2021-03-31T19:39:05.746Z faktory-worker:connection timeout
2021-03-31T19:39:05.893Z faktory-worker:connection timeout
2021-03-31T19:39:06.121Z faktory-worker:connection timeout
2021-03-31T19:39:10.326Z faktory-worker:connection-pool -1
2021-03-31T19:39:10.328Z faktory-worker:connection-pool -1
2021-03-31T19:39:10.338Z faktory-worker:connection close
2021-03-31T19:39:10.341Z faktory-worker:connection close
2021-03-31T19:39:11.107Z faktory-worker:connection SEND: PUSH {...REDACTED...}
2021-03-31T19:39:11.111Z faktory-worker:connection OK
2021-03-31T19:39:11.111Z faktory-worker:connection client='PUSH {...REDACTED...}'
2021-03-31T19:39:11.126Z faktory-worker:connection SEND: ACK {"jid":"67962123-2765-4a82-bf1a-bdaa2afed4c2"}
2021-03-31T19:39:11.130Z faktory-worker:connection OK
2021-03-31T19:39:11.130Z faktory-worker:connection client='ACK {"jid":"67962123-2765-4a82-bf1a-bdaa2afed4c2"}', server='OK'
2021-03-31T19:39:11.133Z faktory-worker:worker ACK 67962123-2765-4a82-bf1a-bdaa2afed4c2
2021-03-31T19:39:11.141Z faktory-worker:connection SEND: FETCH queue
2021-03-31T19:39:11.144Z faktory-worker:connection {...REDACTED...}
2021-03-31T19:39:11.145Z faktory-worker:connection client='FETCH queue', server='{...REDACTED...}'
2021-03-31T19:39:11.147Z faktory-worker:worker executing 9091ee6f-730a-4f07-a09a-e224fc2d954e
2021-03-31T19:39:15.366Z faktory-worker:connection-pool -1
2021-03-31T19:39:15.369Z faktory-worker:connection-pool -1
2021-03-31T19:39:15.381Z faktory-worker:connection close
2021-03-31T19:39:15.385Z faktory-worker:connection close
2021-03-31T19:39:16.288Z faktory-worker:client:heart BEAT 5f705a3d
2021-03-31T19:39:16.297Z faktory-worker:connection SEND: BEAT {"wid":"5f705a3d"}
2021-03-31T19:39:16.301Z faktory-worker:connection ReplyError: ERR Unknown worker 5f705a3d
    at parseError (...node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (...node_modules/redis-parser/lib/parser.js:302:14)
2021-03-31T19:39:16.301Z faktory-worker:connection client='BEAT {"wid":"5f705a3d"}', server=undefined
2021-03-31T19:39:16.304Z faktory-worker:connection-pool -1
2021-03-31T19:39:16.307Z faktory-worker:connection-pool +1
2021-03-31T19:39:16.308Z faktory-worker:connection connecting
(node:29039) UnhandledPromiseRejectionWarning: ReplyError: ERR Unknown worker 5f705a3d
    at parseError (...node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (...node_modules/redis-parser/lib/parser.js:302:14)
(node:29039) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 4)
2021-03-31T19:39:16.332Z faktory-worker:connection close
2021-03-31T19:39:16.337Z faktory-worker:connection HI {"v":2,"i":4284,"s":"feb24ce99aa99ff"}
2021-03-31T19:39:16.339Z faktory-worker:client handshake
2021-03-31T19:39:16.352Z faktory-worker:connection SEND: HELLO {"hostname":"...","v":2,"labels":["..."],"pid":29039,"wid":"5f705a3d","pwdhash":"..."}
2021-03-31T19:39:16.357Z faktory-worker:connection OK
2021-03-31T19:39:16.357Z faktory-worker:connection client='HELLO {"hostname":"...","v":2,"labels":["..."],"pid":29039,"wid":"5f705a3d","pwdhash":"..."}', server='OK'
2021-03-31T19:39:17.763Z faktory-worker:connection SEND: PUSH {...}
2021-03-31T19:39:17.769Z faktory-worker:connection-pool +1
2021-03-31T19:39:17.770Z faktory-worker:connection connecting
2021-03-31T19:39:17.775Z faktory-worker:connection OK

I see the worker is sending beats every 15 seconds and all of a sudden the faktory-server return Unknown worker ERR the worker seems to be healthy again in the next beat!

could you please explain why this happens? @mperham

It can be observed that the worker id is same 5f705a3d

mperham commented 3 years ago

Is it possible Faktory restarted at 19:39:10? I don’t see why those connections closed.

On Wed, Mar 31, 2021 at 13:37 Laksh @.***> wrote:

2021-03-31T19:39:01.286Z faktory-worker:client:heart BEAT 5f705a3d 2021-03-31T19:39:01.293Z faktory-worker:connection SEND: BEAT {"wid":"5f705a3d"} 2021-03-31T19:39:01.355Z faktory-worker:connection OK 2021-03-31T19:39:01.355Z faktory-worker:connection client='BEAT {"wid":"5f705a3d"}', server='OK' 2021-03-31T19:39:01.358Z faktory-worker:client:heart GOT BEAT RESPONSE OK 2021-03-31T19:39:05.026Z faktory-worker:connection timeout 2021-03-31T19:39:05.069Z faktory-worker:connection timeout 2021-03-31T19:39:05.746Z faktory-worker:connection timeout 2021-03-31T19:39:05.893Z faktory-worker:connection timeout 2021-03-31T19:39:06.121Z faktory-worker:connection timeout 2021-03-31T19:39:10.326Z faktory-worker:connection-pool -1 2021-03-31T19:39:10.328Z faktory-worker:connection-pool -1 2021-03-31T19:39:10.338Z faktory-worker:connection close 2021-03-31T19:39:10.341Z faktory-worker:connection close 2021-03-31T19:39:11.107Z faktory-worker:connection SEND: PUSH {...REDACTED...} 2021-03-31T19:39:11.111Z faktory-worker:connection OK 2021-03-31T19:39:11.111Z faktory-worker:connection client='PUSH {...REDACTED...}' 2021-03-31T19:39:11.126Z faktory-worker:connection SEND: ACK {"jid":"67962123-2765-4a82-bf1a-bdaa2afed4c2"} 2021-03-31T19:39:11.130Z faktory-worker:connection OK 2021-03-31T19:39:11.130Z faktory-worker:connection client='ACK {"jid":"67962123-2765-4a82-bf1a-bdaa2afed4c2"}', server='OK' 2021-03-31T19:39:11.133Z faktory-worker:worker ACK 67962123-2765-4a82-bf1a-bdaa2afed4c2 2021-03-31T19:39:11.141Z faktory-worker:connection SEND: FETCH queue 2021-03-31T19:39:11.144Z faktory-worker:connection {...REDACTED...} 2021-03-31T19:39:11.145Z faktory-worker:connection client='FETCH queue', server='{...REDACTED...}' 2021-03-31T19:39:11.147Z faktory-worker:worker executing 9091ee6f-730a-4f07-a09a-e224fc2d954e 2021-03-31T19:39:15.366Z faktory-worker:connection-pool -1 2021-03-31T19:39:15.369Z faktory-worker:connection-pool -1 2021-03-31T19:39:15.381Z faktory-worker:connection close 2021-03-31T19:39:15.385Z faktory-worker:connection close 2021-03-31T19:39:16.288Z faktory-worker:client:heart BEAT 5f705a3d 2021-03-31T19:39:16.297Z faktory-worker:connection SEND: BEAT {"wid":"5f705a3d"} 2021-03-31T19:39:16.301Z faktory-worker:connection ReplyError: ERR Unknown worker 5f705a3d at parseError (...node_modules/redis-parser/lib/parser.js:179:12) at parseType (...node_modules/redis-parser/lib/parser.js:302:14) 2021-03-31T19:39:16.301Z faktory-worker:connection client='BEAT {"wid":"5f705a3d"}', server=undefined 2021-03-31T19:39:16.304Z faktory-worker:connection-pool -1 2021-03-31T19:39:16.307Z faktory-worker:connection-pool +1 2021-03-31T19:39:16.308Z faktory-worker:connection connecting (node:29039) UnhandledPromiseRejectionWarning: ReplyError: ERR Unknown worker 5f705a3d at parseError (...node_modules/redis-parser/lib/parser.js:179:12) at parseType (...node_modules/redis-parser/lib/parser.js:302:14) (node:29039) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 4) 2021-03-31T19:39:16.332Z faktory-worker:connection close 2021-03-31T19:39:16.337Z faktory-worker:connection HI {"v":2,"i":4284,"s":"feb24ce99aa99ff"} 2021-03-31T19:39:16.339Z faktory-worker:client handshake 2021-03-31T19:39:16.352Z faktory-worker:connection SEND: HELLO {"hostname":"...","v":2,"labels":["..."],"pid":29039,"wid":"5f705a3d","pwdhash":"..."} 2021-03-31T19:39:16.357Z faktory-worker:connection OK 2021-03-31T19:39:16.357Z faktory-worker:connection client='HELLO {"hostname":"...","v":2,"labels":["..."],"pid":29039,"wid":"5f705a3d","pwdhash":"..."}', server='OK' 2021-03-31T19:39:17.763Z faktory-worker:connection SEND: PUSH {...} 2021-03-31T19:39:17.769Z faktory-worker:connection-pool +1 2021-03-31T19:39:17.770Z faktory-worker:connection connecting 2021-03-31T19:39:17.775Z faktory-worker:connection OK

I see the worker is sending beats every 15 seconds and all of a sudden the faktory-server return Unknown worker ERR the worker seems to be healthy again in the next beat!

could you please explain why this happens? @mperham https://github.com/mperham

It can be observed that the worker id is same 5f705a3d

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/contribsys/faktory/issues/343#issuecomment-811448587, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAWX5ETB5MEGN2P57BX5TTGOBYDANCNFSM4Z6UFHHQ .

Laksh47 commented 3 years ago

No, faktory was not restarted during the time @mperham

mperham commented 3 years ago

I’d debug why you have client connections timing out and closing. I don’t know any more than that.

jbielick commented 3 years ago

There's a BEAT at 19:39:01 acknowledged by the server:

2021-03-31T19:39:01.355Z faktory-worker:connection client='BEAT {"wid":"5f705a3d"}', server='OK'

And another BEAT at 19:39:16 (as expected) not acknowledged by the server:

2021-03-31T19:39:16.301Z faktory-worker:connection client='BEAT {"wid":"5f705a3d"}', server=undefined

Since this is only 15 seconds apart, the server should not have forgotten the worker by then. I think the most likely explanation (which the timeouts also suggest) is that your server process is being forcibly restarted by the kernel or otherwise. Do you have server logs around the same time window?