datalust / seq-input-gelf

Ingest GELF payloads into Seq
https://datalust.co/seq
Apache License 2.0

"GELF processing failed irrecoverably" stops the receiver #87

Closed MattMofDoom closed 3 years ago

MattMofDoom commented 3 years ago

Hi guys,

We have a high volume GELF stream from a Mulesoft instance ... not the hugest fan of receiving the logs this way, since it encapsulates the app's JSON logs in the GELF packet, but we're working towards a Log4j appender for Seq that will hopefully address that.

In the past couple of days, this has started encountering a "GELF processing failed irrecoverably" error, which hasn't happened in the roughly two years we've been using it. I can see the immediate cause is the break in the code:

// An unrecoverable error occurred receiving a chunk
Some(Err(err)) => {
    increment!(server.receive_err);
    emit_err(&err, "GELF processing failed irrecoverably");
    break;
},

and this is the only apparent case that results in the receiver stopping. It's not clear to me that this is the 'right' behaviour, given that other errors let processing continue - possibly there should be some retry behaviour at the least?
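To illustrate the difference (a toy sketch with made-up types and a made-up run_receiver helper, not the actual sqelf code): most arms log and move on to the next message, while the "unrecoverable" arm breaks out of the loop and the input goes quiet.

// Hypothetical sketch of the receiver loop's two error paths (not sqelf's real code)
enum Received {
    Msg(String),
    BadPayload(String), // recoverable: skip this message and keep going
    Fatal(String),      // "unrecoverable": stop the receiver entirely
}

fn run_receiver(items: Vec<Received>) {
    for item in items {
        match item {
            Received::Msg(msg) => println!("processing {msg}"),
            Received::BadPayload(err) => eprintln!("skipping bad payload: {err}"),
            Received::Fatal(err) => {
                eprintln!("GELF processing failed irrecoverably: {err}");
                break; // nothing runs after this; the input stops receiving
            }
        }
    }
}

fn main() {
    run_receiver(vec![
        Received::Msg("ok".into()),
        Received::Fatal("os error 10054".into()),
        Received::Msg("never processed".into()),
    ]);
}

Running that, the third message is never processed, which matches what we're seeing when the input stops.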

The last thing that was done with this input was updating it to v2.0.303-dev, but that was back in August. Nonetheless, I've reverted it to v2.0.298 in case that update introduced a bug.

For reference, the relevant logs are:

{"@t":"2021-10-07T09:40:47.8312840Z","@mt":"GELF processing failed irrecoverably","@m":"GELF processing failed irrecoverably","@i":"1e99ac5d","@l":"ERROR","@x":"An existing connection was forcibly closed by the remote host. (os error 10054)","AppId":"hostedapp-43","AppInstanceId":"appinstance-45"}

which is followed by the final server metrics, which indicate a receive error and a TCP connection timeout:

{"@t":"2021-10-07T09:40:47.8848529Z","@mt":"Collected GELF server metrics","@m":"Collected GELF server metrics","@i":"1fcf524e","@l":"DEBUG","process":{"msg":422},"receive":{"chunk":0,"msg_chunked":0,"msg_incomplete_chunk_overflow":0,"msg_unchunked":422},"server":{"process_err":0,"process_ok":423,"receive_err":1,"receive_ok":422,"tcp_conn_accept":205,"tcp_conn_close":636,"tcp_conn_timeout":1,"tcp_msg_overflow":0},"AppId":"hostedapp-43","AppInstanceId":"appinstance-45"}

Cheers,

Matt

nblumhardt commented 3 years ago

Thanks for the report, Matt!

For anyone following along, the line of code in question is:

https://github.com/datalust/sqelf/blob/dev/sqelf/src/server/mod.rs#L228

I suspect that this is a bug - but I'll need to check in with Ashley, who's out for a few days. Just in case it turns out to be relevant - is the input listening on TCP or on UDP?

Thanks again :+1:

MattMofDoom commented 3 years ago

@nblumhardt Good point - it's listening on TCP.

nblumhardt commented 3 years ago

Thanks for the extra info @MattMofDoom

KodrAus commented 3 years ago

Thanks for the report @MattMofDoom!

I'll take a closer look, but I suspect this behaviour is there to fail hard in cases that could have torn down the accept listener, so the server can be restarted rather than left running as a zombie.

That particular error should absolutely be recoverable, though, and it shouldn't ever be possible for a misbehaving client to bring the server down like that.
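Roughly the shape I have in mind (just a sketch on top of tokio with a hypothetical listener setup, not the actual sqelf internals): each client connection reads in its own task, so a reset like os error 10054 only tears down that connection while the accept loop keeps running.

// Hypothetical sketch: isolate per-connection failures from the accept loop
use tokio::io::AsyncReadExt;
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:12201").await?;

    loop {
        // An error accepting one connection is logged and the loop continues
        let (mut socket, peer) = match listener.accept().await {
            Ok(conn) => conn,
            Err(err) => {
                eprintln!("accept failed: {}", err);
                continue;
            }
        };

        // Each connection runs in its own task; an error here (e.g. os error
        // 10054, connection reset by peer) ends this task only.
        tokio::spawn(async move {
            let mut buf = vec![0u8; 8192];
            loop {
                match socket.read(&mut buf).await {
                    Ok(0) => break, // client closed cleanly
                    Ok(n) => {
                        // hand the bytes off to GELF parsing here
                        println!("received {} bytes from {}", n, peer);
                    }
                    Err(err) => {
                        eprintln!("connection from {} failed: {}", peer, err);
                        break; // drop this connection, keep serving others
                    }
                }
            }
        });
    }
}

With that split, a single badly behaved client can fail as loudly as it likes without taking the listener down with it.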

MattMofDoom commented 3 years ago

@KodrAus Interestingly, I haven't seen it recur since reverting to v2.0.298. That may be entirely coincidental, but I have an OpsGenie alert in place that would tell me if no logs have been seen for 15 minutes. Other steps I took included updating Seq to the current 2021.2 build and restarting the server, but those are reasonably standard parts of the periodic maintenance I perform ... the server is restarted at least monthly and I try to stay reasonably up to date on Seq releases.

KodrAus commented 3 years ago

Hey @MattMofDoom :wave: I've made some changes to how we handle errors on client connections that hopefully will fix this. They're published as 2.0.318-dev. If you get a chance to give it a try I'd be very keen to hear how you go!

MattMofDoom commented 3 years ago

Thanks @KodrAus - yeah I saw, thanks very much mate and will do 😁