log-courier 2.0.3 doesn't recover from logstash/elsastic search problems lasting several hours

PMDubuc commented 8 years ago

I recently has a problem that caused logstash to start sending invalid field values to Elasticsearch which rejected the event. This caused the logstash pipeline to back up and all the log-courier processes started failing with messages on stdout like the following. The incident lasted about 10 hours. It was corrected and Logstash (2.3.2) and Elasticsearch (2.3.3) were restarted. The log-courier processes didn't recover until they were all restarted a few hours later. Is there a way to fix this?

2016/06/08 16:02:11.963965 [elkserver:5012] Failing endpoint: Server did not respond within network timeout 
2016/06/08 16:02:11.963995 [elkserver:5012] Marking endpoint as failed
2016/06/08 16:02:11.964021 [elkserver:5012] Transport error: Failed by endpoint manager 
2016/06/08 16:02:11.967818 [elkserver:5012] Disconnected from <ipaddr>:5012 (elkserver)
2016/06/08 16:02:14.484755 Stopping harvest of /logs/aps51/runtime/aps/prod/launch/n254/run/logs/server.log; last change was 1h0m0s ago
2016/06/08 16:02:15.271841 Stopping harvest of /logs/aps51/runtime/aps/prod/launch/n254/run/logs/server.log; last change was 1h0m0s ago
2016/06/08 16:02:16.969403 [elkserver:5012] Attempting to connect to <ipaddr>:5012 (elkserver)
2016/06/08 16:02:17.042304 [elkserver:5012] Connected to <ipaddr>:5012 (elkserver)
2016/06/08 16:02:17.708743 Stopping harvest of /logs/aps51/runtime/aps/prod/launch/n254/run/logs/server.log; last change was 1h0m0s ago
2016/06/08 16:02:32.054493 [elkserver:5012] Failing endpoint: Server did not respond within network timeout 
2016/06/08 16:02:32.054530 [elkserver:5012] Marking endpoint as failed
2016/06/08 16:02:32.054561 [elkserver:5012] Transport error: Failed by endpoint manager 
2016/06/08 16:02:32.058099 [elkserver:5012] Disconnected from <ipaddr>:5012 (elkserver)

driskell commented 8 years ago

I have seen this once on my own system, but was unable to work out what was happening internally. I've not been able to reproduce it locally unfortunately.

If you can enable log level: debug in your configuration, that should shed some light if it happens again. If you're sending logs to syslog it sometimes filters debug level so you may need to set log file: /var/log/something to drop logs somewhere. If you need to rotate it in case it gets big you can reload Log Courier and it'll reopen log files.

I'll keep looking into this.

driskell commented 8 years ago

Just want to note an observation in here so I don't forget. I saw 14 Pending Payloads total. But it appears on reconnect it would only resend 1 Payload. Checking Logstash I could see an acknowledgement was sent for this payload - though couldn't confirm it was received (though looked to be). The endpoint line count did not increase too. On each reconnect its the same single payload.

driskell commented 8 years ago

I found one possible cause for this which would be fairly random. The payload is not reset when it gets retransmitted. So if it was previously "half-acknowledged" before it timed out - it might enter a situation where the ACK from server is dropped and thus it keeps timing out.

Will roll up a 2.0.4 tonight to fix this.

driskell / log-courier

log-courier 2.0.3 doesn't recover from logstash/elsastic search problems lasting several hours #324