ArieGato / serilog-sinks-rabbitmq

Serilog Sink for RabbitMq
Apache License 2.0

Limit the queue to avoid unnecessary memory usage when things go wrong on the infrastructure side #96

Closed · danilobreda closed this issue 4 years ago

danilobreda commented 4 years ago

When the RabbitMQ server goes down, the list of queued logs begins to grow and stays in memory until the server comes back and the data can be sent. In some cases the server may take a long time to return. The application then runs out of memory and suddenly dies because of the large queue of logs. This only happens in some scenarios...

The solution would be to create a limit for this queue. When the queue reaches that limit, either remove the oldest records in favor of the new ones, or stop writing new records and throw them away.
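Something like a bounded channel could express this idea (just a hypothetical sketch in .NET, not code from this sink; the capacity value is illustrative):

```csharp
// Hypothetical sketch of the proposal: a bounded in-memory buffer that evicts
// the oldest log events once a fixed capacity is reached, so memory use is capped.
using System.Threading.Channels;
using Serilog.Events;

var options = new BoundedChannelOptions(capacity: 10_000) // illustrative limit
{
    // When the buffer is full, silently drop the oldest queued event.
    FullMode = BoundedChannelFullMode.DropOldest,
    SingleReader = true
};
var buffer = Channel.CreateBounded<LogEvent>(options);

// Producer side: TryWrite never blocks, and the buffer never grows past capacity.
// buffer.Writer.TryWrite(logEvent);
```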

madslyng commented 4 years ago

@danilobreda That's a pretty terrible situation to get into: your logging framework becoming the cause of what eventually kills your application. And of course it shouldn't be that way.

I'm wondering what the right approach would be here, because I wouldn't want to lose logs even though memory might be filling up. I could argue that your RabbitMq setup shouldn't be just one instance - it should be redundant. But I understand that not everyone has that privilege.

I've asked in Serilog's Gitter chat whether an approach already exists, or if someone has a suggestion. I'll get back to you when I hear back.

madslyng commented 4 years ago

@danilobreda I didn't get any response from the Serilog Gitter, but I did some research on my own and found: https://github.com/serilog/serilog/wiki/Reliability#asynchronousbatched-network-operations

These sinks never fail when events are written, but may fail to asynchronously send a batch in the background. When a batch fails, details are written to SelfLog.

The batch being sent will be held in memory, and will be re-tried at an increasing interval that steps up from 5 seconds to 10 minutes. The increasing interval protects the receiver from a flood of connections when it comes back online after a period of downtime.

If the batch cannot be sent after 4 such attempts, it will be dropped and a new batch attempted. This protects against a "bad" event rejected by the receiver from clogging the logger. A subsequent success will allow other batches to continue transmission normally.

If two more attempts fail (totalling 6 failed attempts, generally around the 10 minute mark) the entire buffer of waiting log events will be dropped. This protects against out-of-memory errors when log events cannot be delivered for a long time.

If the connection remains broken, the buffer will be flushed at 10 minute intervals until the connection is re-established.

Sink authors: by deriving from PeriodicBatchingSink this behavior is provided by default. Implementing a custom ILogEventSink is necessary if different behavior is required.

Serilog.Sinks.RabbitMq derives from PeriodicBatchingSink, so according to Serilog's official documentation the scenario you have described should not, by default, happen. And to the best of my knowledge, Serilog.Sinks.RabbitMq has not implemented anything to alter that behaviour.
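For illustration, here is a minimal sketch of a sink deriving from PeriodicBatchingSink (assuming the Serilog.Sinks.PeriodicBatching 2.x API; this is not the actual Serilog.Sinks.RabbitMq source). The retry, backoff, and buffer-drop behaviour quoted above is inherited automatically, and the optional queueLimit constructor parameter additionally caps the number of events held in memory:

```csharp
using System;
using System.Collections.Generic;
using Serilog.Events;
using Serilog.Sinks.PeriodicBatching;

// Sketch only: a batching sink gets the documented retry/drop behaviour simply
// by inheriting from PeriodicBatchingSink, without any extra code.
class RabbitMqSinkSketch : PeriodicBatchingSink
{
    public RabbitMqSinkSketch(int batchSizeLimit, TimeSpan period, int queueLimit)
        : base(batchSizeLimit, period, queueLimit) // queueLimit caps buffered events
    {
    }

    protected override void EmitBatch(IEnumerable<LogEvent> events)
    {
        // Publish the batch to RabbitMQ here. If this throws while the broker is
        // down, the base class retries at increasing intervals and eventually
        // drops the buffer, exactly as the quoted documentation describes.
    }
}
```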

If you want me to find a reason for this happening, I will need additional information. Based on Serilog's documentation of PeriodicBatchingSink, this seems unlikely to happen.
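One concrete piece of information that would help: the documentation quoted above says failed batch sends are written to SelfLog, so enabling it should show what the sink is doing while the broker is down, e.g.:

```csharp
using System;
using Serilog.Debugging;

// Route Serilog's internal diagnostics to stderr; failed batch sends from
// PeriodicBatchingSink-based sinks will show up here.
SelfLog.Enable(message => Console.Error.WriteLine(message));
```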

I'll close this issue based on the reasoning that Serilog officially states that the cleanup is built in.