cloudfoundry-community / firehose-to-syslog

Send firehose events from Cloud Foundry to syslog.
MIT License

Exception occurred! Message: Missed Logs Details: #209

Open filidav opened 5 years ago

filidav commented 5 years ago

Any suggestions? Running latest version of code.

[ERR] [2018-10-26 17:40:04.853390705 +0000 UTC m=+2361.613918594] Exception occurred! Message: Missed Logs Details: 100000

jayam18 commented 5 years ago

Same here. Running the latest code and experiencing the same exception. Not sure if anyone has figured out why it happens or how to fix it.

lrstanley commented 5 years ago

Frequently happens for us as well. Using firehose-to-syslog on 8+ foundations, and all of them see this. Our log platform seems to be lower than ever, and isn't under high utilization, so I'm hesitant to believe that it is the logging platform that is bottlenecking things. Has happened across multiple versions of firehose-to-syslog as well. I've tried adjusting buffer sizes, and various different options, but can't seem to figure out why.

The layout of this Go project is also kind of all over the place, so it's hard to track down the potential origin (being unfamiliar with diodes and things like that doesn't help either).

Scoobed commented 5 years ago

My team and I were looking into this issue in our deployment

The buffer used is this library -- the README has a good overview of why it happens (see the basic example)

https://github.com/cloudfoundry/go-diodes/blob/master/README.md

We are testing changing the buffer setting from 10,000 to 100,000 to see if it stops it from happening.
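To illustrate why an overwriting buffer produces a "missed" count, here is a minimal sketch of the idea. It is a simplified, single-goroutine model and not the go-diodes implementation; the `ring` type and its `Set`/`TryNext` methods are hypothetical names chosen for this example, and the real library's bookkeeping may differ.

```go
package main

import "fmt"

// ring is a simplified model of an overwriting buffer.
// It is NOT the go-diodes implementation, only an illustration.
type ring struct {
	buf      []int
	writeIdx uint64 // total number of values written
	readIdx  uint64 // total number of values read or skipped
	alert    func(missed int)
}

func newRing(size int, alert func(missed int)) *ring {
	return &ring{buf: make([]int, size), alert: alert}
}

// Set never blocks: once the buffer is full it overwrites the oldest slot.
func (r *ring) Set(v int) {
	r.buf[r.writeIdx%uint64(len(r.buf))] = v
	r.writeIdx++
}

// TryNext returns the next unread value. If the writer has lapped the
// reader, it reports the overwritten values through alert and skips ahead
// to the oldest value that still exists.
func (r *ring) TryNext() (int, bool) {
	if r.readIdx == r.writeIdx {
		return 0, false // nothing to read
	}
	size := uint64(len(r.buf))
	if r.writeIdx-r.readIdx > size {
		missed := r.writeIdx - r.readIdx - size
		r.alert(int(missed))
		r.readIdx += missed
	}
	v := r.buf[r.readIdx%size]
	r.readIdx++
	return v, true
}

func main() {
	r := newRing(4, func(missed int) { fmt.Println("Missed Logs:", missed) })
	// The producer writes 10 values before the consumer reads anything.
	for i := 0; i < 10; i++ {
		r.Set(i)
	}
	for {
		v, ok := r.TryNext()
		if !ok {
			break
		}
		fmt.Println("read", v)
	}
}
```

In this model the writer is never slowed down by the reader, so a slow reader shows up only as a missed count, never as back-pressure on the producer.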

lrstanley commented 5 years ago

@Scoobed -- we did that (raised to 100K), and it happens at 1/10th the frequency, but still happens (and shows it's dropping 50k, 60k, 70k, 80k, 100k, etc). Our logging cluster is not heavily utilized, so we're clueless as to why it's happening.
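One way to reason about the "1/10th the frequency" observation: if envelopes arrive faster than they are drained, a larger buffer only postpones the overflow. The sketch below is a back-of-the-envelope model that assumes constant rates; the function name and the rates used are hypothetical and are not measurements from this thread.

```go
package main

import "fmt"

// secondsUntilOverflow estimates how long an initially empty buffer of the
// given size lasts when envelopes arrive faster than they are drained.
// It returns -1 when the consumer keeps up and the buffer never overflows.
func secondsUntilOverflow(size int, inPerSec, outPerSec float64) float64 {
	if inPerSec <= outPerSec {
		return -1
	}
	return float64(size) / (inPerSec - outPerSec)
}

func main() {
	// Hypothetical rates, chosen only for illustration.
	const in, out = 12000.0, 10000.0
	for _, size := range []int{10000, 100000} {
		fmt.Printf("size=%d overflows after ~%.0fs\n",
			size, secondsUntilOverflow(size, in, out))
	}
}
```

Under these assumptions, a tenfold larger buffer overflows a tenth as often but loses correspondingly more envelopes each time, which would be consistent with what is reported above. Only a faster consumer (or a lower input rate) removes the drops entirely.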

Scoobed commented 5 years ago

So over the past few days we did some testing around the size of the diode. We have tried multiple sizes, such as 100,000 and 500,000, but the Missed Logs message still happens. We also stopped logging when we made it 500,000.

My understanding is that the flow is the following (please correct me if I'm wrong):

Route Event Thread --> takes a message from the Firehose and puts it in an Envelope in the Diode
Diode --> a one-way buffer that can be overwritten
ReadLogsBuffer --> consumes the Envelope from the Diode and sends the message to rsyslog

@shinji62 -- Is it possible to increase the number of threads on the consume side? From what we are seeing after enabling the stats server, it does not look like it is able to keep up. Would you have any other thoughts on this piece and its performance?

// Start consumer and reading ingest loop
func (f *FirehoseNozzle) Start(ctx context.Context) {
	f.consumeFirehose()
	wg.Add(2)
	go f.routeEvent(ctx)
	go f.ReadLogsBuffer(ctx)
}
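For reference, the general pattern of running several consumers in parallel can be sketched as follows. This is not firehose-to-syslog code: `drain`, the string channel, and the `send` callback are hypothetical stand-ins for the buffer reader and the syslog writer, and the sketch assumes that delivery order does not need to be preserved.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// drain starts `workers` goroutines that each pull envelopes from in and
// pass them to send. It returns once in is closed and fully drained.
// With more than one worker, delivery order is no longer guaranteed.
func drain(in <-chan string, workers int, send func(string)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range in {
				send(msg)
			}
		}()
	}
	wg.Wait()
}

func main() {
	in := make(chan string, 100)
	for i := 0; i < 100; i++ {
		in <- fmt.Sprintf("envelope-%d", i)
	}
	close(in)

	var sent int64
	// send stands in for the syslog write; here it only counts calls.
	drain(in, 4, func(string) { atomic.AddInt64(&sent, 1) })
	fmt.Println("sent:", atomic.LoadInt64(&sent))
}
```

Whether this helps in practice depends on where the time is actually spent: extra workers only increase throughput if the per-message send is the bottleneck and the downstream syslog endpoint accepts concurrent writes.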

sba30 commented 4 years ago

Getting this issue also. Has anyone been able to figure it out?