Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License

File.Move for Linux dedicated function apps results in lost telemetry #6337

Open · ConnorMcMahon opened this issue 4 years ago

ConnorMcMahon commented 4 years ago

Functions logs are written to file by LWAS to guarantee fault tolerance, and the files are periodically rotated so that we do not reach excessive sizes. Our log forwarding service - which is responsible for reading the log files and forwarding the data to our telemetry pipeline - listens for a specific inotify event associated with rotations. When it receives this event, it registers the log file as rotated and searches for the new log file. The File.Move API that is being used to rotate our functions logs is not emitting the correct inotify event, resulting in the log forwarder not being signaled to look for the new logs. By default, our platform initiates periodic restarts of the log forwarding service, which mitigates this issue. However, it can still cause log loss in high volume workloads.

The problem method is LinuxAppServiceFileLogger.RollFiles().

Solution: Use system calls to make sure that the inotify event is emitted.
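
To make the failure mode easy to reproduce, here is a small diagnostic sketch (not host code): it watches a directory with System.IO.FileSystemWatcher, which is implemented on top of inotify on Linux, and then performs a File.Move rotation like the one described above. The directory and file names are hypothetical; the point is only to observe which events a watcher of the log directory is signaled with.

    using System;
    using System.IO;

    // Diagnostic sketch only: observe which filesystem events a File.Move-based
    // rotation produces. On Linux, FileSystemWatcher is backed by inotify, so this
    // approximates what a log forwarder watching the same directory would see.
    internal static class RotationEventProbe
    {
        private static void Main()
        {
            string logDir = "/tmp/functions-logs"; // hypothetical log directory
            Directory.CreateDirectory(logDir);

            using var watcher = new FileSystemWatcher(logDir)
            {
                NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite,
                EnableRaisingEvents = true
            };
            watcher.Created += (_, e) => Console.WriteLine($"Created: {e.Name}");
            watcher.Deleted += (_, e) => Console.WriteLine($"Deleted: {e.Name}");
            watcher.Renamed += (_, e) => Console.WriteLine($"Renamed: {e.OldName} -> {e.Name}");

            string current = Path.Combine(logDir, "functions.log");
            File.WriteAllText(current, "a log line\n");

            // Simulate the kind of roll the issue describes.
            File.Move(current, Path.Combine(logDir, "functions1.log"));

            Console.ReadLine(); // keep the watcher alive long enough to print the events
        }
    }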

davidmrdavid commented 4 years ago

@kashimiz please take a look at the LinuxAppServiceFileLogger here for an example of how to port your existing file logger to a POSIX-compliant rename syscall. That should suffice to notify fluent-bit / td-agent about the file roll.
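
For reference, a minimal sketch of what such a port could look like, assuming RollFiles keeps its current shape and only the move call changes. The PosixFile helper below is illustrative and not code from the host:

    using System.IO;
    using System.Runtime.InteropServices;

    // Illustrative helper: call the POSIX rename(2) syscall directly instead of
    // File.Move, so the rotation emits the inotify events the forwarder listens for.
    internal static class PosixFile
    {
        [DllImport("libc", SetLastError = true)]
        private static extern int rename(string oldPath, string newPath);

        public static void Rename(string oldPath, string newPath)
        {
            if (rename(oldPath, newPath) != 0)
            {
                // On .NET Core, GetLastWin32Error surfaces errno for SetLastError imports.
                throw new IOException($"rename('{oldPath}', '{newPath}') failed with errno {Marshal.GetLastWin32Error()}.");
            }
        }
    }

Inside RollFiles, the existing File.Move(currentFile, rolledFile) call would then become PosixFile.Rename(currentFile, rolledFile); currentFile and rolledFile are placeholder names for whatever the method actually uses.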

ConnorMcMahon commented 4 years ago

It may be worth pulling this out into a separate library, as there are plenty of projects in the App Service Linux dedicated ecosystem that could benefit from a thread-safe file logger that works with our version of td-agent.

davidmrdavid commented 4 years ago

I just realized that the ProcessLogQueue method of the LinuxAppServiceFileLogger should have an await this.InternalProcessLogQueue(); call right after its while loop to make sure the buffer flushes when Stop() is called.

Otherwise, there's a chance logs will be lost there as well.

So currently we have:

        private async Task ProcessLogQueue(object state)
        {
            while (!this.cancellationTokenSource.IsCancellationRequested)
            {
                await this.InternalProcessLogQueue();
                await Task.Delay(TimeSpan.FromSeconds(this.FlushFrequencySeconds), this.cancellationTokenSource.Token).ContinueWith(task => { });
            }

            // ReSharper disable once FunctionNeverReturns
        }

but we should have

        private async Task ProcessLogQueue(object state)
        {
            while (!this.cancellationTokenSource.IsCancellationRequested)
            {
                await this.InternalProcessLogQueue();
                await Task.Delay(TimeSpan.FromSeconds(this.FlushFrequencySeconds), this.cancellationTokenSource.Token).ContinueWith(task => { });
                // if the token is cancelled right here and we have a non-empty buffer, we won't log that data.
            }

            await this.InternalProcessLogQueue(); // make sure the buffer gets flushed after the token is cancelled
        }

We've also fixed this in our version of the logger.
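
To spell out why the trailing flush matters, here is a hypothetical sketch of the shutdown side, assuming Stop() cancels the token and then waits on the task returned by ProcessLogQueue; processingTask is an assumed field name, not necessarily what the logger uses:

        // Hypothetical shutdown shape, for illustration only. Cancelling the token
        // ends the while loop above; waiting on the processing task then guarantees
        // the final InternalProcessLogQueue() call has drained the buffer before
        // Stop() returns.
        public void Stop(TimeSpan timeout)
        {
            this.cancellationTokenSource.Cancel();
            this.processingTask?.Wait(timeout);
        }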

v-anvari commented 3 years ago

Hi @ConnorMcMahon, since this issue is a year old, can you confirm whether it should be closed or still requires triage?

ConnorMcMahon commented 3 years ago

This issue requires triage.

davidmrdavid commented 3 years ago

I don't think this issue has been prioritized yet. I'm happy to help with the coding so long as we get the opinion of some Functions Host owners. @kashimiz, could we get your thoughts on this? :)

v-anvari commented 3 years ago

Thank you @davidmrdavid, I will follow up with the team on this issue.

v-anvari commented 3 years ago

Hi @davidmrdavid, I checked on this issue internally and they suggested that you submit a PR.

As suggested, tagging @balag0 / @divyagandhisethi for further input.

davidmrdavid commented 3 years ago

Sounds good. In that case, I'll assign this ticket to myself so I can track it and follow up with a PR. Thank you @v-anvari.