Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com

Processing delays when scaling out Event Hub triggered function #1601

Open vjrantal opened 7 years ago

vjrantal commented 7 years ago

Repro steps

  1. Create a Cosmos DB account with the DocumentDB API (and a collection with enough throughput so that the database is not the bottleneck)

  2. Create an Event Hub and a consumer group for the function to use

  3. Create a Node.js function that uses the created Event Hub as the trigger and the Cosmos DB as the output (see https://gist.github.com/vjrantal/1d5d3361bf6f76a2c698c3104b4262ec for the function.json and index.js files I used in my test; a rough sketch of that setup is shown after this list)

  4. Start bombarding the Event Hub with messages and observe how Azure Functions scales out the processing (I used ApacheBench to post 1 million events to the hub, using a round-robin model to distribute events evenly across the allocated Event Hub partitions)
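
A rough sketch of that setup (not the exact gist contents; the hub, consumer group, connection, database, and collection names below are placeholders):

function.json:

    {
        "bindings": [
            {
                "type": "eventHubTrigger",
                "name": "eventHubMessage",
                "direction": "in",
                "path": "myeventhub",
                "consumerGroup": "myconsumergroup",
                "connection": "EventHubConnection",
                "cardinality": "one"
            },
            {
                "type": "documentDB",
                "name": "outputDocument",
                "direction": "out",
                "databaseName": "mydatabase",
                "collectionName": "mycollection",
                "connection": "CosmosDBConnection",
                "createIfNotExists": false
            }
        ]
    }

index.js:

    module.exports = function (context, eventHubMessage) {
        // Forward the incoming event to the Cosmos DB (DocumentDB API) output binding.
        context.bindings.outputDocument = eventHubMessage;
        context.done();
    };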

Expected behavior

When the runtime determines that scaling out is required, it should happen without significant delays in event processing.

Actual behavior

As seen in the screenshot below, there seems to be a gap in event processing when the initial scale-out occurs. It also seems that the per-request duration slows down significantly after the scale-out (e.g., the 50th percentile duration before the scale-out is something like ~18 ms, but afterwards something like ~2167 ms).

[screenshot: functions-scaling]

Related information

I ran the test multiple times and tested with cardinality many instead of one. I also tested without the Document DB output. Furthermore, the number of Event Hub partitions did not seem to matter (I tried with 4 and 32). In all 5 trials, I was able to reproduce similar behavior.

During the gap in event processing, I can see that some errors occur, which I think I could not catch in my Node.js "user code":

[screenshot: functions-error]

The timestamps in the screenshots are in UTC, and the Function App is called documentdb-function-1496656123610. All the resources I created were in the North Europe region. I had not touched host.json, so everything there was at the default values.

vjrantal commented 7 years ago

I ran the test again today with the Event Hub auto-inflate feature enabled. That didn't seem to make a difference.

Another interesting view of the data is below; it seems to indicate that on each new instance there is an initial timeout before event processing even starts. The timeout is 5 minutes, which I believe comes from the default max execution time.

[screenshot: initial-timeouts]

The above screenshot is from today, from a function named documentdb-function-1496656123615.

Off-topic from the original issue: I used the query requests | count to check how many requests were tracked in Application Insights, and in my case that returned 138412 (I waited long enough to make sure everything was flushed). The reason this is interesting is that the number of requests I originally sent to the Event Hub was 200001, and in the output Document DB I can see that 208681 documents were created (the pipeline produces duplicate events, but that is expected). So it seems that not all requests get logged to Application Insights - I wonder why that is?

brettsam commented 7 years ago

@vjrantal -- you mentioned that you tried this without your DocumentDB output binding. Were the findings the same in that case?

As for the App Insights output, we turn on Adaptive Sampling by default, which means that not all records were necessarily sent. If you'd like to turn it off, you can do that with the host.json settings here: https://github.com/Azure/Azure-Functions/wiki/App-Insights-(Preview)#hostjson-settings (see applicationInsights -> sampling -> isEnabled).
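
For reference, turning sampling off should look something like this in host.json (a sketch of just the relevant section):

    {
        "applicationInsights": {
            "sampling": {
                "isEnabled": false
            }
        }
    }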

vjrantal commented 7 years ago

@brettsam The issue was also reproducible when I removed the Document DB output. In that case, the function's Node.js code was basically a no-op that just calls context.done().
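
Roughly like this:

    module.exports = function (context, eventHubMessage) {
        // No processing and no output bindings - just complete the execution.
        context.done();
    };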

The Application Insights request count is now clear and the adaptive sampling explains the delta in the count - thanks for the information!

brettsam commented 7 years ago

@paulbatum has been doing a lot of Event Hubs performance investigations -- he may have some insights into what you're seeing.

paulbatum commented 7 years ago

@vjrantal It's worth noting that if you want to keep adaptive sampling on, you can still get execution counts using the aggregated metrics we emit. For example:

customMetrics
| where name == "WeatherDataProcessor Count"
| summarize sum(value)

If you notice any discrepancies with these metrics then do let us know because they should be accurate.

For the processing delays, how long are they? Judging from your graph it looks like they are about 5 minutes in length, is that correct?

vjrantal commented 7 years ago

@paulbatum Yes, the delay seems to be 5 minutes, which I believe comes from the default max execution time.

paulbatum commented 7 years ago

@vjrantal interesting observation. You can change the max execution time in the host.json file (see here https://github.com/Azure/azure-webjobs-sdk-script/wiki/host.json). It would be great if you could lower that to something like 3 minutes and see if the latency spikes correlate...

vjrantal commented 7 years ago

@paulbatum I now re-tried the scenario by having the following as my host.json:

{
    "functionTimeout": "00:01:00"
}

It looks like the spikes don't directly correlate since I can see that the longest delays were longer than 1 minute:

[screenshot: 2017-08-04 at 14:46:20]

In this run, I had a bit over a million messages in an Event Hub and a function like this https://gist.github.com/vjrantal/49803ec907f78e15b7f5669d013a6262 that processes those messages.

Here is the Application Insights overview timeline from the test run:

[screenshot: 2017-08-04 at 14:52:09]

vjrantal commented 7 years ago

@paulbatum I have now re-tried by toggling the cardinality value, and it looks like the specific issue I'm referring to may happen only with "cardinality": "one". I do understand that nobody would do that in a high-scale scenario, but it might still be useful to find out the root cause.

As a related finding, when I had "cardinality": "many", I noticed there were some delays at the beginning of processing, but then things worked smoothly:

[screenshot: 2017-09-22 at 11:11:22]

What I observed was that at the beginning of processing, the Event Hubs metrics showed some amount of "Other Errors":

[screenshot: 2017-09-22 at 9:57:18]

I tried to hunt down where in the code these errors are handled. I might be wrong, but could it be that they are silently ignored by these lines https://github.com/Azure/azure-webjobs-sdk/blob/901764c1aa610685ce850748103c8554aef3d250/src/Microsoft.Azure.WebJobs.ServiceBus/EventHubs/EventHubListener.cs#L117-L121? Also, I was wondering how checkpointing works when this kind of error occurs, i.e., could I lose messages, or will they get re-processed?

If useful, I can re-run the scenario once there is an environment where the "Other Errors" are logged and see what the output is.

paulbatum commented 7 years ago

Ahh, that makes a lot of sense; I feel stupid for not considering cardinality earlier. We have determined that the single-threaded nature of JavaScript makes batched execution (i.e. multiple messages passed to a single function execution) even more critical from a performance perspective. We are considering removing 'single' cardinality for Event Hubs in our next major version release.
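
To illustrate (a rough sketch, not your exact function): with "cardinality": "many", the trigger hands the function an array of events, so a single invocation can work through a whole batch:

    module.exports = function (context, eventHubMessages) {
        // With "cardinality": "many", eventHubMessages is an array of events.
        eventHubMessages.forEach(function (message) {
            // process each event here
        });
        context.done();
    };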

Thank you for the pointer on the silent suppression of event hubs errors. I filed the following bug to track this: https://github.com/Azure/azure-webjobs-sdk/issues/1357. I agree it would be good to investigate what those errors are once we have instrumentation for them.

Checkpointing continues if errors occur when processing messages. There is no built-in functionality to store failed messages; you need to do it yourself. If you don't, then yes, you might have dropped messages. The best way to address this is with appropriate error handling. Unfortunately, the experience for this is not consistent across languages. Here is an example of catching errors when flushing output bindings in C#:

    public static async Task Run(
        EventData sensorEvent,
        IAsyncCollector<EventData> outputWeatherData,
        TraceWriter log)
    {
        // ProcessSensorEvent stands in for whatever processing your function does.
        var output = ProcessSensorEvent(sensorEvent);

        try
        {
            // Add and flush the output binding explicitly so that failures surface here.
            await outputWeatherData.AddAsync(output);
            await outputWeatherData.FlushAsync();
        }
        catch (Exception e)
        {
            // Handle the error, e.g. log sensorEvent somewhere for reprocessing.
        }
    }
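
In a Node.js function the output bindings set via context.bindings are flushed by the host after the function completes, so as far as I know there isn't an equivalent place to catch flush errors, but you can at least wrap your own processing and route failed events somewhere for reprocessing. A rough sketch (the outputWeatherData and deadLetterMessages binding names are hypothetical and would need to be defined in function.json):

    module.exports = function (context, eventHubMessage) {
        try {
            // Placeholder for whatever processing the function does.
            var output = { body: eventHubMessage, processedAt: new Date().toISOString() };
            context.bindings.outputWeatherData = output;
        } catch (e) {
            // Checkpointing advances regardless of failures, so capture the event
            // yourself (e.g. via a dead-letter output binding) for later reprocessing.
            context.log('Processing failed: ' + e.message);
            context.bindings.deadLetterMessages = eventHubMessage;
        }
        context.done();
    };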