Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com

Function Execution Time variation. #3932

Open · sadgit opened this issue 5 years ago

sadgit commented 5 years ago

Performance of functions is erratic, even though the work performed in each execution is identical:

  1. Accept incoming message from Service Bus Trigger
  2. Convert to Hub request
  3. Relay the message to six Notification Hubs in parallel async requests via NotificationHubClient (see the sketch after this list).
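For illustration, a minimal F# sketch of this fan-out, assuming placeholder hub names and a connection string (the real hub names and the hub-request conversion are not shown in this issue):

open Microsoft.Azure.NotificationHubs

// Placeholder hub names; the app relays to six hubs (one per iOS app).
let hubNames = [ "hub-1"; "hub-2"; "hub-3"; "hub-4"; "hub-5"; "hub-6" ]

let relayToAllHubs (connectionString: string) (payload: string) =
    hubNames
    |> List.map (fun name ->
        let client = NotificationHubClient.CreateClientFromConnectionString(connectionString, name)
        // Start all sends together so every hub gets equal priority.
        client.SendAppleNativeNotificationAsync(payload) |> Async.AwaitTask |> Async.Ignore)
    |> Async.Parallel
    |> Async.Ignore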

The two upstream functions also show the same unpredictable behaviour.

This function is part of a chain that needs to be timely (not necessarily fast), so the variability is materially affecting the user experience.

Investigative information

Repro steps

The issue is not reproducible on demand, but it is frequent (roughly 50% of executions range from slow to staggeringly slow).

Expected behavior

Consistent execution times with lower variability.

Actual behavior

As described.

Known workarounds

None. We are considering reverting to our V1 code, as it had more reliable performance.

mikhailshilkov commented 5 years ago

Adding @paulbatum @cgillum for visibility

cgillum commented 5 years ago

@sadgit How are you measuring execution time?

sadgit commented 5 years ago

@cgillum Through App Insights, using the performance panel.

I have seen the same function executing fast and slow simultaneously. I have been chasing this issue since before Christmas.

NB: this is on a dedicated App Service plan (not Consumption).

sadgit commented 5 years ago

Execution time - V2

[screenshot: V2 execution-time chart]

Execution Time - V1 (same code back ported)

[screenshot: V1 execution-time chart]

paulbatum commented 5 years ago

Hi, I had a chance to look at the logs for this function app and I can see an issue that is probably having a significant impact on performance. The worker process is constantly terminating and restarting, and while I am having trouble figuring out exactly why, it seems to correlate with executions of the function in_02_Notifier.

You should see evidence of this recycling yourself in app insights. For example, you should see the process ID for executions in this function app changing very frequently.

I am still investigating. It would be useful if you could take a look in app insights for any unusual error messages that might correlate to the process ID changes.
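One way to see this from the app side is to log the worker process ID on each invocation; a minimal F# sketch (not code from this app), where recycles show up as frequent PID changes in App Insights:

open System.Diagnostics
open Microsoft.Extensions.Logging

// Log the current worker PID; if it changes across invocations,
// the worker process is being recycled.
let logWorkerPid (log: ILogger) =
    log.LogInformation("Worker PID: {Pid}", Process.GetCurrentProcess().Id)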

sadgit commented 5 years ago

@paulbatum - The function is not terminating with an exception (to my knowledge). If you look a few days back, you will get a better view. I'm working on reverting the code to V1, as that is more performant.

That function takes an inbound message and sends it out to six NotificationHubClient instances. This is a workaround for Notification Hubs' 1:1 relationship with iOS apps.

The messages are issued in parallel to give equal priority to each message.

paulbatum commented 5 years ago

Even going a few days back (for example, around 2019-01-06T14:00 UTC) I'm seeing these crashes. Will send another update.

sadgit commented 5 years ago

2019-01-12T10:57 UTC - I disabled in_02_Notifier to see whether the 5-minute pings still result in the host restart. The restart does seem to happen after each timer trigger. Also, the timer trigger executions are not showing in App Insights (perhaps due to the restart event).

sadgit commented 5 years ago

@paulbatum - Found a mistake in my ping code that was throwing an exception. I wasn't expecting the platform to restart on a function-internal exception; I assumed (dangerous word) that the platform would catch the exception and log it gracefully.

Have now changed the behaviour of the function to log the error to telemetry and finish with a clean exit.
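For reference, a hedged sketch of that change; the function name, schedule, and doPing body are assumptions, not the actual code:

open Microsoft.Azure.WebJobs
open Microsoft.Extensions.Logging

// Hypothetical stand-in for the real ping work that was throwing.
let doPing () : unit = ()

[<FunctionName("ping")>]
let ping ([<TimerTrigger("0 */5 * * * *")>] timer: TimerInfo) (log: ILogger) =
    try
        doPing ()
    with ex ->
        // Log to telemetry and return cleanly so the exception
        // never escapes the invocation.
        log.LogError(ex, "Ping failed; logged and continuing")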

Have re-enabled the notifier task; will now monitor performance without restarts.

sadgit commented 5 years ago

[screenshot: V1 vs V2 execution-time comparison]

We have now ported all our functions to run in parallel (in competition) on the same host. Clearly the V1 versions outperform their V2 twins, even though the code is practically identical (only TraceWriter/ILogger changes).
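For context, the kind of change involved touches only the logging surface; a hedged illustration (not the actual code):

open Microsoft.Azure.WebJobs.Host   // V1: TraceWriter
open Microsoft.Extensions.Logging   // V2: ILogger

// The V1 twin logs via TraceWriter, the V2 twin via ILogger;
// the rest of the function body is identical.
let logV1 (log: TraceWriter) (msg: string) = log.Info(sprintf "Relayed %s" msg)
let logV2 (log: ILogger) (msg: string) = log.LogInformation("Relayed {Msg}", msg)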

The V1 twin is both faster and less variable in its execution.

out_01_D seems to be slower on average; however, the 50th percentiles are V1 = 62 ms, V2 = 85 ms.

It is early days, but this seems to confirm that V1 is faster and more consistent.

sadgit commented 5 years ago

@paulbatum - We still have performance issues, and also host-restart issues when an exception is raised in a client function - see #3945.

paulbatum commented 5 years ago

Hello, sorry for the delay. I do not have any major breakthroughs to report, just a few quick observations.

Firstly, it is valid to assume that the platform will catch and handle most exceptions. Thank you for filing the separate issue; I replied there.

I looked at the logs and could see your app running up until 2019-01-16 18:00 UTC. I analyzed the logs for the last hour or so of execution and did not see any signs of the process crashing. I do see that in_02_Notifier sometimes failed with an exception:

Executed 'in_02_Notifier' (Failed, Id=2e1f5420-eb66-4fbc-8426-d94c9d7a37a6)
2019-01-16 17:10:07.1386858

Microsoft.Azure.WebJobs.Host.FunctionInvocationException : Exception while executing function: in_02_Notifier ---> System.ArgumentNullException : Value cannot be null.
Parameter name: value
   at Newtonsoft.Json.JsonConvert.DeserializeObject(String value,Type type,JsonSerializerSettings settings)
   at Newtonsoft.Json.JsonConvert.DeserializeObject[T](String value,JsonSerializerSettings settings)
   at Vspan.Af01.Functions.Notifier.NotificationManager.AsNotification(ILogger log,String message) at D:\a\1\s\AzureFunctions\vspan.af01\functions\Notifier\NotificationManager.fs : 82

The functions host continues to operate normally after logging this exception. Are you seeing this exception in your App Insights?
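For reference, a guard along these lines (a hedged sketch, not the app's actual AsNotification code) avoids handing a null message body to JsonConvert:

open System
open Newtonsoft.Json

// JsonConvert.DeserializeObject throws ArgumentNullException on null input,
// so check the raw message before deserializing.
let tryAsNotification<'T> (message: string) : 'T option =
    if String.IsNullOrWhiteSpace message then None
    else Some (JsonConvert.DeserializeObject<'T> message)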

So far my goal has been to focus on your errors, and then focus on performance once the application is stable. Have you been able to stabilize the V2 version of your app? If so, can you give me a time window to look at?

sadgit commented 5 years ago

We have switched off the production V2 apps and replaced each with a V1 equivalent. The V2 vox...fs02 app was disabled at 18:00 on 2019-01-16, as you can see. We will maintain both codebases but will only run V1 in production.

I could trap the empty message, though I am unsure why those messages were being submitted. Are you able to see the app af01-vox... with function names v1_*? You will note the difference in performance between the twins.

Any time in the last week should give you a decent measure.

sadgit commented 5 years ago

@paulbatum - Any news or thoughts? We still have a small number of functions running as V2, but the performance is frankly comical. Why?

paulbatum commented 5 years ago

@sadgit Right now, I don't really have enough information to move this forward. Let me try to help you understand - here's an anonymized view of activity on your subscription, looking at a count of executions and min/avg/max execution time (in ms) by individual function for the following timerange:

| where PreciseTimeStamp > datetime(2019-02-19 00:00)
| where PreciseTimeStamp < datetime(2019-02-20 00:00)
AppName FunctionName HostVersion ExecutionCount Min(ms) Avg(ms) Max(ms)
d672a21e86 f9a08ee1aa 2.0.12285.0 8 1 481 1438
d672a21e86 cdc97c8419 2.0.12285.0 37 116 10229.2 65486
d672a21e86 cba1971eaf 2.0.12285.0 27 11 208.89 997
4e3c783a7b f9a08ee1aa 2.0.12285.0 8 34 905.38 3330
cc5cbaa0bb ed4663c63f 1.0.12299.0 288 89 236.68 10361
cc5cbaa0bb 9e991a8dff 1.0.12299.0 288 0 18.57 401
cc5cbaa0bb 1d1af1432c 1.0.12299.0 288 0 3.5 62
b2c7666918 ed4663c63f 1.0.12299.0 288 17 219.81 4605
b2c7666918 cb684e8911 1.0.12299.0 100 16 300.64 3796
b2c7666918 aca27413cf 1.0.12299.0 106 31 216.09 3890
b2c7666918 9e991a8dff 1.0.12299.0 395 0 26.07 3927
b2c7666918 1d1af1432c 1.0.12299.0 232 0 143.49 10844
a40cd5691a ef453e0d2a 1.0.12299.0 215 15 147.28 6171
a40cd5691a e297f463a0 1.0.12299.0 3 1234 1749.67 2390
a40cd5691a d8f205258e 1.0.12299.0 136 15 165.69 6218
a40cd5691a bb8f821757 1.0.12299.0 51 62 379.37 8469
a40cd5691a a82358fb13 1.0.12299.0 2 3657 3664.5 3672
a40cd5691a 9e7559c61e 1.0.12299.0 1 922 922 922
a40cd5691a 97e7020c2a 1.0.12299.0 108 15 64.94 639
a40cd5691a 97ae4e3728 1.0.12299.0 215 15 180.47 8579
a40cd5691a 8d93b4148e 1.0.12299.0 13 15 269.23 1187
a40cd5691a 7dfb4cf677 1.0.12299.0 51 62 320.44 6332
a40cd5691a 7726dae8bc 1.0.12299.0 6 46 173.83 374
a40cd5691a 758d61f26a 1.0.12299.0 288 78 274.09 7424
a40cd5691a 71d2a0ec92 1.0.12299.0 6 46 207.83 437
a40cd5691a 6232101675 1.0.12299.0 1 892 892 892
a40cd5691a 4ab78dd469 1.0.12299.0 28 78 505.67 8421
a40cd5691a 267d294ad2 1.0.12299.0 3 1203 1739.33 2390
a40cd5691a 1f0e62c375 1.0.12299.0 2 3672 3672 3672
a40cd5691a 04b7d688da 1.0.12299.0 9 31 143.78 578
90171b8209 ef453e0d2a 1.0.12299.0 106 15 220.81 5599
90171b8209 eb926af53e 1.0.12299.0 151 30 130.05 8702
90171b8209 e6ab2200c3 1.0.12299.0 83 29 66.78 239
90171b8209 e297f463a0 1.0.12299.0 2 1281 1367 1453
90171b8209 d8f205258e 1.0.12299.0 73 15 78.76 406
90171b8209 cde0fb0dec 1.0.12299.0 83 27 62.08 222
90171b8209 c5f1e96cbf 1.0.12299.0 498 13 102.77 7012
90171b8209 c35ed3c895 1.0.12299.0 155 15 142.63 8845
90171b8209 bb8f821757 1.0.12299.0 41 46 581.41 8829
90171b8209 aa4dc910fd 1.0.12299.0 106 31 155.52 2438
90171b8209 a81f83e979 1.0.12299.0 3 46 66.33 93
90171b8209 9e7559c61e 1.0.12299.0 6 15 462.6 2609
90171b8209 984f5f4fb0 1.0.12299.0 136 89 340.14 6267
90171b8209 97e7020c2a 1.0.12299.0 50 15 62.28 312
90171b8209 97ae4e3728 1.0.12299.0 106 15 292.79 8170
90171b8209 8d93b4148e 1.0.12299.0 4 109 124.5 140
90171b8209 8b34dbc2c0 1.0.12299.0 95 41 86.04 406
90171b8209 7dfb4cf677 1.0.12299.0 41 46 466.84 6855
90171b8209 758d61f26a 1.0.12299.0 1435 0 60.23 7609
90171b8209 6232101675 1.0.12299.0 6 15 306 1159
90171b8209 4ab78dd469 1.0.12299.0 23 63 123.12 421
90171b8209 4116674323 1.0.12299.0 136 78 290.06 2698
90171b8209 3f20f70dd2 1.0.12299.0 106 31 166.4 2504
90171b8209 267d294ad2 1.0.12299.0 2 1281 1367 1453
90171b8209 170aaa4464 1.0.12299.0 189 31 140.89 9001
90171b8209 06fb8462fd 1.0.12299.0 95 44 92.61 406
90171b8209 04b7d688da 1.0.12299.0 4 128 156.75 187
2b5ae2f345 ed4663c63f 1.0.12299.0 288 31 293.03 19390
2b5ae2f345 cb684e8911 1.0.12299.0 97 15 284.81 1937
2b5ae2f345 aca27413cf 1.0.12299.0 104 15 198.18 1609
2b5ae2f345 9e991a8dff 1.0.12299.0 401 0 28.38 3578
2b5ae2f345 2157c80019 1.0.12299.0 489 0 55.76 4753
2b5ae2f345 1d1af1432c 1.0.12299.0 252 0 69.75 1531

A couple of notes:

I hope you can appreciate the amount of uncertainty that last point introduces! The data I have simply cannot give me an apples-to-apples comparison between V1 and V2 for your scenario. It is difficult for me to map all of this to your statement about "comical" performance, but that might simply be because you have access to richer information than I do.

My recommendation for how to move this forward: capture a profile of a minute or two of activity that you consider problematic. I'm happy to take a look at such a profile. The easiest way to capture these profiles is through Kudu (this blog post is a little old, but it basically covers it). Ideally, when you share the profile with me, you would include some specifics, such as "why did function invocation IDs x/y/z take so long", and you'd give me a UTC timeframe of when the profile was captured so I can match it up with my system logs. Also, a profile with "Collect IIS Events" ticked is more useful (but much larger).

sadgit commented 5 years ago

@paulbatum

Investigative information

Timestamp: 2019-03-08T16:01:33.5712015Z
Function App version (1.0 or 2.0): 2.0
Function App name:
Function name(s) (as appropriate): v2_exception_throw_without_catch
Operation ID: f95a75a5811b8f4a890d6d23817ab89d
Region: UK South

Also in V1:

Timestamp: 2019-03-08T16:01:33.428643Z
Function App version (1.0 or 2.0): 1.0
Function App name:
Function name(s) (as appropriate): v1_exception_throw_without_catch
Operation ID: 6b50f47a-e70d-43b8-b63b-be522292504c
Region: UK South

Repro Steps

Enqueue a message with body = reboot.
A message with body = catch exercises the same function with a try-catch wrapper applied.
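A hedged reconstruction of the repro pair; the queue names and connection setting are placeholders, and the actual code is in the repo linked below:

open Microsoft.Azure.WebJobs
open Microsoft.Extensions.Logging

// The uncaught variant: the simulated failure escapes to the host.
[<FunctionName("v2_exception_throw_without_catch")>]
let throwWithoutCatch
    ([<ServiceBusTrigger("repro-queue", Connection = "ServiceBus")>] body: string)
    (log: ILogger) =
    if body = "reboot" then failwith "simulated failure"

// The twin with a try-catch wrapper: the same failure is logged and swallowed.
[<FunctionName("v2_exception_throw_with_catch")>]
let throwWithCatch
    ([<ServiceBusTrigger("repro-queue-catch", Connection = "ServiceBus")>] body: string)
    (log: ILogger) =
    try
        if body = "reboot" then failwith "simulated failure"
    with ex ->
        log.LogError(ex, "Caught simulated failure")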

Some changes may have been made to mitigate this error. Code: https://github.com/sadgit/AzureFunctionFailureModes