Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com

Function Execution Time variation. #3932

Open · sadgit opened this issue 5 years ago

sadgit commented 5 years ago

Performance of functions is erratic, even though the work performed in each execution is identical:

  1. Accept incoming message from Service Bus Trigger
  2. Convert to Hub request
  3. Relay the message to six Notification Hubs in parallel async requests via NotificationHubClient (see the sketch after this list).
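For illustration, a minimal F# sketch of this fan-out, assuming placeholder hub names and a connection string (the real hub names and the hub-request conversion are not shown in this issue):

open Microsoft.Azure.NotificationHubs

// Placeholder hub names; the app relays to six hubs (one per iOS app).
let hubNames = [ "hub-1"; "hub-2"; "hub-3"; "hub-4"; "hub-5"; "hub-6" ]

let relayToAllHubs (connectionString: string) (payload: string) =
    hubNames
    |> List.map (fun name ->
        let client = NotificationHubClient.CreateClientFromConnectionString(connectionString, name)
        // Start all sends together so every hub gets equal priority.
        client.SendAppleNativeNotificationAsync(payload) |> Async.AwaitTask |> Async.Ignore)
    |> Async.Parallel
    |> Async.Ignore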

The two upstream functions also show the same unpredictable behaviour.

This function is part of a chain that needs to be timely (not necessarily fast), so the variability is materially affecting the user experience.

Investigative information

Repro steps

The issue is not reproducible on demand, but it is frequent (roughly 50% of executions range from slow to staggeringly slow).

Expected behavior

Consistent execution times with lower variability.

Actual behavior

As described.

Known workarounds

None. We are considering reverting to our V1 code, as it had more reliable performance.

mikhailshilkov commented 5 years ago

Adding @paulbatum @cgillum for visibility

cgillum commented 5 years ago

@sadgit How are you measuring execution time?

sadgit commented 5 years ago

@cgillum Through App Insights, using the performance panel.

I have seen the same function executing fast and slow simultaneously. I have been chasing this issue since before Christmas.

NB: this is on a dedicated App Service plan (not Consumption).

sadgit commented 5 years ago

Execution time - V2

[screenshot: V2 execution-time chart]

Execution Time - V1 (same code back ported)

[screenshot: V1 execution-time chart]

paulbatum commented 5 years ago

Hi, I had a chance to look at the logs for this function app and I can see an issue that is probably having a significant impact on performance. The worker process is constantly terminating and restarting, and while I am having trouble figuring out exactly why, it seems to correlate with executions of the function in_02_Notifier.

You should see evidence of this recycling yourself in app insights. For example, you should see the process ID for executions in this function app changing very frequently.

I am still investigating. It would be useful if you could take a look in app insights for any unusual error messages that might correlate to the process ID changes.
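One way to see this from the app side is to log the worker process ID on each invocation; a minimal F# sketch (not code from this app), where recycles show up as frequent PID changes in App Insights:

open System.Diagnostics
open Microsoft.Extensions.Logging

// Log the current worker PID; if it changes across invocations,
// the worker process is being recycled.
let logWorkerPid (log: ILogger) =
    log.LogInformation("Worker PID: {Pid}", Process.GetCurrentProcess().Id)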

sadgit commented 5 years ago

@paulbatum - The function is not terminating with an exception (to my knowledge). If you look a few days back, you will get a better view. I'm working on reverting the code to V1, as that is more performant.

That function takes an inbound message and sends it out to six NotificationHubClient instances. This is a workaround for Notification Hubs' 1:1 relationship with iOS apps.

The messages are issued in parallel to give equal priority to each message.

paulbatum commented 5 years ago

Even going a few days back (for example, around 2019-01-06T14:00 UTC) I'm seeing these crashes. Will send another update.

sadgit commented 5 years ago

2019-01-12T10:57 UTC - I disabled in_02_Notifier to see whether the 5-minute pings still result in the host restart. The restart does seem to happen after each timer trigger. Also, the timer trigger executions are not showing in App Insights (perhaps due to the restart event).

sadgit commented 5 years ago

@paulbatum - Found a mistake in my ping code that was throwing an exception. I wasn't expecting the platform to restart on a function-internal exception; I assumed (dangerous word) that the platform would catch the exception and log it gracefully.

Have now changed the behaviour of the function to log the error to telemetry and finish with a clean exit.
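For reference, a hedged sketch of that change; the function name, schedule, and doPing body are assumptions, not the actual code:

open Microsoft.Azure.WebJobs
open Microsoft.Extensions.Logging

// Hypothetical stand-in for the real ping work that was throwing.
let doPing () : unit = ()

[<FunctionName("ping")>]
let ping ([<TimerTrigger("0 */5 * * * *")>] timer: TimerInfo) (log: ILogger) =
    try
        doPing ()
    with ex ->
        // Log to telemetry and return cleanly so the exception
        // never escapes the invocation.
        log.LogError(ex, "Ping failed; logged and continuing")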

Have re-enabled the notifier task; will now monitor performance without restarts.

sadgit commented 5 years ago

[screenshot: V1 vs V2 execution-time comparison]

We have now ported all our functions to run in parallel (in competition) on the same host. Clearly the V1 versions outperform their V2 twins, even though the code is practically identical (only TraceWriter/ILogger changes).
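For context, the kind of change involved touches only the logging surface; a hedged illustration (not the actual code):

open Microsoft.Azure.WebJobs.Host   // V1: TraceWriter
open Microsoft.Extensions.Logging   // V2: ILogger

// The V1 twin logs via TraceWriter, the V2 twin via ILogger;
// the rest of the function body is identical.
let logV1 (log: TraceWriter) (msg: string) = log.Info(sprintf "Relayed %s" msg)
let logV2 (log: ILogger) (msg: string) = log.LogInformation("Relayed {Msg}", msg)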

The V1 twin is both faster and less variable in its execution.

out_01_D seems to be slower on average; however, the 50th percentiles are V1 = 62 ms, V2 = 85 ms.

It is early days, but this seems to confirm that V1 is faster and more consistent.

sadgit commented 5 years ago

@paulbatum - We still have performance issues, and also host-restart issues when an exception is raised in a client function - see #3945.

paulbatum commented 5 years ago

Hello, sorry for the delay. I do not have any major breakthroughs to report, just a few quick observations.

Firstly, it is valid to assume that the platform will catch and handle most exceptions. Thank you for filing the separate issue; I replied there.

I looked at the logs and could see your app running up until 2019-01-16 18:00 UTC. I analyzed the logs for the last hour or so of execution and did not see any signs of the process crashing. I do see that in_02_Notifier sometimes failed with an exception:

Executed 'in_02_Notifier' (Failed, Id=2e1f5420-eb66-4fbc-8426-d94c9d7a37a6)
2019-01-16 17:10:07.1386858

Microsoft.Azure.WebJobs.Host.FunctionInvocationException : Exception while executing function: in_02_Notifier ---> System.ArgumentNullException : Value cannot be null.
Parameter name: value
   at Newtonsoft.Json.JsonConvert.DeserializeObject(String value,Type type,JsonSerializerSettings settings)
   at Newtonsoft.Json.JsonConvert.DeserializeObject[T](String value,JsonSerializerSettings settings)
   at Vspan.Af01.Functions.Notifier.NotificationManager.AsNotification(ILogger log,String message) at D:\a\1\s\AzureFunctions\vspan.af01\functions\Notifier\NotificationManager.fs : 82

The functions host continues to operate normally after logging this exception. Are you seeing this exception in your App Insights?
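For reference, a guard along these lines (a hedged sketch, not the app's actual AsNotification code) avoids handing a null message body to JsonConvert:

open System
open Newtonsoft.Json

// JsonConvert.DeserializeObject throws ArgumentNullException on null input,
// so check the raw message before deserializing.
let tryAsNotification<'T> (message: string) : 'T option =
    if String.IsNullOrWhiteSpace message then None
    else Some (JsonConvert.DeserializeObject<'T> message)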

So far my goal has been to focus on your errors, and then focus on performance once the application is stable. Have you been able to stabilize the V2 version of your app? If so, can you give me a time window to look at?

sadgit commented 5 years ago

We have switched off the production V2 apps and replaced each with a V1 equivalent. The V2 vox...fs02 app was disabled at 18:00 on 2019-01-16, as you can see. We will maintain both codebases but will only run V1 in production.

I could trap the empty message, though I am unsure why those messages were being submitted. Are you able to see the app af01-vox... with function names v1_*? You will note the difference in performance between the twins.

Any time in the last week should give you a decent measure.

sadgit commented 5 years ago

@paulbatum - Any news or thoughts? We still have a small number of functions running as V2, but the performance is frankly comical. Why?

paulbatum commented 5 years ago

@sadgit Right now, I don't really have enough information to move this forward. Let me try to help you understand - here's an anonymized view of activity on your subscription, looking at a count of executions and min/avg/max execution time (in ms) by individual function for the following timerange:

| where PreciseTimeStamp > datetime(2019-02-19 00:00)
| where PreciseTimeStamp < datetime(2019-02-20 00:00)
AppName FunctionName HostVersion ExecutionCount Min(ms) Avg(ms) Max(ms)
d672a21e86 f9a08ee1aa 2.0.12285.0 8 1 481 1438
d672a21e86 cdc97c8419 2.0.12285.0 37 116 10229.2 65486
d672a21e86 cba1971eaf 2.0.12285.0 27 11 208.89 997
4e3c783a7b f9a08ee1aa 2.0.12285.0 8 34 905.38 3330
cc5cbaa0bb ed4663c63f 1.0.12299.0 288 89 236.68 10361
cc5cbaa0bb 9e991a8dff 1.0.12299.0 288 0 18.57 401
cc5cbaa0bb 1d1af1432c 1.0.12299.0 288 0 3.5 62
b2c7666918 ed4663c63f 1.0.12299.0 288 17 219.81 4605
b2c7666918 cb684e8911 1.0.12299.0 100 16 300.64 3796
b2c7666918 aca27413cf 1.0.12299.0 106 31 216.09 3890
b2c7666918 9e991a8dff 1.0.12299.0 395 0 26.07 3927
b2c7666918 1d1af1432c 1.0.12299.0 232 0 143.49 10844
a40cd5691a ef453e0d2a 1.0.12299.0 215 15 147.28 6171
a40cd5691a e297f463a0 1.0.12299.0 3 1234 1749.67 2390
a40cd5691a d8f205258e 1.0.12299.0 136 15 165.69 6218
a40cd5691a bb8f821757 1.0.12299.0 51 62 379.37 8469
a40cd5691a a82358fb13 1.0.12299.0 2 3657 3664.5 3672
a40cd5691a 9e7559c61e 1.0.12299.0 1 922 922 922
a40cd5691a 97e7020c2a 1.0.12299.0 108 15 64.94 639
a40cd5691a 97ae4e3728 1.0.12299.0 215 15 180.47 8579
a40cd5691a 8d93b4148e 1.0.12299.0 13 15 269.23 1187
a40cd5691a 7dfb4cf677 1.0.12299.0 51 62 320.44 6332
a40cd5691a 7726dae8bc 1.0.12299.0 6 46 173.83 374
a40cd5691a 758d61f26a 1.0.12299.0 288 78 274.09 7424
a40cd5691a 71d2a0ec92 1.0.12299.0 6 46 207.83 437
a40cd5691a 6232101675 1.0.12299.0 1 892 892 892
a40cd5691a 4ab78dd469 1.0.12299.0 28 78 505.67 8421
a40cd5691a 267d294ad2 1.0.12299.0 3 1203 1739.33 2390
a40cd5691a 1f0e62c375 1.0.12299.0 2 3672 3672 3672
a40cd5691a 04b7d688da 1.0.12299.0 9 31 143.78 578
90171b8209 ef453e0d2a 1.0.12299.0 106 15 220.81 5599
90171b8209 eb926af53e 1.0.12299.0 151 30 130.05 8702
90171b8209 e6ab2200c3 1.0.12299.0 83 29 66.78 239
90171b8209 e297f463a0 1.0.12299.0 2 1281 1367 1453
90171b8209 d8f205258e 1.0.12299.0 73 15 78.76 406
90171b8209 cde0fb0dec 1.0.12299.0 83 27 62.08 222
90171b8209 c5f1e96cbf 1.0.12299.0 498 13 102.77 7012
90171b8209 c35ed3c895 1.0.12299.0 155 15 142.63 8845
90171b8209 bb8f821757 1.0.12299.0 41 46 581.41 8829
90171b8209 aa4dc910fd 1.0.12299.0 106 31 155.52 2438
90171b8209 a81f83e979 1.0.12299.0 3 46 66.33 93
90171b8209 9e7559c61e 1.0.12299.0 6 15 462.6 2609
90171b8209 984f5f4fb0 1.0.12299.0 136 89 340.14 6267
90171b8209 97e7020c2a 1.0.12299.0 50 15 62.28 312
90171b8209 97ae4e3728 1.0.12299.0 106 15 292.79 8170
90171b8209 8d93b4148e 1.0.12299.0 4 109 124.5 140
90171b8209 8b34dbc2c0 1.0.12299.0 95 41 86.04 406
90171b8209 7dfb4cf677 1.0.12299.0 41 46 466.84 6855
90171b8209 758d61f26a 1.0.12299.0 1435 0 60.23 7609
90171b8209 6232101675 1.0.12299.0 6 15 306 1159
90171b8209 4ab78dd469 1.0.12299.0 23 63 123.12 421
90171b8209 4116674323 1.0.12299.0 136 78 290.06 2698
90171b8209 3f20f70dd2 1.0.12299.0 106 31 166.4 2504
90171b8209 267d294ad2 1.0.12299.0 2 1281 1367 1453
90171b8209 170aaa4464 1.0.12299.0 189 31 140.89 9001
90171b8209 06fb8462fd 1.0.12299.0 95 44 92.61 406
90171b8209 04b7d688da 1.0.12299.0 4 128 156.75 187
2b5ae2f345 ed4663c63f 1.0.12299.0 288 31 293.03 19390
2b5ae2f345 cb684e8911 1.0.12299.0 97 15 284.81 1937
2b5ae2f345 aca27413cf 1.0.12299.0 104 15 198.18 1609
2b5ae2f345 9e991a8dff 1.0.12299.0 401 0 28.38 3578
2b5ae2f345 2157c80019 1.0.12299.0 489 0 55.76 4753
2b5ae2f345 1d1af1432c 1.0.12299.0 252 0 69.75 1531

A couple of notes:

I hope you can appreciate the amount of uncertainty that last point introduces! The data I have simply cannot give me an apples-to-apples comparison between V1 and V2 for your scenario. It is difficult for me to map all of this to your statement about "comical" performance, but that might simply be because you have access to richer information than I do.

My recommendation for how to move this forward: capture a profile of a minute or two of activity that you consider problematic. I'm happy to take a look at such a profile. The easiest way to capture these profiles is through Kudu (this blog post is a little old, but it basically covers it). Ideally, when you share the profile with me, you would include some specifics, such as "why did function invocation IDs x/y/z take so long", and you'd give me a UTC timeframe of when the profile was captured so I can match it up with my system logs. Also, a profile with "Collect IIS Events" ticked is more useful (but much larger).

sadgit commented 5 years ago

@paulbatum

Investigative information

Timestamp: 2019-03-08T16:01:33.5712015Z
Function App version (1.0 or 2.0): 2.0
Function App name:
Function name(s) (as appropriate): v2_exception_throw_without_catch
Operation ID: f95a75a5811b8f4a890d6d23817ab89d
Region: UK South

Also in V1:

Timestamp: 2019-03-08T16:01:33.428643Z
Function App version (1.0 or 2.0): 1.0
Function App name:
Function name(s) (as appropriate): v1_exception_throw_without_catch
Operation ID: 6b50f47a-e70d-43b8-b63b-be522292504c
Region: UK South

Repro Steps

Enqueue a message with body = reboot.
A message with body = catch exercises the same function with a try-catch wrapper applied.
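A hedged reconstruction of the repro pair; the queue names and connection setting are placeholders, and the actual code is in the repo linked below:

open Microsoft.Azure.WebJobs
open Microsoft.Extensions.Logging

// The uncaught variant: the simulated failure escapes to the host.
[<FunctionName("v2_exception_throw_without_catch")>]
let throwWithoutCatch
    ([<ServiceBusTrigger("repro-queue", Connection = "ServiceBus")>] body: string)
    (log: ILogger) =
    if body = "reboot" then failwith "simulated failure"

// The twin with a try-catch wrapper: the same failure is logged and swallowed.
[<FunctionName("v2_exception_throw_with_catch")>]
let throwWithCatch
    ([<ServiceBusTrigger("repro-queue-catch", Connection = "ServiceBus")>] body: string)
    (log: ILogger) =
    try
        if body = "reboot" then failwith "simulated failure"
    with ex ->
        log.LogError(ex, "Caught simulated failure")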

Some changes may have been made to mitigate this error. Code: https://github.com/sadgit/AzureFunctionFailureModes