sadgit opened this issue 5 years ago
Adding @paulbatum @cgillum for visibility
@sadgit How are you measuring execution time?
@cgillum Through AppInsights - using the performance panel.
Have seen the same function executing fast and slow - simultaneously. Have been chasing this issue since before Xmas.
NB - This is on a dedicated service plan (not consumption)
Execution time - V2
Execution time - V1 (same code back-ported)
Hi, I had a chance to look at the logs for this function app and I can see an issue that is probably having significant impact on performance. The worker process is constantly terminating and restarting and while I am having trouble figuring out exactly why, it seems like it correlates with executions of the function in_02_Notifier.
You should see evidence of this recycling yourself in app insights. For example, you should see the process ID for executions in this function app changing very frequently.
I am still investigating. It would be useful if you could take a look in app insights for any unusual error messages that might correlate to the process ID changes.
@paulbatum - The function is not terminating with an exception (to my knowledge). If you look a few days back you will get a better view. I'm working on reverting the code to V1 as that is more performant.
That function takes an inbound message and sends it out to 6 NotificationHub clients. This is a workaround for NHub's 1:1 relationship with iOS apps.
The messages are issued in parallel to give equal priority to each message.
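Roughly, the fan-out is a map over the hub clients followed by a parallel wait, along these lines (a minimal sketch; the hub list and payload handling are illustrative, not the actual production code):

```fsharp
open Microsoft.Azure.NotificationHubs

// Fan one inbound message out to every NotificationHubClient in parallel,
// so no hub's send is queued behind another.
let sendToAllHubs (hubs: NotificationHubClient list) (payload: string) =
    hubs
    |> List.map (fun hub -> hub.SendAppleNativeNotificationAsync(payload) |> Async.AwaitTask)
    |> Async.Parallel
    |> Async.Ignore
```

The function body would run the resulting Async (e.g. via Async.StartAsTask) and await it before completing the invocation.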
Even going a few days back (for example, around 2019-01-06T14:00 UTC) I'm seeing these crashes. Will send another update.
2019-01-12T10:57 UTC - I disabled in_02_Notifier to see whether the 5-minute pings still result in the host restart. It does seem to be happening after each timer trigger. Also, the timer trigger executions are not showing in AppInsights (perhaps due to the restart event).
@paulbatum - Found a mistake in my ping code that was throwing an exception - I wasn't expecting the platform to restart on a function-internal exception. I assumed (dangerous word) that the platform would catch the exception and log it gracefully.
Have now changed the behaviour of the function to log the error to telemetry and finish with a clean exit.
Have re-enabled notifier task - will now monitor performance without restarts.
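The change is essentially a try/with wrapper around the timer body, something like this sketch (the function name, schedule, and doPingWork helper are illustrative stand-ins, not the real code):

```fsharp
open Microsoft.Azure.WebJobs
open Microsoft.Extensions.Logging

// Stand-in for the real ping work; in the original code this could throw.
let doPingWork () = ()

[<FunctionName("ping")>]
let run ([<TimerTrigger("0 */5 * * * *")>] timer: TimerInfo, log: ILogger) =
    try
        doPingWork ()
    with ex ->
        // Log the failure to telemetry and return normally, rather than
        // letting the exception escape the invocation.
        log.LogError(ex, "Ping failed; logged and exiting cleanly")
```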
We have now ported all our functions to run in parallel (in competition) on the same host. Clearly the V1 versions outperform their V2 twins - the code is practically identical (only TraceWriter/ILogger changes).
The V1 twin is both faster and less variable in its execution.
out_01_D seems to be slower on average; the 50th percentiles are v1 = 62 ms and v2 = 85 ms.
It is early days - but it seems to confirm that V1 is faster and more consistent.
@paulbatum - Still have performance issues and also host restart issues when exception raised in client function - see #3945
Hello, sorry for the delay. I do not have any major breakthroughs to report, just a few quick observations.
Firstly, assuming that the platform will catch and handle most exceptions is a valid assumption to make. Thank you for filing the separate issue; I replied there.
I looked at the logs and could see your app running up until 1/16 18:00 UTC. I analyzed the logs for the last hour or so of execution. I did not see any signs of the process crashing. I do see that in_02_Notifier sometimes failed with an exception:
Executed 'in_02_Notifier' (Failed, Id=2e1f5420-eb66-4fbc-8426-d94c9d7a37a6)
2019-01-16 17:10:07.1386858
Microsoft.Azure.WebJobs.Host.FunctionInvocationException : Exception while executing function: in_02_Notifier ---> System.ArgumentNullException : Value cannot be null.
Parameter name: value
at Newtonsoft.Json.JsonConvert.DeserializeObject(String value,Type type,JsonSerializerSettings settings)
at Newtonsoft.Json.JsonConvert.DeserializeObject[T](String value,JsonSerializerSettings settings)
at Vspan.Af01.Functions.Notifier.NotificationManager.AsNotification(ILogger log,String message) at D:\a\1\s\AzureFunctions\vspan.af01\functions\Notifier\NotificationManager.fs : 82
The functions host continues to operate normally after logging this exception. Are you seeing this exception in your app insights?
So far my goal has been to focus on your errors, and then focus on performance once the application is stable. Have you been able to stabilize the V2 version of your app? If so, can you give me a time window to look at?
We have switched off the production V2 apps and replaced each with a V1 equivalent. The V2 vox...fs02 was disabled at 18:00 UTC on 2019-01-16, as you can see. We will maintain both codebases, but will only run V1 in production.
I could trap the empty message (roughly as sketched below), though I am unsure why those were being submitted. Are you able to see the app af01-vox... with function names v1_*? You will note the difference in performance between the twins.
Any time in the last week should give you a decent measure.
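For reference, the trap is just a null/empty guard before deserialising, along these lines (the Notification shape and function name here are placeholders, not the real types in NotificationManager.fs):

```fsharp
open System
open Microsoft.Extensions.Logging
open Newtonsoft.Json

// Placeholder shape for the payload.
type Notification = { Title: string; Body: string }

let tryAsNotification (log: ILogger) (message: string) : Notification option =
    if String.IsNullOrWhiteSpace(message) then
        // Trap the empty/null message instead of letting DeserializeObject
        // throw the ArgumentNullException shown in the stack trace above.
        log.LogWarning("Skipping empty notification message")
        None
    else
        Some (JsonConvert.DeserializeObject<Notification>(message))
```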
@paulbatum - Any news or thoughts? We still have a small number of functions running as V2 - but the performance is frankly comical. Why?
@sadgit Right now, I don't really have enough information to move this forward. Let me try to help you understand - here's an anonymized view of activity on your subscription, looking at a count of executions and min/avg/max execution time (in ms) by individual function for the following timerange:
| where PreciseTimeStamp > datetime(2019-02-19 00:00)
| where PreciseTimeStamp < datetime(2019-02-20 00:00)
AppName | FunctionName | HostVersion | ExecutionCount | Min (ms) | Avg (ms) | Max (ms) |
---|---|---|---|---|---|---|
d672a21e86 | f9a08ee1aa | 2.0.12285.0 | 8 | 1 | 481 | 1438 |
d672a21e86 | cdc97c8419 | 2.0.12285.0 | 37 | 116 | 10229.2 | 65486 |
d672a21e86 | cba1971eaf | 2.0.12285.0 | 27 | 11 | 208.89 | 997 |
4e3c783a7b | f9a08ee1aa | 2.0.12285.0 | 8 | 34 | 905.38 | 3330 |
cc5cbaa0bb | ed4663c63f | 1.0.12299.0 | 288 | 89 | 236.68 | 10361 |
cc5cbaa0bb | 9e991a8dff | 1.0.12299.0 | 288 | 0 | 18.57 | 401 |
cc5cbaa0bb | 1d1af1432c | 1.0.12299.0 | 288 | 0 | 3.5 | 62 |
b2c7666918 | ed4663c63f | 1.0.12299.0 | 288 | 17 | 219.81 | 4605 |
b2c7666918 | cb684e8911 | 1.0.12299.0 | 100 | 16 | 300.64 | 3796 |
b2c7666918 | aca27413cf | 1.0.12299.0 | 106 | 31 | 216.09 | 3890 |
b2c7666918 | 9e991a8dff | 1.0.12299.0 | 395 | 0 | 26.07 | 3927 |
b2c7666918 | 1d1af1432c | 1.0.12299.0 | 232 | 0 | 143.49 | 10844 |
a40cd5691a | ef453e0d2a | 1.0.12299.0 | 215 | 15 | 147.28 | 6171 |
a40cd5691a | e297f463a0 | 1.0.12299.0 | 3 | 1234 | 1749.67 | 2390 |
a40cd5691a | d8f205258e | 1.0.12299.0 | 136 | 15 | 165.69 | 6218 |
a40cd5691a | bb8f821757 | 1.0.12299.0 | 51 | 62 | 379.37 | 8469 |
a40cd5691a | a82358fb13 | 1.0.12299.0 | 2 | 3657 | 3664.5 | 3672 |
a40cd5691a | 9e7559c61e | 1.0.12299.0 | 1 | 922 | 922 | 922 |
a40cd5691a | 97e7020c2a | 1.0.12299.0 | 108 | 15 | 64.94 | 639 |
a40cd5691a | 97ae4e3728 | 1.0.12299.0 | 215 | 15 | 180.47 | 8579 |
a40cd5691a | 8d93b4148e | 1.0.12299.0 | 13 | 15 | 269.23 | 1187 |
a40cd5691a | 7dfb4cf677 | 1.0.12299.0 | 51 | 62 | 320.44 | 6332 |
a40cd5691a | 7726dae8bc | 1.0.12299.0 | 6 | 46 | 173.83 | 374 |
a40cd5691a | 758d61f26a | 1.0.12299.0 | 288 | 78 | 274.09 | 7424 |
a40cd5691a | 71d2a0ec92 | 1.0.12299.0 | 6 | 46 | 207.83 | 437 |
a40cd5691a | 6232101675 | 1.0.12299.0 | 1 | 892 | 892 | 892 |
a40cd5691a | 4ab78dd469 | 1.0.12299.0 | 28 | 78 | 505.67 | 8421 |
a40cd5691a | 267d294ad2 | 1.0.12299.0 | 3 | 1203 | 1739.33 | 2390 |
a40cd5691a | 1f0e62c375 | 1.0.12299.0 | 2 | 3672 | 3672 | 3672 |
a40cd5691a | 04b7d688da | 1.0.12299.0 | 9 | 31 | 143.78 | 578 |
90171b8209 | ef453e0d2a | 1.0.12299.0 | 106 | 15 | 220.81 | 5599 |
90171b8209 | eb926af53e | 1.0.12299.0 | 151 | 30 | 130.05 | 8702 |
90171b8209 | e6ab2200c3 | 1.0.12299.0 | 83 | 29 | 66.78 | 239 |
90171b8209 | e297f463a0 | 1.0.12299.0 | 2 | 1281 | 1367 | 1453 |
90171b8209 | d8f205258e | 1.0.12299.0 | 73 | 15 | 78.76 | 406 |
90171b8209 | cde0fb0dec | 1.0.12299.0 | 83 | 27 | 62.08 | 222 |
90171b8209 | c5f1e96cbf | 1.0.12299.0 | 498 | 13 | 102.77 | 7012 |
90171b8209 | c35ed3c895 | 1.0.12299.0 | 155 | 15 | 142.63 | 8845 |
90171b8209 | bb8f821757 | 1.0.12299.0 | 41 | 46 | 581.41 | 8829 |
90171b8209 | aa4dc910fd | 1.0.12299.0 | 106 | 31 | 155.52 | 2438 |
90171b8209 | a81f83e979 | 1.0.12299.0 | 3 | 46 | 66.33 | 93 |
90171b8209 | 9e7559c61e | 1.0.12299.0 | 6 | 15 | 462.6 | 2609 |
90171b8209 | 984f5f4fb0 | 1.0.12299.0 | 136 | 89 | 340.14 | 6267 |
90171b8209 | 97e7020c2a | 1.0.12299.0 | 50 | 15 | 62.28 | 312 |
90171b8209 | 97ae4e3728 | 1.0.12299.0 | 106 | 15 | 292.79 | 8170 |
90171b8209 | 8d93b4148e | 1.0.12299.0 | 4 | 109 | 124.5 | 140 |
90171b8209 | 8b34dbc2c0 | 1.0.12299.0 | 95 | 41 | 86.04 | 406 |
90171b8209 | 7dfb4cf677 | 1.0.12299.0 | 41 | 46 | 466.84 | 6855 |
90171b8209 | 758d61f26a | 1.0.12299.0 | 1435 | 0 | 60.23 | 7609 |
90171b8209 | 6232101675 | 1.0.12299.0 | 6 | 15 | 306 | 1159 |
90171b8209 | 4ab78dd469 | 1.0.12299.0 | 23 | 63 | 123.12 | 421 |
90171b8209 | 4116674323 | 1.0.12299.0 | 136 | 78 | 290.06 | 2698 |
90171b8209 | 3f20f70dd2 | 1.0.12299.0 | 106 | 31 | 166.4 | 2504 |
90171b8209 | 267d294ad2 | 1.0.12299.0 | 2 | 1281 | 1367 | 1453 |
90171b8209 | 170aaa4464 | 1.0.12299.0 | 189 | 31 | 140.89 | 9001 |
90171b8209 | 06fb8462fd | 1.0.12299.0 | 95 | 44 | 92.61 | 406 |
90171b8209 | 04b7d688da | 1.0.12299.0 | 4 | 128 | 156.75 | 187 |
2b5ae2f345 | ed4663c63f | 1.0.12299.0 | 288 | 31 | 293.03 | 19390 |
2b5ae2f345 | cb684e8911 | 1.0.12299.0 | 97 | 15 | 284.81 | 1937 |
2b5ae2f345 | aca27413cf | 1.0.12299.0 | 104 | 15 | 198.18 | 1609 |
2b5ae2f345 | 9e991a8dff | 1.0.12299.0 | 401 | 0 | 28.38 | 3578 |
2b5ae2f345 | 2157c80019 | 1.0.12299.0 | 489 | 0 | 55.76 | 4753 |
2b5ae2f345 | 1d1af1432c | 1.0.12299.0 | 252 | 0 | 69.75 | 1531 |
A couple of notes:
I hope you can appreciate the amount of uncertainty that last point introduces! The data I have simply cannot give me an apples-to-apples comparison between V1 and V2 for your scenario. It is difficult for me to map all of this to your statement about "comical" performance, but that might simply be because you have access to richer information than I do.
My recommendation for how to move this forward: capture a profile for a minute or two of activity that you consider problematic. I'm happy to take a look at such a profile. The easiest way to capture these profiles is through Kudu (this blog post is a little old but it basically covers it). Ideally, when you share the profile with me, you would include some specifics such as "why did function invocation IDs x/y/z take so long" and a UTC timeframe of when the profile was captured so I can match it up with my system logs. Also, a profile with "Collect IIS Events" ticked is more useful (but much larger).
@paulbatum
Investigative information
Timestamp: 2019-03-08T16:01:33.5712015Z
Function App version (1.0 or 2.0): 2.0
Function App name:
Function name(s) (as appropriate): v2_exception_throw_without_catch
Operation ID: f95a75a5811b8f4a890d6d23817ab89d
Region: UK South

Also in V1:
Timestamp: 2019-03-08T16:01:33.428643Z
Function App version (1.0 or 2.0): 1.0
Function App name:
Function name(s) (as appropriate): v1_exception_throw_without_catch
Operation ID: 6b50f47a-e70d-43b8-b63b-be522292504c
Region: UK South

Repro Steps
Enqueue a message with body = reboot. A message with body = catch applies a try-catch wrapper to the same function.
Some changes may have been made to mitigate this error.
Code: https://github.com/sadgit/AzureFunctionFailureModes
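For context, the two variants look roughly like this sketch (queue names and the with-catch function name are illustrative; the actual code is in the repository linked above):

```fsharp
open Microsoft.Azure.WebJobs
open Microsoft.Extensions.Logging

// Variant 1: the exception escapes the function (no catch).
[<FunctionName("v2_exception_throw_without_catch")>]
let throwWithoutCatch ([<QueueTrigger("v2-throw-queue")>] message: string, log: ILogger) =
    if message = "reboot" then
        failwith "Simulated failure"

// Variant 2: the same work wrapped in try/with, logged instead of propagated.
[<FunctionName("v2_exception_throw_with_catch")>]
let throwWithCatch ([<QueueTrigger("v2-catch-queue")>] message: string, log: ILogger) =
    try
        if message = "reboot" then failwith "Simulated failure"
    with ex ->
        log.LogError(ex, "Caught and logged; the invocation completes normally")
```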
Performance of functions is erratic, e.g.:
The work performed in each execution is identical.
The two upstream functions also show the same unpredictable behaviour.
This function is part of a chain that needs to be timely (not fast), so the variability is materially affecting the user experience.
Investigative information
Repro steps
The issue is not reliably reproducible, but it is frequent (50% of executions are slow to staggeringly slow).
Expected behavior
Consistent execution time with smaller variability.
Actual behavior
As described.
Known workarounds
NONE - We are considering reverting to our V1 code as it had more reliable performance.