Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License
1.94k stars 441 forks source link

Functions Host goes into bad state after OutOfMemoryException #3624

Open cgillum opened 6 years ago

cgillum commented 6 years ago

Investigative information

INTERNAL ONLY - see IcM #8756524 for an instance of this issue with all the details

Summary

The functions host (1.0) was shutdown due to an OutOfMemoryException. The functions host failed to restart itself after being in this condition. The process was kept alive by AlwaysOn, but the host didn't restart until a restart command was manually sent to the App Service infrastructure

Here is the callstack of the OutOfMemoryException. Note that it happens in user code:

Microsoft.Azure.WebJobs.Host.FunctionInvocationException : Exception while executing function: Functions.ExtractAndNotify ---> System.Reflection.TargetInvocationException : Exception has been thrown by the target of an invocation. ---> 
System.OutOfMemoryException : Exception of type 'System.OutOfMemoryException' was thrown.
   at Submission#0.PostErrorMessage(String queueItem,String errorMessage,String baseMessage)
   at Submission#0.Run(String queueItem,TraceWriter log) 
   End of inner exception
   at System.RuntimeMethodHandle.InvokeMethod(Object target,Object[] arguments,Signature sig,Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj,Object[] parameters,Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj,BindingFlags invokeAttr,Binder binder,Object[] parameters,CultureInfo culture)
   at async Microsoft.Azure.WebJobs.Script.Description.DotNetFunctionInvoker.InvokeCore(Object[] parameters,FunctionInvocationContext context)
   at async Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters)
   at async Microsoft.Azure.WebJobs.Host.Executors.VoidTaskMethodInvoker`2.InvokeAsync[TReflected,TReturnType](TReflected instance,Object[] arguments)
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync[TReflected,TReturnValue](Object instance,Object[] arguments)
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeAsync(IFunctionInvoker invoker,ParameterHelper parameterHelper,CancellationTokenSource timeoutTokenSource,CancellationTokenSource functionCancellationTokenSource,Boolean throwOnTimeout,TimeSpan timerInterval,IFunctionInstance instance)
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstance instance,ParameterHelper parameterHelper,TraceWriter traceWriter,CancellationTokenSource functionCancellationTokenSource)
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(??)
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(??) 
   End of inner exception

Functions logs at the time:

TIMESTAMP Pid Summary
2018-10-11 06:30:31.1958763 16512 An unhandled exception has occurred. Host is shutting down.
2018-10-11 06:30:31.2273143 16512 Message has reached MaxDequeueCount of 1. Moving message to queue 'redacted-queue-poison'.
2018-10-11 06:30:31.2709304 16512 Stopping Host
2018-10-11 06:30:31.2808059 16512 Singleton lock released (redacted/Host.Functions.MaterialsReceiveSubscriber.Listener)
2018-10-11 06:30:31.3614276 16512 Job host stopped
2018-10-11 06:30:52.6076518 16512 EnvSettings: SettingsProcessor.Start apply environments to appSettings or connectionStrings
2018-10-11 06:30:52.6076949 16512 EnvSettings: SettingsProcessor.Start completed.
(..long delay – no data..)
2018-10-11 09:02:30.3865452 11360 EnvSettings: SettingsProcessor.Start apply environments to appSettings or connectionStrings
2018-10-11 09:02:30.3868933 11360 EnvSettings: SettingsProcessor.Start completed.
2018-10-11 09:02:30.7642940 11360 Created directory snapshot.
2018-10-11 09:02:30.8362800 11360 Reading host configuration file 'D:\home\site\wwwroot\host.json'

Expected behavior

The functions host should automatically restart after shutting down.

Actual behavior

The functions host stayed in a shutdown state. It continued to respond successfully to ping requests from AlwaysOn.

Known workarounds

Manually restart the function app when it gets into this state.

mathewc commented 6 years ago

Potentially related to https://github.com/Azure/azure-functions-host/issues/2907

mathewc commented 6 years ago

@brettsam Didn't you commit a change fixing this?

brettsam commented 6 years ago

There's been a fix to the v1 ScriptHost leaks, yes. (https://github.com/Azure/azure-webjobs-sdk/pull/1971)

But someone can create a leak themselves. They could add to a static List in each invocation, for example.

The issue here seems to be that when we do get into that scenario, we can't recover without a manual reset. I know we have code checking IsFatal() in a couple of places -- maybe we're missing it here?

mwolin-gcmlp commented 5 years ago

@brettsam any update here? We seem to have created a our own leak and requiring manual intervention is not ideal. Any idea what the timing for a fix is, or if there is a workaround to automatically resetting a failed host?

soninaren commented 5 years ago

@mwolin-gcmlp and @cgillum Can you elaborate on the scenario that would lead to OutOfMemory exception but not cause host to crash. We need to be able to understand / reproduce this to put the correct fix in.

ghost commented 5 years ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

cgillum commented 5 years ago

@soninaren this was quite some time ago and I don't imagine we have the code available. However, from the call stack you can see that the OutOfMemoryException happens in the C# script code, so I think the repro would be:

  1. Create a C# script function app in the portal that allocates too much memory
  2. Trigger it and validate that there is an OutOfMemoryException
  3. Wait a few minutes and try to invoke another function

If you can successfully invoke another function after the OutOfMemoryException then I think we can go ahead and close this issue.