Open jaekim0908 opened 8 years ago
I'm getting this issue with Azure Functions v0.6, using blob triggers, when running multiple S2 instances (can't use the dynamic service plan due to region availability).
2016-09-26T13:15:11.513 An unhandled exception has occurred. Host is shutting down.
2016-09-26T13:15:11.513 Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (409) Conflict. ---> System.Net.WebException: The remote server returned an error: (409) Conflict.
at Microsoft.WindowsAzure.Storage.Shared.Protocol.HttpResponseParsers.ProcessExpectedStatusCodeNoException[T](HttpStatusCode expectedStatusCode, HttpStatusCode actualStatusCode, T retVal, StorageCommandBase`1 cmd, Exception ex)
at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.<>c__DisplayClass33.<RenewLeaseImpl>b__32(RESTCommand`1 cmd, HttpWebResponse resp, Exception ex, OperationContext ctx)
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndGetResponse[T](IAsyncResult getResponseResult)
--- End of inner exception stack trace ---
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result)
at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass4.<CreateCallbackVoid>b__3(IAsyncResult ar)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.<ExecuteAsync>d__37.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.<RunAsync>d__d.MoveNext()
Request Information
RequestID:e3b20f97-0001-0019-22f7-17efc9000000
RequestDate:Mon, 26 Sep 2016 13:15:03 GMT
StatusMessage:The lease ID specified did not match the lease ID for the blob.
ErrorCode:LeaseIdMismatchWithLeaseOperation
2016-09-26T13:15:11.950 Stopping Host
@brettsam this is an important one to fix ASAP. @jaekim0908 is internal and can help us with additional repro details if needed, and he can help test the fix once it's ready.
@stephlocke -- are you seeing this regularly? If so, would you be able to turn on Verbose logging briefly to see if we can catch it again? Docs on the settings are here: https://github.com/Azure/azure-webjobs-sdk-script/wiki/host.json
Your settings would look like this:
{
  "tracing": {
    "consoleLevel": "verbose",
    "fileLoggingMode": "always"
  }
}
Will send more stuff along, but since pushing the new host tracing I'm getting a lot of this:
2016-10-03T13:52:39.526 Host lock lease renewed.
2016-10-03T13:52:38.479 Host instance '5bb77d899e9648e5d0604a9eebcb15e1' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:52:39.389 Host instance 'df677cfbf13e4b3330b3566dba85bd22' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:52:38.540 Host instance 'f84d303d334bcf4bf213ba8be2b0c052' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:52:39.379 Host instance '4053ca815b15b0baaf2d272c606d5637' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:52:39.035 Host instance '6e241969103c13eca8aaab7550ad6afb' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:53:02.493 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.493 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.803 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.803 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.182 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.182 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.666 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.666 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.801 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.801 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.464 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.464 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:03.138 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:03.138 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.545 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.545 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.858 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.858 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
Thanks -- that's expected. When you have a Singleton trigger (which Timers are -- you don't want the same timer trigger firing on all of your instances), each host will try to acquire a lease on a specific blob to ensure it's the only one running. Since only one host can have the lease, the others log that they don't have it, which is perfectly fine. The Verbose logs log this a lot -- which we actually have a bug tracking to minimize since it's generally not useful.
In this case, though, it may be :-)
I believe the error you're seeing is because one host has a lease and, when it goes to renew that lease, sees another host has stolen it away. That shouldn't happen. If it does, it should be very rare and coincide with some kind of issue in Storage where we can't communicate properly with the blob. If you see the 409 StorageException, I'll want to look at the timestamps of those 'Singleton lock' messages to see what may have happened.
BTW -- I just noticed you are using Blob triggers yet I mentioned Timer triggers. Blob triggers are Singletons as well.
Here is a slightly older one from the logs whilst I dig out / wait for another on verbose:
2016-10-01T16:06:10.909 An unhandled exception has occurred. Host is shutting down.
2016-10-01T16:06:10.909 Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (409) Conflict. ---> System.Net.WebException: The remote server returned an error: (409) Conflict.
at Microsoft.WindowsAzure.Storage.Shared.Protocol.HttpResponseParsers.ProcessExpectedStatusCodeNoException[T](HttpStatusCode expectedStatusCode, HttpStatusCode actualStatusCode, T retVal, StorageCommandBase`1 cmd, Exception ex)
at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.<>c__DisplayClass33.<RenewLeaseImpl>b__32(RESTCommand`1 cmd, HttpWebResponse resp, Exception ex, OperationContext ctx)
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndGetResponse[T](IAsyncResult getResponseResult)
--- End of inner exception stack trace ---
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result)
at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass4.<CreateCallbackVoid>b__3(IAsyncResult ar)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.<ExecuteAsync>d__37.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.<RunAsync>d__d.MoveNext()
Request Information
RequestID:8e7d23b6-0001-001c-15fd-1b1bb6000000
RequestDate:Sat, 01 Oct 2016 16:06:10 GMT
StatusMessage:The lease ID specified did not match the lease ID for the blob.
ErrorCode:LeaseIdMismatchWithLeaseOperation
Not sure if these 409s are a problem with the host or with the storage account when I'm trying to interact with a table:
2016-10-03T13:23:01.652 A ScriptHost error has occurred
2016-10-03T13:23:01.652 The remote server returned an error: (409) Conflict.
2016-10-03T13:23:01.652 Exception while executing function: Functions.blobtoeventhub. mscorlib: Exception has been thrown by the target of an invocation. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict. System: The remote server returned an error: (409) Conflict.
2016-10-03T13:23:01.652 Exception while executing function: Functions.blobtoeventhub
2016-10-03T13:23:01.652 The remote server returned an error: (409) Conflict.
2016-10-03T13:23:01.652 Executed: 'Functions.blobtoeventhub' (Failed)
2016-10-03T13:23:01.652 The remote server returned an error: (409) Conflict.
2016-10-03T13:23:01.652 Function had errors. See Azure WebJobs SDK dashboard for details. Instance ID is 'd6568a31-6b81-49cc-b264-76eba2dd2bdd'
2016-10-03T13:23:01.652 The remote server returned an error: (409) Conflict.
@stephlocke -- The 409s coming directly from functions are likely because you're trying to create something in storage with a name that already exists. I'm investigating the one that comes from the host itself -- with RenewLeaseCommand in the call stack.
Also, you can remove "fileLoggingMode": "always" from your host.json if you want to save some of your file space. That'll put it back to the default behavior of debugOnly. We now have runtime logs sent to our backend, so I am able to see your logs myself and there's no need for you to keep all of those (unless you want them logging all the time). I'll keep checking to see if I can catch this error, but so far I haven't seen it since you turned on verbose logs.
@stephlocke -- I realized I actually do have your previous verbose logs when you were seeing this crash, so feel free to remove those settings from your host.json.
What I see happening is:
Things are behaving as we would expect, but I'd like to figure out what could cause (A) to go silent and then come back?
Depending on what you mean by silence, this could be because the functions may have been processing large files at the time, so none may have been writing to the console for a few minutes. A lot of the older files processed were 30 MB files that took 30+ minutes to process.
In this case, by 'silence' I mean that I'm not seeing any of the Renewing Singleton lock (functionapp/WebJobs.Internal.Blobs.Listener) logs that our host writes out with every renewal. That should be happening quite often, and then they just stop for 3 minutes.
I did look at this instance and see that the CPU percentage was quite high (> 90%) just before this. I wonder if our lease-renewal operation just wasn't able to fire due to lack of resources. I'll keep investigating to see if I can repro this and if there's a better way to handle it.
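For context, that renewal is driven by an in-process timer loop. A simplified sketch of the idea (illustrative only, not the SDK's actual code; types are from the old Microsoft.WindowsAzure.Storage client) shows why CPU starvation matters: if the delay overruns the lease period, another host can grab the lease and the next renewal comes back as a 409.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

static class LeaseRenewalSketch
{
    // Illustrative sketch only -- not the SDK's real implementation.
    // The host holds a blob lease and must renew it before it expires.
    public static async Task RenewLeaseLoopAsync(
        CloudBlob lockBlob, string leaseId, CancellationToken ct)
    {
        TimeSpan leasePeriod = TimeSpan.FromSeconds(15); // assumed lease period
        TimeSpan renewalInterval = TimeSpan.FromTicks(leasePeriod.Ticks / 2);

        while (!ct.IsCancellationRequested)
        {
            // If this process is starved of CPU and the loop is delayed past
            // the lease period, another host may acquire the lease; the next
            // renewal then fails with 409 LeaseIdMismatchWithLeaseOperation.
            await lockBlob.RenewLeaseAsync(
                new AccessCondition { LeaseId = leaseId }, ct);
            await Task.Delay(renewalInterval, ct);
        }
    }
}
```

Renewing at half the lease period gives one missed renewal a chance to recover before the lease actually expires, which is a common pattern for lease-based locks.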
I'm loath to double my costs by jumping to an S3 -- I'm filling in a backlog of data, so I'm currently running 10 x S2 instances!
I'm going to see if I can figure out exactly where we're falling apart here under load and if there's a better way to catch this. I'll track my findings here.
For reference, this is also on https://social.msdn.microsoft.com/Forums/en-US/08f85617-1767-45d4-b6f4-d8344a835a6f/stability-problem-for-function-in-uk-west-region?forum=AzureFunctions (thread from @stephlocke).
My instance has now gone totally bananas. My live stream from one function is showing logs from the other. I'm thinking I'm going to have to look at whether I can make a new instance that has less backlog to process from the beginning. Hopefully just having throughput to cope with will reduce the volatility. Obviously it's not GA, so buyer beware, but it's rather frustrating that the EU and US instances worked fine when they were scaled up, and that I mainly seem to be having trouble in just one data centre.
It also seems pretty weird, tbh, that this error now happens so much, most often between midnight and 6am, that I get total downtime during those periods.
@stephlocke -- one thing you may want to try while we work on a solution is to reduce the number of items that your instance dequeues at once. This will obviously slow things down, but it may make them more stable. You can do this by changing queues.batchSize in host.json -- https://github.com/Azure/azure-webjobs-sdk-script/wiki/host.json. Try cutting it in half to 8 and see if that offers any improvement.
Note: even though you're using blobs, the invocation is still managed by a queue in the background, which is why this setting is relevant.
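For reference, the host.json change suggested above might look like this (only the queues section is shown; other settings are unchanged):

```json
{
  "queues": {
    "batchSize": 8
  }
}
```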
I actually had it set to the max of 32 -- it and the scan frequency seemed the main ways to get a decent amount of throughput across 10 machines when I was at peak volume.
Reduced down to 8 for now. Will let you know how it gets on.
Still fracked with reduced queue
I'm facing the same error when debugging locally a WebJob that has several scheduled singleton functions.
@abatishchev -- some questions: are you using [Singleton("key")] on multiple functions?
Both [Singleton] and [Singleton("{prop}")] (bound from one of the func args). The former have been in use for a while now and the latter were introduced recently, so I suspect them. Luckily for us this error doesn't happen in Azure, only during local debugging. But it occurs randomly and kills the console host, which is super annoying.
On my ongoing stability issue, I'm not sure having more than 2m blobs processed and sitting around in archives is good for the performance of the indexing etc, so I'm just doing an archiving process to delete everything older than a fortnight, which takes off about 1m blobs and 2m blob read receipts from the function. I'll then do it in finer grains until the amount is fairly small, but my backlog extends ~a week right now. Will let you know whether this helps.
*Update 2016-10-19: Turns out deleting that many blobs was a bigger technical challenge than anticipated. Still ongoing now (via a Function!), but for the past 24hrs my Functions have been up the entire time. So nice to see it processing properly!
@stephlocke -- We have some proposed fixes that may improve your situation, but I'd like to make sure I can reliably repro what you're seeing to ensure our approach will work. Would you mind sending me a direct email so we can get some more details on your function(s)? You can find my email on my profile -- https://github.com/brettsam.
I filed https://github.com/Azure/azure-webjobs-sdk/issues/899 to track a more general related issue.
Clearing the milestone so we can retriage/prioritize this.
Seeing this 409 LeaseIdMismatchWithLeaseOperation regularly with our blob trigger. We have very low volume and only 2 instances.
2017-11-28T03:28:58.536 Host lock lease acquired by instance ID '66525d2801dbf933e2e11518b127855f'.
info: Host.General[0]
Host lock lease acquired by instance ID '66525d2801dbf933e2e11518b127855f'.
2017-11-28T03:29:06.300 A ScriptHost error has occurred
2017-11-28T03:29:06.300 System.InvalidOperationException : Singleton lock renewal failed for blob 'locks/my-function/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 0001-01-01T00:00:00Z (-2147483648 milliseconds ago) with a duration of 0 milliseconds. The lease period was 15000 milliseconds.
2017-11-28T03:29:06.309 'locks/my-function/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 0001-01-01T00:00:00Z (-2147483648 milliseconds ago) with a duration of 0 milliseconds. The lease period was 15000 milliseconds.
2017-11-28T03:29:06.312 Failed to renew host lock lease: Another host has acquired the lease. The last successful renewal completed at 2017-11-28T03:28:54.229Z (12080 milliseconds ago) with a duration of 780 milliseconds.
Seems to freeze up our host and we have to restart the function to get it working again.
A few years later and still getting this when debugging a brand new web job.
The WebJob has two functions: a schedule/timer trigger that runs once a day, as well as a queue-based trigger for reprocessing prior days. Both functions have [Singleton] applied, with the scheduled function having no singleton params and the queue function using a string value. The intent is to permit the functions to run at the same time, but only one instance of each function can be executing at any given time.
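The arrangement described above might be sketched like this (function names, the queue name, and the binding expression are all illustrative, not taken from the actual project):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class Jobs
{
    // Unscoped singleton: at most one invocation anywhere, host-wide.
    [Singleton]
    public static Task DailyRun([TimerTrigger("0 0 1 * * *")] TimerInfo timer)
        => ProcessDayAsync(DateTime.UtcNow.Date);

    // Scoped singleton: one invocation per scope value, so this can run
    // alongside DailyRun, while two messages for the same day serialize.
    [Singleton("{QueueTrigger}")]
    public static Task ReprocessDay([QueueTrigger("reprocess-days")] string day)
        => ProcessDayAsync(DateTime.Parse(day));

    // Hypothetical shared worker.
    private static Task ProcessDayAsync(DateTime day) => Task.CompletedTask;
}
```

The scope string on the second attribute is what makes the two locks independent; each distinct scope value gets its own lease blob.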
When using the queue-based trigger to run through this year's days to seed the data backlog, the WebJob will crash with these exception details:
Microsoft.WindowsAzure.Storage.StorageException
HResult=0x80131500
Message=.
Source=Microsoft.WindowsAzure.Storage
StackTrace:
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result)
at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.EndRenewLease(IAsyncResult asyncResult)
at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass7.<CreateCallbackVoid>b__5(IAsyncResult ar)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.BlobLeaseDistributedLockManager.SingletonLockHandle.<RenewAsync>d__17.MoveNext()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.<ExecuteAsync>d__4.MoveNext()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.<RunAsync>d__14.MoveNext()
at Microsoft.Azure.WebJobs.Host.Timers.WebJobsExceptionHandler.<>c__DisplayClass3_0.<OnUnhandledExceptionAsync>b__0()
at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
Inner Exception 1:
WebException: The remote server returned an error: (409) Conflict.
I'm going to change the batch size down to 1 from the default. It's a singleton, so I would have expected the processing and the dequeueing to be serial, but for some reason it is dequeueing 8 or 10 or something and probably creating more of a chance of some kind of lease issue.
@brettsam - will there be any fix coming for this? And specifically a fix that does not require updating to the 3.x SDK? (The package reference is a breaking change to usability, the non-declarative SDK file format makes project task automation more cumbersome and is asking for web-site-project-era trouble of mystery files, and config settings don't belong in a place where we cannot also give instructions about what the values do and what values are permitted.)
To be clear -- there's not a fix that we can realistically apply here without a major change to how we invoke functions. We use timers to renew singleton locks and those timers run in the same process as the WebJob. If your WebJob functions use up enough resources such that the timers cannot fire in a reasonable time, the renewals will not occur and you'll end up with this error.
It sounds like your issue is with [Singleton] -- applying this to your function should restrict it to only a single invocation running at a time. Can you open a separate issue for that?
The [Singleton] is applied to both functions in the WebJob, and it's not a concurrency problem. Only one function or function instance is running at a time. The only "problem" with the singleton is that I didn't expect it to pull a batch size > 1 from a queue. Knocking the batch size down to 1 and re-running the entire test did not produce a 409 error, and that error had been easily repeatable when running the test with the default batch size.
Sounds like you have a timer vs. timeout race with the lock renewal. Why not just set the timer to fire at an interval that will give renewal enough time to act if there is a moderate amount of resource contention?
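In the WebJobs SDK 3.x those windows are exposed through SingletonOptions (shown later in this thread's startup logs); a hedged sketch of widening them, assuming a v3 HostBuilder setup and illustrative values:

```csharp
using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Sketch: lengthening the lock periods gives the in-process renewal
// timer more headroom before a lease can expire under contention.
var builder = new HostBuilder()
    .ConfigureWebJobs(b => b.AddAzureStorageCoreServices())
    .ConfigureServices(services =>
    {
        services.Configure<SingletonOptions>(options =>
        {
            options.LockPeriod = TimeSpan.FromSeconds(30);         // default 00:00:15
            options.ListenerLockPeriod = TimeSpan.FromSeconds(60); // default 00:01:00
        });
    });
```

This doesn't remove the underlying race, but a longer lease period means more renewal attempts can be missed before another host can take the lease.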
@hisuwh - let's continue the discussion here as your issue doesn't apply to Functions (the primary host lock is a Functions, not WebJobs concept). Do you have any logs you can share? Or would you rather do this via email, where we can get some better details, including code? If so, feel free to email me -- my microsoft email is in my github profile.
The only relevant log I can see is the one I shared before (here with some more context):
[09:34:13 INF] Job host started
Application started. Press Ctrl+C to shut down.
Hosting environment: Production
Content root path: /app/
[09:36:14 INF] The next 5 occurrences of the 'CleanupFunction.Cleanup' schedule (Cron: '0 0 1 * * *') will be:
10/03/2019 01:00:00+00:00 (10/03/2019 01:00:00Z)
10/04/2019 01:00:00+00:00 (10/04/2019 01:00:00Z)
10/05/2019 01:00:00+00:00 (10/05/2019 01:00:00Z)
10/06/2019 01:00:00+00:00 (10/06/2019 01:00:00Z)
10/07/2019 01:00:00+00:00 (10/07/2019 01:00:00Z)
[09:44:18 ERR] Singleton lock renewal failed for blob '6aaf581a4d74d37580b1ebda440a1171/Company.Application.AppProcess.CleanupFunction.Cleanup.Listener' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2019-10-02T09:41:14.402Z (184439 milliseconds ago) with a duration of 26 milliseconds. The lease period was 60000 milliseconds.
Unhandled Exception: Microsoft.WindowsAzure.Storage.StorageException: The lease ID specified did not match the lease ID for the blob.
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
at Microsoft.Azure.WebJobs.Host.StorageBaseDistributedLockManager.SingletonLockHandle.RenewAsync(ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host.Storage\Singleton\BlobLeaseDistributedLockManager.cs:line 347
at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.ExecuteAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Singleton\SingletonManager.cs:line 332
at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.RunAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\TaskSeriesTimer.cs:line 147
at System.Threading.Thread.ThreadMain_ThreadStart()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
As you can see, it blows up before any functions have started running. So, as I said, I'm not sure that any code I'm in control of can be impacting this.
Is this WebJobs v3? Can you share your function and startup code? Or email it to me?
Yes, Microsoft.Azure.WebJobs version 3.0.13 to be precise.
This is my function:
public class CleanupFunction
{
    private const string OneAmEveryDay = "0 0 1 * * *";

    private readonly IFileCleanup fileCleanup;
    private readonly ILogger logger;

    /// <summary>
    /// Initializes a new instance of the <see cref="CleanupFunction"/> class.
    /// </summary>
    /// <param name="fileCleanup">The injected <see cref="IFileCleanup"/> instance.</param>
    public CleanupFunction(IFileCleanup fileCleanup)
    {
        this.fileCleanup = fileCleanup;
        this.logger = Log.ForContext<CleanupFunction>();
    }

    /// <summary>
    /// Cleans up existing blobs that are older than 90 days.
    /// </summary>
    /// <param name="timer">The Timer Info.</param>
    /// <returns>A Task.</returns>
    public async Task Cleanup([TimerTrigger(OneAmEveryDay)]TimerInfo timer)
    {
        try
        {
            using (this.logger.BeginTimedOperation("Blob cleanup"))
            {
                await this.fileCleanup.Cleanup();
            }
        }
        catch (Exception e)
        {
            this.logger.Error(e, "Error cleaning up blobs");
        }
    }
}
This is my Program.cs file:
public sealed class Program
{
    /// <summary>
    /// The main function.
    /// </summary>
    public static void Main()
    {
        var builder = new HostBuilder();
        builder.ConfigureWebJobs(b =>
        {
            b.AddAzureStorageCoreServices();
            b.AddAzureStorage();
            b.AddTimers();
        });
        builder.ConfigureAppConfiguration((context, b) =>
        {
            b.AddJsonFile("appsettings.json");
            b.AddUserSecrets<Program>();
            b.AddEnvironmentVariables();
        });
        builder.ConfigureLogging((context, b) =>
        {
            ConfigureSerilog(context);
            b.AddSerilog();
        });
        builder.ConfigureServices((context, services) =>
        {
            var storageConfig = new StorageConfiguration
            {
                ConnectionString = context.Configuration["AzureWebJobsStorage"],
                Prefix = context.Configuration["QueuePrefix"]
            };
            services.AddSingleton<IJobActivator, JobActivator>();
            services.AddTransient<CleanupFunction, CleanupFunction>();
            services.AddScoped<IFileCleanup, FileCleanup>();
            // ...
        });

        using (var host = builder.Build())
        {
            Log.Information("Service starting");
            try
            {
                host.Run();
            }
            catch (Exception e)
            {
                Log.Fatal(e, "A fatal error occurred!");
                throw;
            }
            Log.Information("Service stopping");
        }
    }

    private static void ConfigureSerilog(HostBuilderContext context)
    {
        // ...
    }
}
We use timers to renew singleton locks and those timers run in the same process as the WebJob. If your WebJob functions use up enough resources such that the timers cannot fire in a reasonable time, the renewals will not occur and you'll end up with this error.
@brettsam - I glossed over this before, but EEK! Timers in Windows services and other long-running processes have always been problematic in .NET, and it wasn't just from using "the wrong" timer class. You should strongly consider rewriting that to use something with more reliable and predictable behavior.
@StingyJack -- Yes, there's no great solution other than moving off of Timers completely. But you need something to renew the lease, or to move the lease renewal out-of-proc, or switch to using some external service to manage locks. All are troublesome, and since this has always been resolvable by the customer with some code changes, we've just punted it over-and-over. Any suggestions to improvements here are welcome :-)
@hisuwh -- I see you're registering some custom services. What if you back them all out so you're only registering the bare minimum? Those constructors are most likely being called (you can set breakpoints to see), but without knowing exactly what all of your services are doing during startup, it's tough for me to diagnose. If you're able to zip up the project and mail it to me, I'd be glad to try to repro this myself as well. I'd also recommend running with await host.RunAsync() rather than Run(), but I'm not sure if that's the entire issue.
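The RunAsync suggestion would look something like this in Program.cs (a sketch; the builder configuration from the earlier snippet is elided):

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

public sealed class Program
{
    public static async Task Main()
    {
        var builder = new HostBuilder();
        // ... ConfigureWebJobs / ConfigureLogging / ConfigureServices as before ...

        using (var host = builder.Build())
        {
            // Await the host instead of blocking the main thread with host.Run().
            await host.RunAsync();
        }
    }
}
```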
@brettsam I can't really send you the entire source code. I will give your second suggestion a try
Related to Azure/azure-functions-host#1864?
I'm seeing this problem with a blob trigger in a scaled-up (8 pod) Azure Node Functions runtime Docker image in AKS. It runs fine for several days, then I see this error message and all 8 pods freeze up for over 24 hours. Restarting 1 pod had no effect. Restarting another 3 pods, effectively at random, got everything moving again for all 8 pods. Maybe I released a lock? The backlog included hundreds of unprocessed blobs, so I don't think the problem could be a single lock on even a small subset of those blobs.
Anyone have any idea what may be causing this? We're still seeing it. It appears to occur pretty consistently and once we see the error, processing in our 6 concurrently running containers stops, usually for many hours.
@derekrprice -- It's caused by thread starvation. The timers that renew locks are not able to get any processor time, so they fail.
The main recommendations are to make sure everything is async everywhere possible and/or reduce function throughput (the specifics depend on what your triggers are and what your functions are doing).
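As a sketch of the "async everywhere" recommendation above (the trigger path, function name, and save helper are all hypothetical, WebJobs-style C#):

```csharp
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class LogProcessor
{
    // Awaiting all I/O keeps thread-pool threads free, which gives the
    // host's in-process lease-renewal timers a chance to fire under load.
    public static async Task Run(
        [BlobTrigger("device-logs/{name}")] Stream input,
        string name)
    {
        using (var reader = new StreamReader(input))
        {
            string content = await reader.ReadToEndAsync();
            await SaveToDbAsync(name, content); // hypothetical NoSQL write
        }
    }

    private static Task SaveToDbAsync(string name, string content)
        => Task.CompletedTask; // placeholder for the real database call
}
```

The key point is that nothing here calls .Result or .Wait(); any synchronous blocking on I/O ties up a thread that the renewal timer may need.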
@brettsam Our throughput is fairly high but not ridiculously so -- a few thousand invocations a day -- but thread starvation doesn't make sense given the pattern I am seeing. This is a fairly simple log processor based on blob triggers. IoT devices periodically upload logs, which take the Functions app only a few seconds to process and save as JSON to a NoSQL DB. We have six pods running concurrently, and all I/O calls are asynchronous.
The processor appears to run anywhere from an hour to about a day, then we see this error message and everything stops. None of the six apps are processing anything at this point, so how could a thread be starved? If I restart the 6 pods, processing resumes on the backlog immediately. Restarting one doesn't always do the trick, but restarting all of the pods invariably does, like one of the apps was holding a lock but not doing anything.
Maybe you're seeing something unrelated -- what version of Microsoft.Azure.WebJobs.Extensions.Storage are you using?
We're using whatever version is installed in the latest Azure Functions 2.0 JS Runtime Docker image. How would I extract that version?
I suspect you're seeing something unrelated to this issue. @soninaren -- is there a way that someone using bundles can move their version of Extensions.Storage forward to the latest?
The updated extension bundle should ship out this week. It should stay up to date from now on. There is no way to move the version of an extension ahead of the bundle if you are using extension bundles.
I don't know why, but this error just started happening for me, even though my function has been running fine for 8 months. It keeps getting the same error no matter what I try.
Log:
[02/01/2020 20:31:45] Function started (Id=c159a1d1-ceb0-4d14-91e1-e5d76492c650)
[02/01/2020 20:31:45] Executing 'BlobQueueListener' (Reason='New queue message detected on 'dummyqueue'.', Id=c159a1d1-ceb0-4d14-91e1-e5d76492c650)
[02/01/2020 20:31:45] Trigger Details: MessageId: 6456e8e7-7e49-441d-9339-7525404df547, DequeueCount: 5, InsertionTime: 02/01/2020 20:19:52 +00:00
[02/01/2020 20:32:00] A ScriptHost error has occurred
[02/01/2020 20:32:00] Exception while executing function: BlobQueueListener. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict.
[02/01/2020 20:32:00] Exception while executing function: BlobQueueListener
[02/01/2020 20:32:00] Exception while executing function: BlobQueueListener. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict.
[02/01/2020 20:32:00] Function completed (Failure, Id=c159a1d1-ceb0-4d14-91e1-e5d76492c650, Duration=15135ms)
[02/01/2020 20:32:00] Executed 'BlobQueueListener' (Failed, Id=c159a1d1-ceb0-4d14-91e1-e5d76492c650)
[02/01/2020 20:32:00] mscorlib: Exception while executing function: BlobQueueListener. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict.
[02/01/2020 20:32:00] Function had errors. See Azure WebJobs SDK dashboard for details. Instance ID is 'c159a1d1-ceb0-4d14-91e1-e5d76492c650'
[02/01/2020 20:32:00] mscorlib: Exception while executing function: BlobQueueListener. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict.
[02/01/2020 20:32:00] Message has reached MaxDequeueCount of 5. Moving message to queue 'dummyqueue-poison'.
@brettsam is there any update on this error?
I changed the Queue to a Service Bus queue and I'm still getting the same error.
@brettsam I changed my application to run the host asynchronously, but I am still seeing this issue. My service is crashing and restarting about 12 times a day. These are the logs I'm getting:
[11:42:21 INF] Logging configured
[11:42:22 INF] Application starting
[11:42:22 INF] LoggerFilterOptions
{
"MinLevel": "Information",
"Rules": [
{
"ProviderName": "Serilog.Extensions.Logging.SerilogLoggerProvider",
"CategoryName": null,
"LogLevel": "Trace",
"Filter": null
}
]
}
[11:42:22 INF] FunctionResultAggregatorOptions
{
"BatchSize": 1000,
"FlushTimeout": "00:00:30",
"IsEnabled": true
}
[11:42:22 INF] QueuesOptions
{
"BatchSize": 16,
"NewBatchThreshold": 8,
"MaxPollingInterval": "00:01:00",
"MaxDequeueCount": 5,
"VisibilityTimeout": "00:00:00"
}
[11:42:22 INF] SingletonOptions
{
"LockPeriod": "00:00:15",
"ListenerLockPeriod": "00:01:00",
"LockAcquisitionTimeout": "10675199.02:48:05.4775807",
"LockAcquisitionPollingInterval": "00:00:05",
"ListenerLockRecoveryPollingInterval": "00:01:00"
}
[11:42:22 INF] BlobsOptions
{
"CentralizedPoisonQueue": false
}
[11:42:22 INF] Starting JobHost
[11:42:22 INF] Found the following functions:
Company.Application.CleanupFunction.Cleanup
Company.Application.ApplicationFunction.ProcessQueueMessage
[11:42:23 INF] Job host started
Application started. Press Ctrl+C to shut down.
Hosting environment: Production
Content root path: /app/
[11:49:23 INF] The next 5 occurrences of the 'Cleanup' schedule (Cron: '0 0 1 * * *') will be:
03/05/2020 01:00:00+00:00 (03/05/2020 01:00:00Z)
03/06/2020 01:00:00+00:00 (03/06/2020 01:00:00Z)
03/07/2020 01:00:00+00:00 (03/07/2020 01:00:00Z)
03/08/2020 01:00:00+00:00 (03/08/2020 01:00:00Z)
03/09/2020 01:00:00+00:00 (03/09/2020 01:00:00Z)
[11:56:28 ERR] Singleton lock renewal failed for blob '6aaf581a4d74d37580b1ebda440a1171/Company.Application.CleanupFunction.Cleanup.Listener' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-03-04T11:53:23.638Z (184838 milliseconds ago) with a duration of 23 milliseconds. The lease period was 60000 milliseconds. Unhandled exception. Microsoft.WindowsAzure.Storage.StorageException: The lease ID specified did not match the lease ID for the blob.
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
at Microsoft.Azure.WebJobs.Host.StorageBaseDistributedLockManager.SingletonLockHandle.RenewAsync(ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host.Storage\Singleton\BlobLeaseDistributedLockManager.cs:line 348
at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.ExecuteAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Singleton\SingletonManager.cs:line 334
at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.RunAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\TaskSeriesTimer.cs:line 147
at Microsoft.Azure.WebJobs.Host.Timers.WebJobsExceptionHandler.<>c__DisplayClass3_0.<OnUnhandledExceptionAsync>b__0() in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\WebJobsExceptionHandler.cs:line 55
at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
at System.Threading.ThreadHelper.ThreadStart()
Request Information
RequestID:2f884b2f-b01e-002b-531b-f2c5bd000000
RequestDate:Wed, 04 Mar 2020 11:56:28 GMT
StatusMessage:The lease ID specified did not match the lease ID for the blob.
ErrorCode:LeaseIdMismatchWithLeaseOperation
ErrorMessage:The lease ID specified did not match the lease ID for the blob.
RequestId:2f884b2f-b01e-002b-531b-f2c5bd000000
Time:2020-03-04T11:56:28.4660786Z
We're seeing this too. There seems to always be a spike in the CPU right around when this happens (it goes to 100% and stays there) but I'm not sure if that's cause or effect.
That's likely the cause -- these conflicts seem to always arise when function code is causing thread-pool starvation. The timers we use to renew locks don't fire on time and the lease can be lost, causing this error log. The recommendation is to evaluate your code for issues that may be causing thread pool starvation and try to remove them.
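The starvation mechanism described above can be simulated without threads: if the renewal timer and user code share one saturated worker, a long blocking task pushes the renewal past the lease deadline. A simplified single-worker event-queue model (illustrative only):

```python
# Illustrative model of thread-pool starvation delaying a lease renewal.
# One worker runs tasks in scheduled order; a long blocking task makes
# the "renew" task complete after the lease deadline.

def renew_completion_time(tasks):
    """tasks: list of (scheduled_start, duration, name) run on ONE worker.

    Returns the wall-clock time at which the 'renew' task finishes.
    """
    clock = 0.0
    for start, duration, name in sorted(tasks):
        clock = max(clock, start)  # a task cannot start before it is scheduled
        clock += duration          # the single worker is busy until it finishes
        if name == "renew":
            return clock
    return float("inf")  # renewal never ran

# Healthy: a 1s renewal scheduled at t=30 finishes at t=31, well inside a
# 60s lease. Starved: a 45s blocking task scheduled at t=20 occupies the
# worker until t=65, so the renewal completes at t=66 -- after the lease
# expired and another instance could have taken it.
```

This is why the fix is usually in user code (removing sync-over-async blocking) rather than in the lease logic itself.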
I've been seeing this issue occur with low CPU usage. Our TimerTrigger will start running on pod A, and then pod B will somehow grab the lock, which causes pod A to restart. I'm honestly unsure of what could be going wrong here. The previous run took less than a second.
Description of the issue
When you run timer jobs with many worker roles (50+) one of those worker roles gets the 409 error from storage service and gets recycled. When this happens the exception message says "StatusMessage:The lease ID specified did not match the lease ID for the blob. ErrorCode:LeaseIdMismatchWithLeaseOperation".
Steps to reproduce
10 timer jobs with 50-150 worker role instances. Roughly every 30 minutes, one of the roles starts recycling due to a 409 (Conflict) error.
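The `LeaseIdMismatchWithLeaseOperation` code in this scenario is consistent with one instance losing the lease and another acquiring it: the old holder's renewal then carries a stale lease ID, which the storage service rejects with 409. A hypothetical sketch of that protocol (class and method names are illustrative, not Azure SDK APIs):

```python
# Hypothetical model of blob-lease contention producing
# 409 LeaseIdMismatchWithLeaseOperation. Not an Azure SDK API.

import uuid


class BlobLease:
    """One blob's lease state: at most one current lease ID at a time."""

    def __init__(self):
        self.holder_lease_id = None

    def acquire(self):
        """Grant a fresh lease ID, superseding any expired one."""
        self.holder_lease_id = str(uuid.uuid4())
        return self.holder_lease_id

    def renew(self, lease_id):
        """Return (status, error_code); 409 when the lease ID is stale."""
        if lease_id == self.holder_lease_id:
            return 200, None
        return 409, "LeaseIdMismatchWithLeaseOperation"
```

With 50+ instances competing for the same singleton blob, any missed renewal window lets a second instance acquire, after which the first instance's next renewal fails exactly as in the traces below.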
Expected behavior
No conflict error from storage service.
Actual behavior
Getting 409 error.
Known workarounds
None.
Related information
Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (409) Conflict. ---> System.Net.WebException: The remote server returned an error: (409) Conflict.
at Microsoft.WindowsAzure.Storage.Shared.Protocol.HttpResponseParsers.ProcessExpectedStatusCodeNoException[T](HttpStatusCode expectedStatusCode, HttpStatusCode actualStatusCode, T retVal, StorageCommandBase`1 cmd, Exception ex) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\Common\Shared\Protocol\HttpResponseParsers.Common.cs:line 50
at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.<>c__DisplayClass33.<RenewLeaseImpl>b__32(RESTCommand`1 cmd, HttpWebResponse resp, Exception ex, OperationContext ctx) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Blob\CloudBlob.cs:line 3319
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndGetResponse[T](IAsyncResult getResponseResult) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 306
--- End of inner exception stack trace ---
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 60
at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass4.<CreateCallbackVoid>b__3(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 115
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.<ExecuteAsync>d__36.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.<RunAsync>d__d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
Request Information
RequestID:d8d712e5-0001-012c-651a-0dec95000000
RequestDate:Mon, 12 Sep 2016 17:26:41 GMT
StatusMessage:The lease ID specified did not match the lease ID for the blob.
ErrorCode:LeaseIdMismatchWithLeaseOperation