Azure / azure-webjobs-sdk

Azure WebJobs SDK
MIT License

Singleton Listeners causing 409 (Conflict) error #822

Open jaekim0908 opened 8 years ago

jaekim0908 commented 8 years ago


When you run timer jobs with many worker roles (50+), one of those worker roles gets a 409 error from the storage service and gets recycled. When this happens, the exception message says "StatusMessage:The lease ID specified did not match the lease ID for the blob. ErrorCode:LeaseIdMismatchWithLeaseOperation".

Provide the steps required to reproduce the problem

10 timer jobs with 50-150 worker role instances. Roughly every 30 minutes, one of the roles will start recycling due to a 409 (Conflict) error.

Expected behavior

No conflict error from storage service.


Actual behavior

Getting 409 error.

Known workarounds

None.

Related information

Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (409) Conflict. ---> System.Net.WebException: The remote server returned an error: (409) Conflict.
   at Microsoft.WindowsAzure.Storage.Shared.Protocol.HttpResponseParsers.ProcessExpectedStatusCodeNoException[T](HttpStatusCode expectedStatusCode, HttpStatusCode actualStatusCode, T retVal, StorageCommandBase`1 cmd, Exception ex) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\Common\Shared\Protocol\HttpResponseParsers.Common.cs:line 50
   at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.<>c__DisplayClass33.<RenewLeaseImpl>b__32(RESTCommand`1 cmd, HttpWebResponse resp, Exception ex, OperationContext ctx) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Blob\CloudBlob.cs:line 3319
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndGetResponse[T](IAsyncResult getResponseResult) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 306
   --- End of inner exception stack trace ---
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 60
   at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass4.<CreateCallbackVoid>b__3(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 115
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.<ExecuteAsync>d__36.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.<RunAsync>d__d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()
Request Information
RequestID:d8d712e5-0001-012c-651a-0dec95000000
RequestDate:Mon, 12 Sep 2016 17:26:41 GMT
StatusMessage:The lease ID specified did not match the lease ID for the blob.
ErrorCode:LeaseIdMismatchWithLeaseOperation

stephlocke commented 7 years ago

I'm getting this issue with Azure Functions v0.6, using blob triggers, when running multiple S2 instances (can't use a dynamic service plan due to region availability).

2016-09-26T13:15:11.513 An unhandled exception has occurred. Host is shutting down.
2016-09-26T13:15:11.513 Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (409) Conflict. ---> System.Net.WebException: The remote server returned an error: (409) Conflict.
   at Microsoft.WindowsAzure.Storage.Shared.Protocol.HttpResponseParsers.ProcessExpectedStatusCodeNoException[T](HttpStatusCode expectedStatusCode, HttpStatusCode actualStatusCode, T retVal, StorageCommandBase`1 cmd, Exception ex)
   at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.<>c__DisplayClass33.<RenewLeaseImpl>b__32(RESTCommand`1 cmd, HttpWebResponse resp, Exception ex, OperationContext ctx)
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndGetResponse[T](IAsyncResult getResponseResult)
   --- End of inner exception stack trace ---
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result)
   at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass4.<CreateCallbackVoid>b__3(IAsyncResult ar)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.<ExecuteAsync>d__37.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.<RunAsync>d__d.MoveNext()
Request Information
RequestID:e3b20f97-0001-0019-22f7-17efc9000000
RequestDate:Mon, 26 Sep 2016 13:15:03 GMT
StatusMessage:The lease ID specified did not match the lease ID for the blob.
ErrorCode:LeaseIdMismatchWithLeaseOperation
2016-09-26T13:15:11.950 Stopping Host
mathewc commented 7 years ago

@brettsam this is an important one to fix ASAP. @jaekim0908 is internal and can help us with additional repro details if needed, and he can help test the fix once it's ready.

brettsam commented 7 years ago

@stephlocke -- are you seeing this regularly? If so, would you be able to turn on Verbose logging briefly to see if we can catch it again? Docs on the settings are here: https://github.com/Azure/azure-webjobs-sdk-script/wiki/host.json

Your settings would look like this:

{
    "tracing": {      
      "consoleLevel": "verbose",
      "fileLoggingMode": "always"
    }
}
stephlocke commented 7 years ago

Will send more stuff along, but since pushing the new host tracing I'm getting a lot of this:

2016-10-03T13:52:39.526 Host lock lease renewed.
2016-10-03T13:52:38.479 Host instance '5bb77d899e9648e5d0604a9eebcb15e1' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:52:39.389 Host instance 'df677cfbf13e4b3330b3566dba85bd22' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:52:38.540 Host instance 'f84d303d334bcf4bf213ba8be2b0c052' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:52:39.379 Host instance '4053ca815b15b0baaf2d272c606d5637' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:52:39.035 Host instance '6e241969103c13eca8aaab7550ad6afb' failed to acquire host lock lease: Another host has an active lease.
2016-10-03T13:53:02.493 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.493 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.803 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.803 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.182 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.182 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.666 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.666 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.801 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.801 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.464 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.464 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:03.138 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:03.138 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.545 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.545 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
2016-10-03T13:53:02.858 Waiting for Singleton lock (uketl/WebJobs.Internal.Blobs.Listener)
2016-10-03T13:53:02.858 Unable to acquire Singleton lock (uketl/WebJobs.Internal.Blobs.Listener).
brettsam commented 7 years ago

Thanks -- that's expected. When you have a Singleton trigger (which Timers are -- you don't want the same timer trigger firing on all of your instances), each host will try to acquire a lease on a specific blob to ensure it's the only one running. Since only one host can have the lease, the others log that they don't have it, which is perfectly fine. The Verbose logs log this a lot -- which we actually have a bug tracking to minimize since it's generally not useful.

In this case, though, it may be :-)

I believe the error you're seeing is because one host has a lease and, when it goes to renew that lease, it sees that another host has stolen it away. That shouldn't happen. If it does, it should be very rare and coincide with some kind of issue in Storage where we can't communicate properly with the blob. If you see the 409 StorageException, I'll want to look at the timestamps of those 'Singleton lock' messages to see what may have happened.
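For anyone curious what that looks like at the storage level, here is a minimal sketch of the acquire/renew pattern using the classic Microsoft.WindowsAzure.Storage client. The blob, the intervals, and the error handling are illustrative only, not the SDK's actual implementation:

    using System;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    class SingletonLeaseSketch
    {
        static async Task RunAsync(CloudBlockBlob lockBlob)
        {
            string leaseId;
            try
            {
                // Only one host succeeds here; the others get a 409 and retry later.
                leaseId = await lockBlob.AcquireLeaseAsync(TimeSpan.FromSeconds(15), null);
            }
            catch (StorageException ex) when (ex.RequestInformation?.HttpStatusCode == 409)
            {
                Console.WriteLine("Another host has an active lease.");
                return;
            }

            // The holder must renew well before the lease expires. If renewal is delayed
            // (e.g. by thread starvation) and another host acquires the lease in the
            // meantime, this call fails with 409 LeaseIdMismatchWithLeaseOperation.
            var leaseCondition = AccessCondition.GenerateLeaseCondition(leaseId);
            while (true)
            {
                await Task.Delay(TimeSpan.FromSeconds(7));
                await lockBlob.RenewLeaseAsync(leaseCondition);
            }
        }
    }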

brettsam commented 7 years ago

BTW -- I just noticed you are using Blob triggers, yet I mentioned Timer triggers. Blob triggers are Singletons as well.

stephlocke commented 7 years ago

Here is a slightly older one from the logs whilst I dig out / wait for another one on verbose:

2016-10-01T16:06:10.909 An unhandled exception has occurred. Host is shutting down.
2016-10-01T16:06:10.909 Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (409) Conflict. ---> System.Net.WebException: The remote server returned an error: (409) Conflict.
   at Microsoft.WindowsAzure.Storage.Shared.Protocol.HttpResponseParsers.ProcessExpectedStatusCodeNoException[T](HttpStatusCode expectedStatusCode, HttpStatusCode actualStatusCode, T retVal, StorageCommandBase`1 cmd, Exception ex)
   at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.<>c__DisplayClass33.<RenewLeaseImpl>b__32(RESTCommand`1 cmd, HttpWebResponse resp, Exception ex, OperationContext ctx)
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndGetResponse[T](IAsyncResult getResponseResult)
   --- End of inner exception stack trace ---
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result)
   at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass4.<CreateCallbackVoid>b__3(IAsyncResult ar)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.<ExecuteAsync>d__37.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.<RunAsync>d__d.MoveNext()
Request Information
RequestID:8e7d23b6-0001-001c-15fd-1b1bb6000000
RequestDate:Sat, 01 Oct 2016 16:06:10 GMT
StatusMessage:The lease ID specified did not match the lease ID for the blob.
ErrorCode:LeaseIdMismatchWithLeaseOperation

Not sure if these 409s are a problem with the host or with the storage account when I'm trying to interact with a table:

2016-10-03T13:23:01.652 A ScriptHost error has occurred
2016-10-03T13:23:01.652 The remote server returned an error: (409) Conflict.
2016-10-03T13:23:01.652 Exception while executing function: Functions.blobtoeventhub. mscorlib: Exception has been thrown by the target of an invocation. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict. System: The remote server returned an error: (409) Conflict.
2016-10-03T13:23:01.652 Exception while executing function: Functions.blobtoeventhub
2016-10-03T13:23:01.652 The remote server returned an error: (409) Conflict.
2016-10-03T13:23:01.652 Executed: 'Functions.blobtoeventhub' (Failed)
2016-10-03T13:23:01.652 The remote server returned an error: (409) Conflict.
2016-10-03T13:23:01.652 Function had errors. See Azure WebJobs SDK dashboard for details. Instance ID is 'd6568a31-6b81-49cc-b264-76eba2dd2bdd'
2016-10-03T13:23:01.652 The remote server returned an error: (409) Conflict.
brettsam commented 7 years ago

@stephlocke -- The 409s coming directly from functions are likely because you're trying to create something in storage with a name that already exists. I'm investigating the one that comes from the host itself -- with RenewLeaseCommand in the call stack.

Also, you can remove "fileLoggingMode": "always" from your host.json if you want to save some of your file space. That'll put it back to the default behavior of debugOnly. We now have runtime logs sent to our backend so I am able to see your logs myself and there's no need for you to keep all of those (unless you want them logging all the time). I'll keep checking to see if I can catch this error, but so far I haven't seen it since you turned on verbose logs.
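In other words, something like this (a sketch; debugOnly is the default, so removing the line entirely is equivalent):

{
    "tracing": {
      "consoleLevel": "verbose",
      "fileLoggingMode": "debugOnly"
    }
}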

brettsam commented 7 years ago

@stephlocke -- I realized I actually do have your previous verbose logs when you were seeing this crash, so feel free to remove those settings from your host.json.

What I see happening is:

  1. One of your instances, (A), has the lease on the 'lock blob'.
  2. By design, all of the other instances are periodically checking whether that lease has expired.
  3. For some reason, (A) seems to go silent long enough for the 'lock blob' lease to expire. I see no logs coming from this instance.
  4. Another instance, (B), is able to acquire the lease on the 'lock blob'.
  5. After about 3 minutes of silence, (A) pops back up and attempts to renew the 'lock blob'. Because (B) already has it, the 409 is thrown and the host restarts.

Things are behaving as we would expect, but I'd like to figure out what could cause (A) to go silent and then come back.

stephlocke commented 7 years ago

Depending on what you mean by silence, this could be because the functions may have been processing large files at the time so none may have been writing to the console for a few minutes. A lot of the older files processed were 30mb files that took 30+ mins to process.

brettsam commented 7 years ago

In this case, by 'silence', I mean that I'm not seeing any of the Renewing Singleton lock (functionapp/WebJobs.Internal.Blobs.Listener) logs that our host writes out with every renewal. That should be happening quite often, and then they just stop for 3 minutes.

I did look at this instance and see that the CPU percentage was quite high (> 90%) just before this. I wonder if our lease-renewal operation just wasn't able to fire due to lack of resources. I'll keep investigating to see if I can repro this and if there's a better way to handle it.

stephlocke commented 7 years ago

I'm loath to double my costs by jumping to an S3; I'm filling in a backlog of data, so I'm currently running 10 x S2 instances!

brettsam commented 7 years ago

I'm going to see if I can figure out exactly where we're falling apart here under load and if there's a better way to catch this. I'll track my findings here.

davidebbo commented 7 years ago

For reference, this is also on https://social.msdn.microsoft.com/Forums/en-US/08f85617-1767-45d4-b6f4-d8344a835a6f/stability-problem-for-function-in-uk-west-region?forum=AzureFunctions (thread from @stephlocke).

stephlocke commented 7 years ago

My instance has now gone totally bananas. My live stream from one function is showing logs from the other. I'm thinking that I'm going to have to look at whether I can make a new instance that has less backlog to process from the beginning. Hopefully just having throughput to cope with will reduce the volatility. Obviously it's not GA, so buyer beware, but it's rather frustrating that the EU and US instances worked fine when they were scaled up, and that I mainly seem to be having trouble in just one data centre.

It also seems pretty weird, tbh, that this error now happens so much, most often between midnight and 6am, that I get total downtime during those periods.

brettsam commented 7 years ago

@stephlocke -- one thing you may want to try while we work on a solution is reducing the number of items that your instance dequeues at once. This will obviously slow things down, but it may make them more stable. You can do this by changing queues.batchSize in host.json -- https://github.com/Azure/azure-webjobs-sdk-script/wiki/host.json. Try cutting it in half to 8 and see if that offers any improvement.

Note: even though you're using blobs, the invocation is still managed by a queue in the background, which is why this setting is relevant.
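Something like this in host.json (only the queues section shown; 8 is just a starting point):

{
    "queues": {
      "batchSize": 8
    }
}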

stephlocke commented 7 years ago

I actually had it set to the max of 32 -- it and the scan frequency seemed like the main ways to get a decent amount of work done across 10 machines when I was at peak volume.

Reduced down to 8 for now. Will let you know how it gets on.

stephlocke commented 7 years ago

Still fracked with the reduced queue batch size.

abatishchev commented 7 years ago

I'm facing the same error when debugging a WebJob locally that has several scheduled singleton functions.

brettsam commented 7 years ago

@abatishchev -- some questions:

  1. How exactly are you using Singleton here? I'm assuming you're using [Singleton("key")] on multiple functions?
  2. How long do your functions run?
  3. Are they CPU-intensive?
abatishchev commented 7 years ago
  1. Both [Singleton] and [Singleton("{prop}")] (bound from one of the function arguments; see the sketch below). The former have been in use for a while now and the latter were introduced recently, so I suspect them. Luckily for us this error doesn't happen in Azure, only during local debugging. But it occurs randomly and kills the console host, which is super annoying.
  2. From a few minutes to a few days (scans through 150M records in a DB and applies business logic).
  3. Yes, some of them are both I/O-intensive (SQL Azure, HTTP calls) and CPU-intensive, especially the main one, the same one I'm currently debugging.
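For reference, a minimal sketch of those two usages (the function names, the queue name, and the Order type are made up for illustration; this is not the actual code):

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;   // TimerTrigger comes from Microsoft.Azure.WebJobs.Extensions

    public class SingletonSketch
    {
        // Scope-less singleton: at most one invocation of this function runs at a
        // time across all hosts sharing the same storage account.
        [Singleton]
        public static async Task NightlyScan([TimerTrigger("0 0 1 * * *")] TimerInfo timer)
        {
            await Task.Delay(TimeSpan.FromSeconds(1)); // placeholder for real work
        }

        // Scoped singleton: the lock is per bound {OrderId} value, so different orders
        // can run in parallel while invocations for the same order are serialized.
        [Singleton("{OrderId}")]
        public static async Task ProcessOrder([QueueTrigger("orders")] Order order)
        {
            await Task.Delay(TimeSpan.FromSeconds(1)); // placeholder for real work
        }
    }

    public class Order
    {
        public string OrderId { get; set; }
    }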
stephlocke commented 7 years ago

On my ongoing stability issue: I'm not sure that having more than 2m blobs processed and sitting around in archives is good for the performance of the indexing etc., so I'm running an archiving process to delete everything older than a fortnight, which takes off about 1m blobs and 2m blob read receipts from the function. I'll then do it in finer grains until the amount is fairly small, but my backlog extends about a week right now. Will let you know whether this helps.

Update 2016-10-19: Turns out deleting that many blobs was a bigger technical challenge than anticipated. It's still ongoing now (via a Function!), but for the past 24hrs my Functions have been up the entire time. So nice to see it processing properly!

brettsam commented 7 years ago

@stephlocke -- We have some proposed fixes that may improve your situation, but I'd like to make sure I can reliably repro what you're seeing to ensure our approach will work. Would you mind sending me a direct email so we can get some more details on your function(s)? You can find my email on my profile -- https://github.com/brettsam.

MikeStall commented 7 years ago

I filed https://github.com/Azure/azure-webjobs-sdk/issues/899 to track a more general related issue.

mathewc commented 7 years ago

Clearing the milestone so we can retriage/prioritize this.

alohaninja commented 6 years ago

seeing this 409 LeaseIdMismatchWithLeaseOperation regularly with our blob trigger. We have very low volume and only (2) instances.

2017-11-28T03:28:58.536 Host lock lease acquired by instance ID '66525d2801dbf933e2e11518b127855f'.
info: Host.General[0]
      Host lock lease acquired by instance ID '66525d2801dbf933e2e11518b127855f'.
2017-11-28T03:29:06.300 A ScriptHost error has occurred
2017-11-28T03:29:06.300 System.InvalidOperationException : Singleton lock renewal failed for blob 'locks/my-function/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 0001-01-01T00:00:00Z (-2147483648 milliseconds ago) with a duration of 0 milliseconds. The lease period was 15000 milliseconds.
2017-11-28T03:29:06.309      'locks/my-function/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 0001-01-01T00:00:00Z (-2147483648 milliseconds ago) with a duration of 0 milliseconds. The lease period was 15000 milliseconds.
2017-11-28T03:29:06.312 Failed to renew host lock lease: Another host has acquired the lease. The last successful renewal completed at 2017-11-28T03:28:54.229Z (12080 milliseconds ago) with a duration of 780 milliseconds.

Seems to freeze up our host and we have to restart the function to get it working again.

StingyJack commented 5 years ago

A few years later and still getting this when debugging a brand new web job.

The WebJob has two functions: a schedule/timer trigger that runs once a day, and a queue-based trigger for reprocessing prior days. Both functions have [Singleton] applied, with the scheduled function having no singleton params and the queue function using a string value. The intent is to permit the two functions to run at the same time, but only one instance of each function can be executing at any given time.

When using the queue-based trigger to run through this year's days to seed the data backlog, the WebJob will crash with these exception details:

Microsoft.WindowsAzure.Storage.StorageException
  HResult=0x80131500
  Message=.
  Source=Microsoft.WindowsAzure.Storage
  StackTrace:
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result)
   at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.EndRenewLease(IAsyncResult asyncResult)
   at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass7.<CreateCallbackVoid>b__5(IAsyncResult ar)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.BlobLeaseDistributedLockManager.SingletonLockHandle.<RenewAsync>d__17.MoveNext()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.<ExecuteAsync>d__4.MoveNext()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.<RunAsync>d__14.MoveNext()
   at Microsoft.Azure.WebJobs.Host.Timers.WebJobsExceptionHandler.<>c__DisplayClass3_0.<OnUnhandledExceptionAsync>b__0()
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()

Inner Exception 1:
WebException: The remote server returned an error: (409) Conflict.

I'm going to change the batch size down to 1 from the default. It's a singleton, so I would have expected the processing and the dequeueing to be serial, but for some reason it is dequeueing 8 or 10 or something and probably creating more of a chance to have some kind of lease issue.
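For anyone else still on the 2.x SDK, that change looks roughly like this (a sketch of the classic JobHostConfiguration startup; values are illustrative):

    using Microsoft.Azure.WebJobs;

    class Program
    {
        static void Main()
        {
            var config = new JobHostConfiguration();
            config.UseTimers();             // requires Microsoft.Azure.WebJobs.Extensions
            config.Queues.BatchSize = 1;    // dequeue one message at a time (default is 16)

            new JobHost(config).RunAndBlock();
        }
    }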

@brettsam - will there be any fix coming for this? And specifically a fix that does not require updating to the 3.x SDK? (Package references are a breaking change to usability, the non-declarative SDK file format makes project task automation more cumbersome and is asking for web-site-project-era trouble of mystery files, and config settings don't belong in a place where we cannot also give instructions about what the values do and what values are permitted.)

brettsam commented 5 years ago

To be clear -- there's not a fix that we can realistically apply here without a major change to how we invoke functions. We use timers to renew singleton locks and those timers run in the same process as the WebJob. If your WebJob functions use up enough resources such that the timers cannot fire in a reasonable time, the renewals will not occur and you'll end up with this error.

It sounds like your issue is with [Singleton] -- applying this to your function should restrict it to only a single invocation running at a time. Can you open a separate issue for that?

StingyJack commented 5 years ago

The [Singleton] is applied to both functions in the WebJob, and it's not a concurrency problem. Only one function or function instance is running at a time. The only "problem" with the singleton is that I didn't expect it to pull a batch size > 1 from a queue. Knocking the batch size down to 1 and re-running the entire test did not produce a 409 error, and that error had been easily repeatable when running the test with the default batch size.

Sounds like you have a timer vs. timeout race with the lock renewal. Why not just set the timer to fire at an interval that gives renewal enough time to act if there is a moderate amount of resource contention?
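For what it's worth, the lock periods that drive those renewals are already configurable. A sketch on the 2.x SDK (values are illustrative; blob leases only allow 15-60 second or infinite durations, and the defaults are 15s/60s):

    using System;
    using Microsoft.Azure.WebJobs;

    class Program
    {
        static void Main()
        {
            var config = new JobHostConfiguration();
            // Widen the singleton lock periods so the renewal timer has more headroom under load.
            config.Singleton.LockPeriod = TimeSpan.FromSeconds(60);
            config.Singleton.ListenerLockPeriod = TimeSpan.FromSeconds(60);

            new JobHost(config).RunAndBlock();
        }
    }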

brettsam commented 4 years ago

@hisuwh - let's continue the discussion here as your issue doesn't apply to Functions (the primary host lock is a Functions, not WebJobs concept). Do you have any logs you can share? Or would you rather do this via email, where we can get some better details, including code? If so, feel free to email me -- my microsoft email is in my github profile.

hisuwh commented 4 years ago

The only relevant log I can see is the one I shared before (here with some more context):

[09:34:13 INF] Job host started
Application started. Press Ctrl+C to shut down.
Hosting environment: Production
Content root path: /app/
[09:36:14 INF] The next 5 occurrences of the 'CleanupFunction.Cleanup' schedule (Cron: '0 0 1 * * *') will be:
10/03/2019 01:00:00+00:00 (10/03/2019 01:00:00Z)
10/04/2019 01:00:00+00:00 (10/04/2019 01:00:00Z)
10/05/2019 01:00:00+00:00 (10/05/2019 01:00:00Z)
10/06/2019 01:00:00+00:00 (10/06/2019 01:00:00Z)
10/07/2019 01:00:00+00:00 (10/07/2019 01:00:00Z)

[09:44:18 ERR] Singleton lock renewal failed for blob '6aaf581a4d74d37580b1ebda440a1171/Company.Application.AppProcess.CleanupFunction.Cleanup.Listener' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2019-10-02T09:41:14.402Z (184439 milliseconds ago) with a duration of 26 milliseconds. The lease period was 60000 milliseconds.

Unhandled Exception: Microsoft.WindowsAzure.Storage.StorageException: The lease ID specified did not match the lease ID for the blob.
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
   at Microsoft.Azure.WebJobs.Host.StorageBaseDistributedLockManager.SingletonLockHandle.RenewAsync(ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host.Storage\Singleton\BlobLeaseDistributedLockManager.cs:line 347
   at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.ExecuteAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Singleton\SingletonManager.cs:line 332
   at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.RunAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\TaskSeriesTimer.cs:line 147
   at System.Threading.Thread.ThreadMain_ThreadStart()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()

As you can see it blows up before any functions have started running. So as I said I'm not sure that any code I'm in control of can be impacting this.

brettsam commented 4 years ago

Is this WebJobs v3? Can you share your function and startup code? Or email it to me?

hisuwh commented 4 years ago

Yes, Microsoft.Azure.WebJobs version 3.0.13 to be precise.

This is my function:

    public class CleanupFunction
    {
        private const string OneAmEveryDay = "0 0 1 * * *";
        private readonly IFileCleanup fileCleanup;
        private readonly ILogger logger;

        /// <summary>
        /// Initializes a new instance of the <see cref="CleanupFunction"/> class.
        /// </summary>
        /// <param name="fileCleanup">The injected <see cref="IFileCleanup"/> instance.</param>
        public CleanupFunction(IFileCleanup fileCleanup)
        {
            this.fileCleanup = fileCleanup;
            this.logger = Log.ForContext<CleanupFunction>();
        }

        /// <summary>
        /// Cleans up existing blobs that are older than 90 days.
        /// </summary>
        /// <param name="timer">The Timer Info.</param>
        /// <returns>A Task.</returns>
        public async Task Cleanup([TimerTrigger(OneAmEveryDay)]TimerInfo timer)
        {
            try
            {
                using (this.logger.BeginTimedOperation("Blob cleanup"))
                {
                    await this.fileCleanup.Cleanup();
                }
            }
            catch (Exception e)
            {
                this.logger.Error(e, "Error cleaning up blobs");
            }
        }
    }

This is my Program.cs file:

    public sealed class Program
    {
        /// <summary>
        /// The main function.
        /// </summary>
        public static void Main()
        {
            var builder = new HostBuilder();

            builder.ConfigureWebJobs(b =>
            {
                b.AddAzureStorageCoreServices();
                b.AddAzureStorage();
                b.AddTimers();
            });

            builder.ConfigureAppConfiguration((context, b) =>
            {
                b.AddJsonFile("appsettings.json");
                b.AddUserSecrets<Program>();
                b.AddEnvironmentVariables();
            });

            builder.ConfigureLogging((context, b) =>
            {
                ConfigureSerilog(context);
                b.AddSerilog();
            });

            builder.ConfigureServices((context, services) =>
            {
                var storageConfig = new StorageConfiguration
                {
                    ConnectionString = context.Configuration["AzureWebJobsStorage"],
                    Prefix = context.Configuration["QueuePrefix"]
                };

                services.AddSingleton<IJobActivator, JobActivator>();

                services.AddTransient<CleanupFunction, CleanupFunction>();
                services.AddScoped<IFileCleanup, FileCleanup>();

                // ...
            });

            using (var host = builder.Build())
            {
                Log.Information("Service starting");
                try
                {
                    host.Run();
                }
                catch (Exception e)
                {
                    Log.Fatal(e, "A fatal error occured!");
                    throw;
                }

                Log.Information("Service stopping");
            }
        }

        private static void ConfigureSerilog(HostBuilderContext context)
        {
            // ...
        }
    }
StingyJack commented 4 years ago

We use timers to renew singleton locks and those timers run in the same process as the WebJob. If your WebJob functions use up enough resources such that the timers cannot fire in a reasonable time, the renewals will not occur and you'll end up with this error.

@brettsam - I glossed over this before, but EEK! Timers in Windows services and other long-running processes were always problematic in .NET, and it wasn't just from using "the wrong" timer class. You should strongly consider rewriting that to use something with more reliable and predictable behavior.

brettsam commented 4 years ago

@StingyJack -- Yes, there's no great solution other than moving off of timers completely. But you need something to renew the lease, or to move the lease renewal out-of-proc, or to switch to some external service to manage locks. All are troublesome, and since this has always been resolvable by the customer with some code changes, we've just punted it over and over. Any suggestions for improvements here are welcome :-)

@hisuwh -- I see you're registering some custom services. What if you back them all out so you're only registering the bare minimum? Those constructors are being called, most likely (you can set breakpoints to see), but without knowing exactly what all of your services are doing during startup, it's tough for me to diagnose. If you're able to zip up the project and mail to me I'd be glad to try to repro this myself as well. I'd also recommend running with await host.RunAsync() rather than Run(), but I'm not sure if that's the entire issue.
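A sketch of that last suggestion (async entry point; the builder configuration is unchanged from the Program.cs above, just elided here):

    using System.Threading.Tasks;
    using Microsoft.Extensions.Hosting;

    public sealed class Program
    {
        public static async Task Main()
        {
            var builder = new HostBuilder();
            // ... same ConfigureWebJobs / ConfigureAppConfiguration /
            //     ConfigureLogging / ConfigureServices calls as above ...

            using (var host = builder.Build())
            {
                // Await the host instead of blocking in host.Run().
                await host.RunAsync();
            }
        }
    }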

hisuwh commented 4 years ago

@brettsam I can't really send you the entire source code. I will give your second suggestion a try.

derekrprice commented 4 years ago

Related to Azure/azure-functions-host#1864?

I'm seeing this problem with a blob trigger in a scaled-up (8-pod) Azure Node Functions runtime Docker image in AKS. It runs fine for several days, then I see this error message and all 8 pods freeze up for over 24 hours. Restarting 1 pod had no effect. Restarting another 3 pods, effectively at random, got everything moving again for all 8 pods. Maybe I released a lock? The backlog included hundreds of unprocessed blobs, so I don't think the problem could be a single lock on even a small subset of those blobs.

derekrprice commented 4 years ago

Anyone have any idea what may be causing this? We're still seeing it. It appears to occur pretty consistently and once we see the error, processing in our 6 concurrently running containers stops, usually for many hours.

brettsam commented 4 years ago

@derekrprice -- It's caused by thread starvation. The timers that renew locks are not able to get any processor time, so they fail.

The main recommendations are to make sure everything is async everywhere possible and/or reduce function throughput (the specifics depend on what your triggers are and what your functions are doing).
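As a concrete (entirely made-up) illustration of the kind of pattern that starves the thread pool versus its async equivalent:

    using System.IO;
    using System.Net.Http;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;

    public static class LogProcessorSketch
    {
        private static readonly HttpClient http = new HttpClient();

        // Anti-pattern: .Result blocks a thread pool thread for the whole call; under
        // load, enough of these can keep the lease-renewal timers from firing on time.
        public static void ProcessBlocking([BlobTrigger("logs/{name}")] Stream blob, string name)
        {
            var extra = http.GetStringAsync("https://example.invalid/enrich/" + name).Result;
            // ... parse blob + extra synchronously ...
        }

        // Preferred: await returns the thread to the pool while the I/O is pending.
        public static async Task ProcessAsync([BlobTrigger("logs/{name}")] Stream blob, string name)
        {
            var extra = await http.GetStringAsync("https://example.invalid/enrich/" + name);
            // ... parse blob + extra ...
        }
    }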

derekrprice commented 4 years ago

@brettsam Our throughput is fairly high but not ridiculously so (a few thousand invocations a day), and thread starvation doesn't make sense given the pattern that I am seeing. This is a fairly simple log processor based on blob triggers: IoT devices periodically upload logs, which take the functions app only a few seconds to process and save as JSON to a NoSQL DB. We have six pods running concurrently, and all I/O calls are asynchronous.

The processor appears to run anywhere from an hour to about a day, then we see this error message and everything stops. None of the six apps are processing anything at this point, so how could a thread be starved? If I restart the 6 pods, processing resumes on the backlog immediately. Restarting one doesn't always do the trick, but restarting all of the pods invariably does, as if one of the apps was holding a lock but not doing anything.

brettsam commented 4 years ago

Maybe you're seeing something unrelated -- what version of Microsoft.Azure.WebJobs.Extensions.Storage are you using?

derekrprice commented 4 years ago

We're using whatever version is installed in the latest Azure Functions 2.0 JS Runtime Docker image. How would I extract that version?

brettsam commented 4 years ago

I suspect you're seeing something unrelated to this issue. @soninaren -- is there a way that someone using bundles can move their version of Extensions.Storage forward to the latest?

soninaren commented 4 years ago

The updated extension bundle should ship this week, and it should stay up to date from now on. There is no way to move a single extension's version ahead of the bundle if you are using extension bundles.

adrianteixeira commented 4 years ago

I don't know why, but this error just started happening for me, even though my function has been running fine for 8 months.

It keeps hitting the same error; the retries didn't work.

Log:

[02/01/2020 20:31:45] Function started (Id=c159a1d1-ceb0-4d14-91e1-e5d76492c650)
[02/01/2020 20:31:45] Executing 'BlobQueueListener' (Reason='New queue message detected on 'dummyqueue'.', Id=c159a1d1-ceb0-4d14-91e1-e5d76492c650)
[02/01/2020 20:31:45] Trigger Details: MessageId: 6456e8e7-7e49-441d-9339-7525404df547, DequeueCount: 5, InsertionTime: 02/01/2020 20:19:52 +00:00
[02/01/2020 20:32:00] A ScriptHost error has occurred
[02/01/2020 20:32:00] Exception while executing function: BlobQueueListener. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict.
[02/01/2020 20:32:00] Exception while executing function: BlobQueueListener
[02/01/2020 20:32:00] Exception while executing function: BlobQueueListener. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict.
[02/01/2020 20:32:00] Function completed (Failure, Id=c159a1d1-ceb0-4d14-91e1-e5d76492c650, Duration=15135ms)
[02/01/2020 20:32:00]
[02/01/2020 20:32:00] Executed 'BlobQueueListener' (Failed, Id=c159a1d1-ceb0-4d14-91e1-e5d76492c650)
[02/01/2020 20:32:00] mscorlib: Exception while executing function: BlobQueueListener. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict.
[02/01/2020 20:32:00] Function had errors. See Azure WebJobs SDK dashboard for details. Instance ID is 'c159a1d1-ceb0-4d14-91e1-e5d76492c650'
[02/01/2020 20:32:00] mscorlib: Exception while executing function: BlobQueueListener. Microsoft.WindowsAzure.Storage: The remote server returned an error: (409) Conflict.
[02/01/2020 20:32:00] Message has reached MaxDequeueCount of 5. Moving message to queue 'dummyqueue-poison'.

adrianteixeira commented 4 years ago

@brettsam is there any update on this error?

I changed the queue to a Service Bus queue and keep getting the same error.

hisuwh commented 4 years ago

@brettsam I had changed my application to run the host async, but I am still seeing this issue. My service is crashing and restarting about 12 times a day. These are the logs I'm getting:

[11:42:21 INF] Logging configured
[11:42:22 INF] Application starting
[11:42:22 INF] LoggerFilterOptions
{
  "MinLevel": "Information",
  "Rules": [
    {
      "ProviderName": "Serilog.Extensions.Logging.SerilogLoggerProvider",
      "CategoryName": null,
      "LogLevel": "Trace",
      "Filter": null
    }
  ]
}
[11:42:22 INF] FunctionResultAggregatorOptions
{
  "BatchSize": 1000,
  "FlushTimeout": "00:00:30",
  "IsEnabled": true
}
[11:42:22 INF] QueuesOptions
{
  "BatchSize": 16,
  "NewBatchThreshold": 8,
  "MaxPollingInterval": "00:01:00",
  "MaxDequeueCount": 5,
  "VisibilityTimeout": "00:00:00"
}
[11:42:22 INF] SingletonOptions
{
  "LockPeriod": "00:00:15",
  "ListenerLockPeriod": "00:01:00",
  "LockAcquisitionTimeout": "10675199.02:48:05.4775807",
  "LockAcquisitionPollingInterval": "00:00:05",
  "ListenerLockRecoveryPollingInterval": "00:01:00"
}
[11:42:22 INF] BlobsOptions
{
  "CentralizedPoisonQueue": false
}
[11:42:22 INF] Starting JobHost
[11:42:22 INF] Found the following functions:
Company.Application.CleanupFunction.Cleanup
Company.Application.ApplicationFunction.ProcessQueueMessage

[11:42:23 INF] Job host started
Application started. Press Ctrl+C to shut down.
Hosting environment: Production
Content root path: /app/
[11:49:23 INF] The next 5 occurrences of the 'Cleanup' schedule (Cron: '0 0 1 * * *') will be:
03/05/2020 01:00:00+00:00 (03/05/2020 01:00:00Z)
03/06/2020 01:00:00+00:00 (03/06/2020 01:00:00Z)
03/07/2020 01:00:00+00:00 (03/07/2020 01:00:00Z)
03/08/2020 01:00:00+00:00 (03/08/2020 01:00:00Z)
03/09/2020 01:00:00+00:00 (03/09/2020 01:00:00Z)

[11:56:28 ERR] Singleton lock renewal failed for blob '6aaf581a4d74d37580b1ebda440a1171/Company.Application.CleanupFunction.Cleanup.Listener' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-03-04T11:53:23.638Z (184838 milliseconds ago) with a duration of 23 milliseconds. The lease period was 60000 milliseconds.
Unhandled exception. Microsoft.WindowsAzure.Storage.StorageException: The lease ID specified did not match the lease ID for the blob.
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
   at Microsoft.Azure.WebJobs.Host.StorageBaseDistributedLockManager.SingletonLockHandle.RenewAsync(ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host.Storage\Singleton\BlobLeaseDistributedLockManager.cs:line 348
   at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.ExecuteAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Singleton\SingletonManager.cs:line 334
   at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.RunAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\TaskSeriesTimer.cs:line 147
   at Microsoft.Azure.WebJobs.Host.Timers.WebJobsExceptionHandler.<>c__DisplayClass3_0.<OnUnhandledExceptionAsync>b__0() in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\WebJobsExceptionHandler.cs:line 55
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ThreadHelper.ThreadStart()
Request Information
RequestID:2f884b2f-b01e-002b-531b-f2c5bd000000
RequestDate:Wed, 04 Mar 2020 11:56:28 GMT
StatusMessage:The lease ID specified did not match the lease ID for the blob.
ErrorCode:LeaseIdMismatchWithLeaseOperation
ErrorMessage:The lease ID specified did not match the lease ID for the blob.
RequestId:2f884b2f-b01e-002b-531b-f2c5bd000000
Time:2020-03-04T11:56:28.4660786Z
drewdelano commented 2 years ago

We're seeing this too. There seems to always be a spike in the CPU right around when this happens (it goes to 100% and stays there) but I'm not sure if that's cause or effect.

brettsam commented 2 years ago

That's likely the cause -- these conflicts seem to always arise when function code is causing thread-pool starvation. The timers we use to renew locks don't fire on time and the lease can be lost, causing this error log. The recommendation is to evaluate your code for issues that may be causing thread pool starvation and try to remove them.

sm-cdecker commented 1 year ago

I've been seeing this issue occur with low CPU usage. Our timer trigger will start running on PodA, and then PodB will somehow grab the lock, which causes PodA to restart. I'm honestly unsure what could be going wrong here. The previous run took less than a second.