So I found another workaround that doesn't require re-creating the entire lease collection. If you identify the documents in your lease collection that contain state=1 and Owner=null, set ContinuationToken=null on them. Here is the before/after and the final state once the lease resumed and got out of the errored state. Unsure why the ContinuationToken gets out of whack.
The partition change feed trigger stops sending input triggers to the Azure Function: state=1 and Owner=null.
{
"state": 1,
"PartitionId": "7",
"Owner": null,
"ContinuationToken": "\"245895\"",
"SequenceNumber": 4726
}
Update the document to state=2 (healthy) and ContinuationToken=null.
{
"state": 2,
"PartitionId": "7",
"Owner": null,
"ContinuationToken": null,
"SequenceNumber": 4726
}
ContinuationToken=<new_value>
, Owner=<new_guid>
and SequenceNumber
increments.
{
"state": 2,
"PartitionId": "7",
"Owner": "e40a9970-6f13-4fe3-9581-7e5306d982e5",
"ContinuationToken": "\"246479\"",
"SequenceNumber": 4727
}
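For reference, here is a minimal sketch of that reset step using the DocumentDB .NET SDK; the database/collection names are placeholders and the lease property names (state, Owner, ContinuationToken) are taken from the documents shown above. Keep in mind that clearing the ContinuationToken discards whatever changes occurred while the lease was stuck, as noted below.

```csharp
// Minimal sketch (not production code): find lease documents that look stuck
// (state=1 / Available with no Owner) and reset them the same way as the
// before/after JSON above. "mydb" and "leases" are placeholder names.
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class LeaseReset
{
    public static async Task ResetStuckLeasesAsync(DocumentClient client)
    {
        Uri leaseCollectionUri = UriFactory.CreateDocumentCollectionUri("mydb", "leases");

        var stuck = client.CreateDocumentQuery<Document>(
                leaseCollectionUri,
                "SELECT * FROM c WHERE c.state = 1 AND (NOT IS_DEFINED(c.Owner) OR IS_NULL(c.Owner))",
                new FeedOptions { EnableCrossPartitionQuery = true })
            .AsEnumerable()
            .ToList();

        foreach (Document lease in stuck)
        {
            // Same edit as the manual fix: mark the lease healthy and clear the token.
            lease.SetPropertyValue("state", 2);
            lease.SetPropertyValue("ContinuationToken", null);
            await client.ReplaceDocumentAsync(lease.SelfLink, lease);
        }
    }
}
```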
After modifying the lease collection partition document, change feed events are no longer hung. Hoping someone can shed some light on why this happens. Seen this with numerous change feed functions we are using.
The downside is that any events that occurred while the lease was hung are lost (since we reset the continuation token to get back to a healthy state). This is a major reliability hurdle for us (and the 2nd such critical issue we've identified with cosmos triggers).
We have numerous change feeds (10+) for various eventing scenarios, and ALL of them failed at the same ContinuationToken for the same partition. We had to perform the above workaround 10x, once for each lease collection.
From a discovery perspective, we noticed that when you disable the Azure Function (set disabled=true in function.json), all documents in the lease collection change to state=1 and Owner=null (function host has stopped).
Wondering if this hang has something to do with the lease being paused and unable to resume after the function has been disabled/re-enabled. The odd thing is that some partitions have their state stuck (state=1) while others do not (state=2) in the same partitioned db lease collection, even after the function has been re-enabled.
@ealsur - this probably needs to be in the extensions repo that has cosmos db binding in it. Could you investigate?
@alohaninja thanks for the detailed information. A couple more questions: when this happens, do the leases always show state=1, and is it always the same PartitionId (attribute in the lease)?
The Trigger implements the Change Feed Processor Library; according to its source code, state 1 means Available to be picked up. The Library will attempt to pick up available leases automatically.
The Library uses the Change Feed, which consumes RUs to be read. If the collection is being rate limited, then the reads of the Feed might be receiving a 429 for a particular partition range (in the case of hot partitions).
@ealsur here are the answers to your inquiry. We are continuing to see business interruptions with this design - assuming that every time we exhaust RU/s we seem to get 1-2 partitions that get stuck not processing change feed responses. At this point, I'm not 100% certain it's related to RU/s but I can't seem to find anything else that would be causing invalid leases. Would be nice to review the logs of the Change Feed Processor Library we use.
The stuck leases always show state=1. It is the same PartitionId across all change feeds when this happens. However, the ContinuationToken is not always the same; it's slightly different. Subsequent outages will not hit the same PartitionId - it fluctuates between outage instances. At this point, because this is happening almost daily (it happened once already this morning), we've written a script to search all (8) lease collections (3x cosmos instances for our test environments) and update the invalid lease state so that messages will be processed. I haven't automated it yet, as I'm trying to correlate and track the outages, but I may have to if we can't get this fixed soon. We're spending a lot of time watching the change feed leases to make sure they are properly processing messages.
Perhaps this issue is related to an existing race condition in the cosmos change feed library and isn't related to RU/s at all?
@alohaninja Could you provide with some account information like Subscription ID, account name and collection where you are experiencing this? Lease collection name if possible too. I want to see if I can access any logs or metrics on our side.
Do you see in the Portal metrics under the Throughput tab, 429s in the collection being monitored?
It is weird that all your Function instances are stopping at the same time. If it was the case of the linked issue, changing the lease state would not work, since the Observer is stopped.
@mkolt Could the Processor be dropping the lease (the one for the partition that seems to be high on RUs), changing its state to 1 (Available), removing the Owner, and never picking it back up?
@ealsur There are usually 429s in the collection but I'm not convinced that's it - I suspect it could be related to load, but I don't have an alert when it happens yet to correlate the timing. The timestamp for the document in the lease database continues to be updated even when it is in the hung state (state=1), so I don't know when exactly it goes offline.
Do you have an email I can send you the subId/account/collection details? Prefer not to make it public.
@alohaninja Sure thing: maquaran{at}microsoft.com
The lease could be dropped (owner set to empty, state = available) due to multiple reasons; the lease can also expire if one of the processors dies, etc. All this is totally fine. Any such lease will be picked up next time any processor does lease acquire, this happens periodically with ChangeFeedOptions.LeaseAcquireInterval.
Thanks @mkolt ! @alohaninja Is your App Service using multiple instances? Is the amount of instances changing over time (auto-scale or manual-scale)?
@ealsur emailed you the subscription details, please share with @mkolt .
Our App Service has (2) fixed S3 instances, we are NOT using auto-scale.
@mkolt - the lease partition document never resumes the lease; it remains stuck indefinitely.
The statement "Any such lease will be picked up next time any processor does lease acquire, this happens periodically with ChangeFeedOptions.LeaseAcquireInterval" doesn't seem to match what we see. The lease is never renewed/re-acquired.
@alohaninja can you try using Change Feed processor directly (with different lease collection)? The idea is to see whether standalone Change Feed processor would hit same issue.
@alohaninja When the lease gets lost:
Do the other leases have Owner values, and are they divided into 2 groups?
Do the rest of the Partitions keep triggering or does the Function stop completely?
@mkolt I'll try to setup the change feed processor directly using a different lease collection, have not tried that yet.
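For anyone wanting to try the same thing, a rough sketch of what that standalone setup could look like with the v1 library (DocumentDB.ChangeFeedProcessor) is below. The endpoint, key and collection names are placeholders, DocumentFeedObserver stands in for the observer class from the library's sample code, and the option values are illustrative, not recommendations.

```csharp
// Sketch: run the Change Feed Processor Library directly against a dedicated
// lease collection, separate from the one used by the Functions trigger.
using System;
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents.Client;

public static class StandaloneChangeFeed
{
    public static async Task RunAsync()
    {
        var monitored = new DocumentCollectionInfo
        {
            Uri = new Uri("https://myaccount.documents.azure.com:443/"), // placeholder account
            MasterKey = "<primary-key>",
            DatabaseName = "mydb",
            CollectionName = "monitored"
        };

        var leases = new DocumentCollectionInfo
        {
            Uri = monitored.Uri,
            MasterKey = monitored.MasterKey,
            DatabaseName = "mydb",
            CollectionName = "leases-standalone" // not shared with the Functions trigger
        };

        var feedOptions = new ChangeFeedOptions { MaxItemCount = 100 }; // default page size
        var hostOptions = new ChangeFeedHostOptions
        {
            LeaseAcquireInterval = TimeSpan.FromSeconds(13) // how often Available (state=1) leases are re-acquired
        };

        var host = new ChangeFeedEventHost("changedemo", monitored, leases, feedOptions, hostOptions);
        await host.RegisterObserverAsync<DocumentFeedObserver>();

        Console.WriteLine("Observing change feed; press Enter to stop.");
        Console.ReadLine();
        await host.UnregisterObserversAsync();
    }
}
```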
@ealsur responses:
Do the other leases have Owner and are the Owner values divided into 2 groups? The idea is that if you have an App Service with 2 instances, the leases should be divided between 2 owners. I want to see if, when the issue is hit, each owner has the same # of leases or not.
_The other leases have Owner values and are typically split between the 25 partitions. Just now it happened and 11 are assigned to one owner (1,11,13,14,15,17,18,20,21,23), and 13 to another owner (0,2,3,4,5,6,7,8,9,10,12,16,22). The PartitionId that fails is not always the same; this time it was 24, yesterday it was 2. After resetting the state & ContinuationToken manually, the Owner with 11 picked up the new lease just fine._
Do the rest of the Partitions keep triggering or does the Function stop completely?
_The other partitions continue triggering events just fine, so it's just a partial failure of the lease for specific partition(s). This makes it especially hard to notice because the function still processes events for other partitions._
Note: When it failed this morning, no RU/s were exhausted, so it doesn't seem related to exceeding capacity, perhaps it's load or a race condition?
If the other partitions keep working, then it's not the case of the linked issue, that issue halted the entire Observer (no more function executions).
It's a very weird case, since the Observer keeps processing the other PK Ranges.
@ealsur I finally got around to setting up a demo app with partition observers using the Change Feed Processor Library sample code, just waiting to see if they handle the lease interruptions the same way. Have them running now and hope to report back soon - running just a single change feed event host for each monitored collection.
For now, I just created a watchdog function app that auto-fixes the broken lease collection partition states, polling every minute using the script previously created. The issue is still happening regularly across multiple collections/function apps. The watchdog also logs to Application Insights so we can better track when the outages occur.
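In case it helps anyone, a rough sketch of what such a watchdog could look like as a timer-triggered C# function (V2-style signature) is below. It assumes the hypothetical LeaseReset.ResetStuckLeasesAsync helper from the earlier sketch and placeholder connection details, and would need to be repeated per lease collection.

```csharp
// Sketch: timer-triggered watchdog that resets stuck leases every minute.
// LeaseReset.ResetStuckLeasesAsync is the hypothetical helper sketched earlier.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class LeaseWatchdog
{
    private static readonly DocumentClient Client = new DocumentClient(
        new Uri("https://myaccount.documents.azure.com:443/"), "<primary-key>"); // placeholders

    [FunctionName("LeaseWatchdog")]
    public static async Task Run([TimerTrigger("0 */1 * * * *")] TimerInfo timer, ILogger log)
    {
        log.LogInformation("Checking lease collection for stuck partitions...");
        await LeaseReset.ResetStuckLeasesAsync(Client);
    }
}
```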
@ealsur The demo app is also seeing the lease partition failures; I was able to reproduce this with the Change Feed Processor Library. The close reason it reports (ChangeFeedObserverCloseReason.Unknown) is not useful though, so I am now monitoring the collections using trace output from the Change Feed Processor Library for more details.
Here is the log output the DocumentFeedObserver reported...
Change feed: PartitionId 22 total 699 doc(s)
72ac4b93bcb24e299262166a9be8c864
Observer closed, 22
Reason for shutdown, Unknown
Observer opened for partition Key Range: 22
Observer closed, 22
Reason for shutdown, Unknown
The partition observer closed mid-stream while processing (22) docs. It seems that it opened and closed the partition like this 8 times until the state was reset. The only reason it resumed is because the watchdog we have that runs every minute reset the invalid state document in the cosmos lease db.
Very interesting @alohaninja, so the issue is not directly related to the Azure Function, but to the underlying library.
@mkolt Are there any extra logging settings that could be added to his sample app to unearth the real reason behind the Unknown? Could it be a Max Batch Size issue?
@ealsur here is the exception I captured from the trace output of Change Feed Processor Library. This error occurs repeatedly in the change feed library until we reset the state of the partitioned lease in cosmos db.
The page size is too big to preserve items that belong to same transaction. Reduce page size and try again.
[2018/01/22 10:43:19.398] [Error] Partition 7 exception: Microsoft.Azure.Documents.BadRequestException: Message: {"Errors":["The page size is too big to preserve items that belong to same transaction. Reduce page size and try again."]}
ActivityId: a9099787-db3f-405b-8abd-757239864554, Request URI: /apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, RequestStats:
ResponseTime: 2018-01-22T16:43:19.3455416Z, StoreReadResult: StorePhysicalAddress: rntbd://bn6prdapp03-docdb-1.documents.azure.com:14154/apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, LSN: 306756, GlobalCommittedLsn: 306755, PartitionKeyRangeId: 7, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: False, RequestCharge: 1, ItemLSN: -1, ResourceType: Document, OperationType: ReadFeed
, SDK: Windows/10.0.15063 documentdb-netcore-sdk/1.7.1
at Microsoft.Azure.Documents.StoreReadResult.ToResponse(RequestChargeTracker requestChargeTracker)
at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__2`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.Client.DocumentClient.<ReadFeedAsync>d__254.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<GetFeedResponseAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<ReadDocumentChangeFeedPrivateAsync>d__15`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass1_0.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass28_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 326
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass28_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 379
[2018/01/22 10:43:19.400] [Information] Releasing lease for partition 7 due to an error, reason: Unknown!
[2018/01/22 10:43:19.408] [Information] Host 'changedemo' successfully removed PartitionId '7' with lease token '"b400eba2-0000-0000-0000-5a66149c0000"' from currently owned partitions.
[2018/01/22 10:43:19.408] [Information] Host 'changedemo' closing event processor for PartitionId '7' and lease token '"b400eba2-0000-0000-0000-5a66149c0000"' with reason 'Unknown'
[2018/01/22 10:43:19.413] [Information] Host 'changedemo' releasing partition 7...
[2018/01/22 10:43:19.414] Observer closed, Unknown
[2018/01/22 10:43:19.414] Reason for shutdown, Unknown
[2018/01/22 10:43:19.415] [Information] Partition 7: worker finished!
Which ChangeFeedOptions.MaxItemCount do you use?
@mkolt - the sample app does not assign any value to ChangeFeedOptions.MaxItemCount, so it's null in our demo app as well. Does that mean an unlimited number of items in the feed?
What is this value set to for Azure change feed triggers? Not being able to assign ChangeFeedOptions for cosmos function triggers is an issue I previously reported.
In any event, when this error occurs, the partition lease is frozen from processing new events until you reset the state in the cosmos lease db. This seems to be the root cause of the reported issue.
@alohaninja - no, this means “use default”, which is 100. Can you try to reduce and see which value? A change for the CFP (on GitHub; functions integration will come some time later) to take care of this is coming soon.
@alohaninja Right now in Azure Functions the setting is not settable, but it is coming shortly.
@mkolt - what do you mean by "Can you try to reduce and see which value?" - do you want me to figure out which value makes it run without crashing? I'm not clear on your intent.
It is consistently throwing the same error at ChangeFeedOptions.MaxItemCount = 100 => The page size is too big to preserve items that belong to same transaction. Reduce page size and try again.
Seems like the size of the documents in the batch of 100 is overflowing a buffer? Am I just trying to identify how to reduce this threshold so that it doesn't crash? Not sure this will fix anything for the Azure Functions, as we can't change it there yet. Will setting the site config use32BitWorkerProcess=false increase the page size so that it doesn't overflow?
Yes, that's what I meant. For the standalone Change Feed Processor this is fixed (sync to current). For Functions, it will take some time, hopefully not very long.
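For the standalone processor, the page size can also be lowered up front by passing a smaller ChangeFeedOptions.MaxItemCount when constructing the host. A small sketch follows; monitored and leases are the placeholder DocumentCollectionInfo objects from the standalone setup sketch earlier in this thread, and DocumentFeedObserver is again the sample observer. This is separate from the Functions trigger, where the setting is not yet exposed.

```csharp
// Sketch: lower the change feed page size for the standalone processor.
// Default MaxItemCount is 100; a smaller page keeps batches of large documents
// under the response-size limit behind the "page size is too big" error.
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents.Client;

public static class SmallerPages
{
    public static Task StartAsync(DocumentCollectionInfo monitored, DocumentCollectionInfo leases)
    {
        var feedOptions = new ChangeFeedOptions { MaxItemCount = 25 }; // instead of the 100 default
        var host = new ChangeFeedEventHost("changedemo", monitored, leases, feedOptions, new ChangeFeedHostOptions());
        return host.RegisterObserverAsync<DocumentFeedObserver>();
    }
}
```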
@ealsur / @mkolt - grabbed the latest CFP source and here is the new error output; it's still crashing the lease partition state. You can see where the changes you made to ChangeFeedEventHost generate new output instead of the previous page-size error.
[2018/01/24 02:10:24.475] [Warning] Reducing maxItemCount, new value: 50.
[2018/01/24 02:10:29.583] [Warning] Reducing maxItemCount, new value: 25.
[2018/01/24 02:10:34.683] [Warning] Reducing maxItemCount, new value: 12.
[2018/01/24 02:10:39.778] [Warning] Reducing maxItemCount, new value: 6.
[2018/01/24 02:10:44.870] [Warning] Reducing maxItemCount, new value: 3.
[2018/01/24 02:10:49.964] [Warning] Reducing maxItemCount, new value: 1.
[2018/01/24 02:10:55.451] [Error] Partition 24 exception: Microsoft.Azure.Documents.InternalServerErrorException: Message: {"Errors":["An unknown error occurred while processing this request."]}
ActivityId: 0374c881-c656-4e33-9200-00410870eb5a, Request URI: /apps/6628b461-75d4-4e4a-9897-ada4076dc30c/services/1a1bc4d9-3744-42ea-beec-00c84a423ce7/partitions/dc2dfff7-29ae-4e22-a0ed-185dffe099ea/replicas/131576284970258222s/, RequestStats:
ResponseTime: 2018-01-24T20:10:55.1040258Z, StoreReadResult: StorePhysicalAddress: rntbd://bn6prdddc05-docdb-1.documents.azure.com:18884/apps/6628b461-75d4-4e4a-9897-ada4076dc30c/services/1a1bc4d9-3744-42ea-beec-00c84a423ce7/partitions/dc2dfff7-29ae-4e22-a0ed-185dffe099ea/replicas/131576284970258222s/, LSN: 841525, GlobalCommittedLsn: 841524, PartitionKeyRangeId: 24, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: False, RequestCharge: 1, ItemLSN: -1, ResourceType: Document, OperationType: ReadFeed
, SDK: Windows/10.0.15063 documentdb-netcore-sdk/1.7.1
at Microsoft.Azure.Documents.StoreReadResult.ToResponse(RequestChargeTracker requestChargeTracker)
at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__2`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.Client.DocumentClient.<ReadFeedAsync>d__254.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<GetFeedResponseAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<ReadDocumentChangeFeedPrivateAsync>d__15`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass1_0.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 327
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 393
[2018/01/24 02:10:55.468] [Information] Releasing lease for partition 24 due to an error, reason: Unknown!
[2018/01/24 02:10:55.489] [Information] Host 'changedemo' successfully removed PartitionId '24' with lease token '"11006070-0000-0000-0000-5a68e83f0000"' from currently owned partitions.
[2018/01/24 02:10:55.489] [Information] Host 'changedemo' closing event processor for PartitionId '24' and lease token '"11006070-0000-0000-0000-5a68e83f0000"' with reason 'Unknown'
[2018/01/24 02:10:55.515] [Information] Host 'changedemo' releasing partition 24...
[2018/01/24 02:10:55.517] Observer closed, 24
[2018/01/24 02:10:55.517] Reason for shutdown, Unknown
Please try again. This is a slightly different issue: with maxItemCount=1 there is a long transaction that doesn't fit into the max response size. That must come from a script modifying lots of big documents as part of one transaction, and currently the continuation token cannot address a part of a transaction. Soon this is going to be supported.
Looks like it's fine now.
@mkolt looks like I'm still seeing this error: An unknown error occurred while processing this request. We do use stored procedures to bulk update documents - is this likely the cause? Not sure where else we would ever have a large transaction of documents being changed all at once.
@alohaninja looks like the partition that hit that again was able to move on. Yes, that's coming from the script. Let us know if you hit this again.
@mkolt - the partition only moves on because we manually update the state and null the ContinuationToken in the lease db collection. It will stay stuck at the ContinuationToken because the CFP is unable to move forward.
@alohaninja Can you go back to problem state and try again?
@mkolt - I turned off our watchdog that auto-updates the state, and it looks like the CFP now continues from the previous reset point, updating state=1 (unhealthy) to state=2 (healthy). Seems like there is some progress now; previously no docs were flowing through the change feed once state=1. The ContinuationToken also updates as the change feed processes forward from the state transition.
{
"state": 1,
"PartitionId": "24",
"Owner": null,
"ContinuationToken": "\"842525\"",
"SequenceNumber": 1880
}
{
"state": 2,
"PartitionId": "24",
"Owner": "changedemo",
"ContinuationToken": "\"843154\"",
"SequenceNumber": 1889
}
I'll leave our watchdog ON, just reduce the frequency threshold to see if we have this fixed. Nice job!
@mkolt seeing this issue coming back today consistently with one of our monitored collections. Seems to be getting stuck as our watchdog is catching it and correcting it every 15 min instead of every 1 min. Is this related to a large transaction?
[2018/01/29 01:47:44.027] [Verbose] Worker task waiting for start signal: partition '7'
[2018/01/29 01:47:44.027] [Verbose] Worker task started: partition '7'
[2018/01/29 01:47:44.027] Observer opened for partition Key Range: 7
[2018/01/29 01:47:44.027] [Verbose] Worker start: partition '7', continuation '"309935"'
[2018/01/29 01:47:44.142] [Warning] Reducing maxItemCount, new value: 50.
[2018/01/29 01:47:49.257] [Warning] Reducing maxItemCount, new value: 25.
[2018/01/29 01:47:54.384] [Warning] Reducing maxItemCount, new value: 12.
[2018/01/29 01:47:59.497] [Warning] Reducing maxItemCount, new value: 6.
[2018/01/29 01:48:04.668] [Warning] Reducing maxItemCount, new value: 3.
[2018/01/29 01:48:09.782] [Warning] Reducing maxItemCount, new value: 1.
[2018/01/29 01:48:14.893] [Error] Partition 7 exception: Microsoft.Azure.Documents.InternalServerErrorException: Message: {"Errors":["An unknown error occurred while processing this request."]}
ActivityId: 9e5505d7-473f-4d85-8d4b-0aec95ebb01f, Request URI: /apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, RequestStats:
ResponseTime: 2018-01-29T19:48:14.8828306Z, StoreReadResult: StorePhysicalAddress: rntbd://bn6prdapp03-docdb-1.documents.azure.com:14154/apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, LSN: 310020, GlobalCommittedLsn: 310019, PartitionKeyRangeId: 7, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: False, RequestCharge: 1, ItemLSN: -1, ResourceType: Document, OperationType: ReadFeed
, SDK: Windows/10.0.16299 documentdb-netcore-sdk/1.7.1
at Microsoft.Azure.Documents.StoreReadResult.ToResponse(RequestChargeTracker requestChargeTracker)
at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__2`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.Client.DocumentClient.<ReadFeedAsync>d__254.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<GetFeedResponseAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<ReadDocumentChangeFeedPrivateAsync>d__15`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass1_0.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 327
at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 327
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 393
[2018/01/29 01:48:14.893] [Information] Releasing lease for partition 7 due to an error, reason: Unknown!
[2018/01/29 01:48:14.893] [Information] Host 'changedemo' successfully removed PartitionId '7' with lease token '"3b002cdb-0000-0000-0000-5a6f7a700000"' from currently owned partitions.
[2018/01/29 01:48:14.893] [Information] Host 'changedemo' closing event processor for PartitionId '7' and lease token '"3b002cdb-0000-0000-0000-5a6f7a700000"' with reason 'Unknown'
[2018/01/29 01:48:14.893] [Information] Host 'changedemo' releasing partition 7...
[2018/01/29 01:48:14.893] Observer closed, 7
[2018/01/29 01:48:14.893] Reason for shutdown, Unknown
[2018/01/29 01:48:14.894] [Information] Partition 7: worker finished!
@alohaninja Yes, large transaction, different account (now test). Should be mitigated now.
Still seeing the issue daily where the lease needs to be reset; our watchdog script is still resetting invalid states - likely due to a large transaction update. I will be refactoring the large transaction to occur outside a stored procedure to see if this improves things until we get large-transaction support in the CFP.
The stored procedure is exhausting server-bound resources - it occasionally receives "the script with id 'xxxxx is blocked for execution because it has violated its allowed resource limit several times." - so it needs to be redesigned as a client-side script, rather than being called repeatedly with continuation tokens, as it is taking too long to execute with server-bound resources.
There don't seem to be any signals of stuck partitions during the last 3 days. We can take a quick look at your sproc, if you want. Share it via email.
@mkolt I just shared the sproc via email, hope you can identify any issues we have.
I tried refactoring the large bulk-update transaction from a server-side sproc to a client-side script, but it takes 2-3x the time (updating 6k 100KB docs takes 5-6 min server-side and 15-20 min client-side). Unfortunately there is no way to do partial document updates yet, so the client-side approach pays a penalty for retrieving and replacing the entire document each time, adding a proportionate amount of latency with each client/server cycle.
I can consistently fail the CFP when sprocs exceed their bounded execution or process a large batch of bulk updates, presumably related to larger-sized docs. Until this can be remedied, we have to keep our lease watchdog actively fixing broken partition leases.
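For context on the client-side approach mentioned above, here is a rough sketch of what such a bulk update loop looks like with the DocumentDB SDK; the query, property names and mutation are placeholders. Because there is no partial document update, each change reads and replaces the full (~100KB) document, which is where the extra latency comes from.

```csharp
// Sketch: client-side bulk update, replacing each full document (no partial updates).
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class ClientSideBulkUpdate
{
    public static async Task UpdateAsync(DocumentClient client)
    {
        Uri collectionUri = UriFactory.CreateDocumentCollectionUri("mydb", "monitored"); // placeholders

        var docs = client.CreateDocumentQuery<Document>(
                collectionUri,
                "SELECT * FROM c WHERE c.needsUpdate = true", // placeholder filter
                new FeedOptions { EnableCrossPartitionQuery = true, MaxItemCount = 100 })
            .AsEnumerable();

        foreach (Document doc in docs)
        {
            doc.SetPropertyValue("processed", true); // placeholder mutation
            // Full-document replace: the entire payload travels both ways on every update.
            await client.ReplaceDocumentAsync(doc.SelfLink, doc);
        }
    }
}
```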
@ealsur Not sure if it's related to this issue, but in my case the change feed doesn't get triggered the first time until I manually visit the Function App -> Change Feed Function -> function.json -> Logs. This change feed is being observed on a partitioned collection with 25k RUs. The Function is hosted on a consumption plan with a single lease (leases collection: 400 RUs). Once it gets triggered it works fine afterwards, but the first time I always have to navigate to the Logs as mentioned above.
The behavior is not tied to a specific Collection or Cosmos DB account but always reproducible by deploying this ARM template and running these load tests. The trigger is configured in code here.
@syedhassaanahmed please open a different issue, this is not related to SPROCs. The Consumption Plan is not creating instances for your Function App (that is why it's not triggering). When you open the Functions Portal, it forces the creation of one instance if there are none, and that is when it works.
Thanks @ealsur #2549
Any update? Also, is there any guidance for someone starting fresh to avoid such issues?
There is a PR that updates the CFP Library to a version that can handle batch size better for these scenarios. https://github.com/Azure/azure-webjobs-sdk-extensions/pull/410
Thanks. Any idea when the next version of the NuGet package will be released?
Since the fix is targeting V2, I don't know the release schedule. Maybe @paulbatum can shed more light on what could be an estimated time frame.
Since this PR is still pending review, I would guess the ETA for this to be merged and then deployed to prod to be around mid June.
@ealsur you say there is a fix targeting V2; what about customers running V1 Functions that are running into this problem? What's the mitigation for them?
@ryancrawcour V1 dependencies cannot be upgraded further due to a couple of hard-locked dependencies on Functions V1 (Newtonsoft.Json & the DocumentDB SDK).
What's the solution for customers running on V1?
Investigative information
Repro steps
Seems that after our change feeds have been running for a while, they intermittently just stop working. It could be related to RU/s being exhausted, but there is no recovery process. 429s were logged in the DB being monitored as it was under occasional load.
Expected behavior
Even when the DB being monitored is under load or receives 429s, the lease for the change feed should never get into a hung state. It should recover when the lease gets frozen. The input trigger should never stop firing.
Actual behavior
The input trigger stops firing and no new change feed docs are processed despite there still being traffic/activity in the lease db collection.
Known workarounds
The only known workaround I've been able to identify is to delete the lease collection and restart the function host. Restarting the function host recreates the lease collection (assuming createLeaseCollectionIfNotExists=true in function.json) and input triggers begin flowing in again. This is a very weak solution as we lose any change feed activity from the time the feed stopped responding.
Related information
Language: Javascript
AzureWebJobsHostLogs: stops writing trigger events when the change feed is hung/expired, even though change feed events are occurring and the lease collection has activity. The function host just stops checking for new events or is hung. Restarting the function app has no effect.
Lease Collection Partition State
Upon further inspection of the lease collection, all other partition documents contain state=2 and a valid Owner guid. One partition does not contain an owner and has state=1.
.Invalid Lease Collection Document
Other partitions are sending triggers to the function app, just not this partition stuck in state=1. It appears this issue is related to a partition state being stuck.