Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License

Cosmos Change Feed Input Trigger - Lease Stops Sending Triggers #2259

Closed alohaninja closed 6 years ago

alohaninja commented 6 years ago

Repro steps

After our change feeds have been running for a while, they intermittently stop working. It may be related to RU/s being exhausted, but there is no recovery process. 429s were logged in the monitored DB while it was under occasional load.

Expected behavior

Even when the monitored DB is under load or returning 429s, the change feed lease should never get into a hung state; if a lease does stall, it should recover on its own. The input trigger should never stop firing.

Actual behavior

The input trigger stops firing and no new change feed docs are processed despite there still being traffic/activity in the lease db collection.

Known workarounds

The only workaround I've been able to identify is to delete the lease collection and restart the function host. Restarting the function host recreates the lease collection (assuming createLeaseCollectionIfNotExists=true in function.json) and input triggers begin flowing again. This is a weak solution, as we lose any change feed activity from the time the feed stopped responding.
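
For context, the trigger binding in function.json looks roughly like this (a sketch: the database, collection, and connection-setting names here are placeholders, and the exact set of binding properties may differ from what we actually have):

{
    "type": "cosmosDBTrigger",
    "name": "documents",
    "direction": "in",
    "connectionStringSetting": "CosmosDBConnection",
    "databaseName": "mydb",
    "collectionName": "monitored",
    "leaseCollectionName": "leases",
    "createLeaseCollectionIfNotExists": true
}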

Related information

Language: JavaScript

AzureWebJobsHostLogs: stops recording trigger events when the change feed is hung/expired, even though change feed events are occurring and the lease collection has activity. The function host simply stops checking for new events, or is hung. Restarting the function app has no effect.

Lease Collection Partition State

Upon further inspection of the lease collection, all other partition documents have state=2 and a valid Owner guid. One partition document has no Owner and has state=1.

Invalid Lease Collection Document
{
    "state": 1,
    "PartitionId": "7",
    "Owner": null,
    "ContinuationToken": "\"245895\"",
    "SequenceNumber": 4726
}

Other partitions are sending triggers to the function app; only this partition stuck in state=1 is not. The issue appears to be related to a partition's lease state being stuck.

alohaninja commented 6 years ago

Known workarounds (Option 2)

I found another workaround that doesn't require re-creating the entire lease collection. Identify the documents in your lease collection that have state=1 and Owner=null, and set ContinuationToken=null. Here are the before/after documents and the final state once the lease recovered from the errored state. I'm unsure why the ContinuationToken gets out of whack.

Lease Collection Document (Error State)

The partition's change feed trigger stops sending input triggers to the Azure Function; state=1 and Owner=null.

{
    "state": 1,
    "PartitionId": "7",
    "Owner": null,
    "ContinuationToken": "\"245895\"",
    "SequenceNumber": 4726
}

Lease Collection Document (Modified State)

Update the state=2 (healthy) and ContinuationToken=null.

{
    "state": 2,
    "PartitionId": "7",
    "Owner": null,
    "ContinuationToken": null,
    "SequenceNumber": 4726
}

Lease Collection Document (Healthy State)

ContinuationToken=<new_value>, Owner=<new_guid> and SequenceNumber increments.

{
    "state": 2,
    "PartitionId": "7",
        "Owner": "e40a9970-6f13-4fe3-9581-7e5306d982e5",
    "ContinuationToken": "\"246479\"",
    "SequenceNumber": 4727
}

After modifying the lease collection partition document, change feed events are no longer hung. Hoping someone can shed some light on why this happens; we've seen it with numerous change feed functions we are using.

The downside is that any events that occurred while the lease was hung are lost (since we reset the continuation token to get back to a healthy state). This is a major reliability hurdle for us (and the 2nd such critical issue we've identified with cosmos triggers).

We have numerous change feeds (10+) for various eventing, and ALL of them failed at the same ContinuationToken for the same partition. We had to perform the above workaround 10 times, once for each lease collection.
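
For anyone else hitting this, the reset amounts to something like the following (a rough sketch against the .NET DocumentDB SDK rather than our actual script; the LeaseDoc shape is inferred from the lease documents above and the database/collection names are placeholders):

using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Newtonsoft.Json;

// Shape inferred from the lease documents shown above.
class LeaseDoc : Resource
{
    [JsonProperty("state")] public int State { get; set; }
    [JsonProperty("PartitionId")] public string PartitionId { get; set; }
    [JsonProperty("Owner")] public string Owner { get; set; }
    [JsonProperty("ContinuationToken")] public string ContinuationToken { get; set; }
    [JsonProperty("SequenceNumber")] public long SequenceNumber { get; set; }
}

static class LeaseResetter
{
    // Find leases stuck in state=1 with no Owner and reset them to state=2 with a
    // null ContinuationToken (the manual workaround described above). Note that
    // nulling the token means any changes that occurred while hung are skipped.
    public static async Task ResetStuckLeasesAsync(DocumentClient client, string databaseId, string leaseCollectionId)
    {
        Uri leaseCollectionUri = UriFactory.CreateDocumentCollectionUri(databaseId, leaseCollectionId);

        var stuckLeases = client.CreateDocumentQuery<LeaseDoc>(leaseCollectionUri)
            .Where(l => l.State == 1 && l.Owner == null)
            .AsEnumerable()
            .ToList();

        foreach (LeaseDoc lease in stuckLeases)
        {
            Console.WriteLine($"Resetting lease for PartitionId {lease.PartitionId}");
            lease.State = 2;
            lease.ContinuationToken = null;
            await client.ReplaceDocumentAsync(lease.SelfLink, lease);
        }
    }
}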

alohaninja commented 6 years ago

From a discovery perspective, we noticed that when you disable the Azure Function (set disabled=true in function.json), all documents in the lease collection change to state=1 and Owner=null (the function host has stopped).

Wondering if this hang has something to do with the lease being paused and unable to resume after the function has been disabled and re-enabled. The odd thing is that, even after the function has been re-enabled, some partitions in the same lease collection remain stuck (state=1) while others do not (state=2).

christopheranderson commented 6 years ago

@ealsur - this probably needs to be in the extensions repo that has cosmos db binding in it. Could you investigate?

ealsur commented 6 years ago

@alohaninja thanks for the detailed information. A couple more questions:

  1. Is this behavior over 1 collection?
  2. Do you have more than 1 Function monitoring the same collection? If so, does each Function have its own lease collection, or are they sharing the lease collection as well?
  3. Does the hang always occur on the same PartitionId (the attribute in the lease)?
  4. Is it running in App Service mode or on the Consumption Plan?

The Trigger is implemented on top of the Change Feed Processor Library; according to its source code, state 1 means Available to be picked up. The Library will attempt to pick up available leases automatically.

The Library reads the Change Feed, which consumes RUs; if the collection is being rate limited, the Feed reads might be receiving 429s for a particular partition range (in the case of hot partitions).

alohaninja commented 6 years ago

@ealsur here are the answers to your questions. We are continuing to see business interruptions with this design: it appears that every time we exhaust RU/s, 1-2 partitions get stuck and stop processing change feed responses. At this point I'm not 100% certain it's related to RU/s, but I can't find anything else that would be causing invalid leases. It would be nice to be able to review the logs of the Change Feed Processor Library in use.

Change Feed Feedback

  1. Is this behavior over 1 collection? Yes - we have multiple change feeds monitoring the same collection. Each feed/function seems to fail on the same partition when it gets stuck at state=1.
  2. Do you have more than 1 Function monitoring the same collection? If so, does each Function have its own lease collection, or are they sharing the lease collection as well? We currently have (8) functions monitoring the same Cosmos collection; each function must have its own lease collection. This is also happening across multiple slots, which we use for testing in a separate function app and Cosmos instance.
  3. Does the hang always occur on the same PartitionId (the attribute in the lease)? Within a single outage, it does appear that the hang occurs on the same PartitionId across all change feeds. However, the ContinuationToken is not always the same; it's slightly different. Subsequent outages are not on the same PartitionId - it fluctuates between outage instances.
  4. Is it running in App Service mode or on the Consumption Plan? We are using App Service mode (S3). We see this issue more in our DEV slots, which have a much lower RU/s threshold and tend to go over capacity more frequently.

Workaround Status

Because this is happening almost daily (it happened once already this morning), we've written a script that searches all (8) lease collections (across the 3 Cosmos instances for our test environments) and updates the invalid lease state so that messages are processed again. I haven't scheduled it to run automatically yet, as I'm trying to correlate and track the outages, but I may have to if we can't get this fixed soon. We're spending a lot of time watching the change feed leases to make sure they are properly processing messages.

Perhaps this issue is related to an existing race condition in the Cosmos change feed library and isn't related to RU/s at all?

ealsur commented 6 years ago

@alohaninja Could you provide some account information, like the Subscription ID, account name, and the collection where you are experiencing this? The lease collection name too, if possible. I want to see if I can access any logs or metrics on our side.

In the Portal metrics, under the Throughput tab, do you see 429s in the collection being monitored?

It is weird that all your Function instances are stopping at the same time. If this were the case of the linked issue, changing the lease state would not work, since the Observer is stopped.

@mkolt Could the Processor be dropping the lease (the one for the partition that seems to be high on RUs), changing its state to 1 (Available), removing the Owner, and never picking it back up?

alohaninja commented 6 years ago

@ealsur There are usually 429s in the collection, but I'm not convinced that's the cause - I suspect it could be related to load, but I don't yet have an alert that fires when it happens to correlate the timing. The timestamp on the document in the lease database continues to be updated even when it is in the hung state (state=1), so I don't know exactly when it goes offline.

Do you have an email I can send you the subId/account/collection details? Prefer not to make it public.

ealsur commented 6 years ago

@alohaninja Sure thing: maquaran{at}microsoft.com

mkolt commented 6 years ago

The lease could be dropped (owner set to empty, state = available) for multiple reasons; the lease can also expire if one of the processors dies, etc. All of this is normal. Any such lease will be picked up the next time any processor does a lease acquire, which happens periodically per ChangeFeedOptions.LeaseAcquireInterval.
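
For reference, the related timing knobs can be configured when using the library directly, roughly as below (a sketch; the property names are from the v1 library as best I recall and the values are illustrative only):

using System;
using DocumentDB.ChangeFeedProcessor;

// Illustrative values only - tune for your workload.
var hostOptions = new ChangeFeedHostOptions
{
    LeaseAcquireInterval = TimeSpan.FromSeconds(13),    // how often a host looks for available (state=1) leases to pick up
    LeaseRenewInterval = TimeSpan.FromSeconds(17),      // how often owned leases are renewed
    LeaseExpirationInterval = TimeSpan.FromSeconds(60), // how long an un-renewed lease is kept before it is considered expired
    FeedPollDelay = TimeSpan.FromSeconds(5)             // delay between change feed polls when no changes are returned
};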


ealsur commented 6 years ago

Thanks @mkolt! @alohaninja Is your App Service using multiple instances? Is the number of instances changing over time (auto-scale or manual scale)?

alohaninja commented 6 years ago

@ealsur I emailed you the subscription details; please share with @mkolt.

Our App Service has (2) fixed S3 instances; we are NOT using auto-scale.

alohaninja commented 6 years ago

@mkolt - the lease partition document never resumes; it remains stuck indefinitely.

The statement "Any such lease will be picked up the next time any processor does a lease acquire, which happens periodically per ChangeFeedOptions.LeaseAcquireInterval" doesn't match what we see. The lease is never renewed/re-acquired.

mkolt commented 6 years ago

@alohaninja can you try using the Change Feed Processor directly (with a different lease collection)? The idea is to see whether the standalone Change Feed Processor hits the same issue.

ealsur commented 6 years ago

@alohaninja When the lease gets lost:

  1. Do the other leases have an Owner, and are the Owner values divided into 2 groups? The idea is that if you have an App Service with 2 instances, the leases should be divided between 2 owners. I want to see whether, when the issue is hit, each owner has the same # of leases or not.
  2. Do the rest of the partitions keep triggering, or does the Function stop completely?

alohaninja commented 6 years ago

@mkolt I'll try to set up the Change Feed Processor directly using a different lease collection; I have not tried that yet.

@ealsur response:

  1. Do the other leases have an Owner, and are the Owner values divided into 2 groups? The idea is that if you have an App Service with 2 instances, the leases should be divided between 2 owners. I want to see whether, when the issue is hit, each owner has the same # of leases or not. _The other leases have Owner values and are typically split across the 25 partitions. It happened again just now: 11 leases are assigned to one owner (1,11,13,14,15,17,18,20,21,23) and 13 to another owner (0,2,3,4,5,6,7,8,9,10,12,16,22). The PartitionId that fails is not always the same; this time it was 24, yesterday it was 2. After resetting the state and ContinuationToken manually, the owner with 11 leases picked up the new lease just fine._

  2. Do the rest of the partitions keep triggering, or does the Function stop completely? The other partitions continue triggering events just fine, so it's a partial failure of the lease for a specific partition (or partitions). This makes it especially hard to notice, because the function still processes events for the other partitions.

Note: when it failed this morning, no RU/s were exhausted, so it doesn't seem related to exceeding capacity - perhaps it's load or a race condition?

ealsur commented 6 years ago

If the other partitions keep working, then this is not the case of the linked issue; that issue halted the entire Observer (no more function executions).

It's a very weird case, since the Observer keeps processing the other PK Ranges.

alohaninja commented 6 years ago

@ealsur I finally got around to setting up a demo app with partition observers using the Change Feed Processor Library sample code, and am waiting to see whether they handle the lease interruptions the same way. They are running now and I hope to report back soon - I'm running just a single change feed event host for each monitored collection.
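
The demo app is wired up roughly like the library sample, i.e. (a sketch with placeholder endpoint/key/database/collection names; the actual sample code differs):

using System;
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents.Client;

static class ChangeFeedDemo
{
    // One change feed event host per monitored collection, as described above.
    public static async Task RunHostAsync()
    {
        var monitoredCollection = new DocumentCollectionInfo
        {
            Uri = new Uri("https://myaccount.documents.azure.com:443/"), // placeholder
            MasterKey = "<key>",                                          // placeholder
            DatabaseName = "mydb",                                        // placeholder
            CollectionName = "monitored"                                  // placeholder
        };

        var leaseCollection = new DocumentCollectionInfo
        {
            Uri = monitoredCollection.Uri,
            MasterKey = monitoredCollection.MasterKey,
            DatabaseName = "mydb",
            CollectionName = "changedemo-leases" // a separate lease collection just for the demo
        };

        var host = new ChangeFeedEventHost(
            "changedemo",                  // host name; shows up as the lease Owner in the demo lease docs
            monitoredCollection,
            leaseCollection,
            new ChangeFeedOptions(),       // MaxItemCount left at its default here
            new ChangeFeedHostOptions());

        await host.RegisterObserverAsync<DocumentFeedObserver>(); // DocumentFeedObserver is the sample observer
        Console.WriteLine("Host running; press Enter to stop.");
        Console.ReadLine();
        await host.UnregisterObserversAsync();
    }
}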

For now, I created another watchdog function app that auto-fixes the broken lease collection partition states, polling every minute using the script previously created. The issue is still happening regularly across multiple collections/function apps. The watchdog also logs to Application Insights so we can better track when the outages occur.

alohaninja commented 6 years ago

@ealsur The demo app is also seeing the lease partition failures - I was able to reproduce this with the Change Feed Processor Library. The error reason it reports, ChangeFeedObserverCloseReason.Unknown, is not useful though, so I am now monitoring the collections using trace output from the Change Feed Processor Library for more details.

Here is the log output the DocumentFeedObserver reported...

Change feed: PartitionId 22 total 699 doc(s)
72ac4b93bcb24e299262166a9be8c864
Observer closed, 22
Reason for shutdown, Unknown
Observer opened for partition Key Range: 22
Observer closed, 22
Reason for shutdown, Unknown

The partition observer closed mid-stream while processing docs for partition 22. It seems it opened and closed the partition like this 8 times until the state was reset. The only reason it resumed is that our watchdog, which runs every minute, reset the invalid state document in the Cosmos lease db.
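
For context, the "Observer opened/closed" lines above come from the sample's DocumentFeedObserver, which looks roughly like the following (a sketch; the real sample logs more detail):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents;

// Minimal observer in the spirit of the sample's DocumentFeedObserver.
class DocumentFeedObserver : IChangeFeedObserver
{
    public Task OpenAsync(ChangeFeedObserverContext context)
    {
        Console.WriteLine("Observer opened for partition Key Range: {0}", context.PartitionKeyRangeId);
        return Task.CompletedTask;
    }

    public Task CloseAsync(ChangeFeedObserverContext context, ChangeFeedObserverCloseReason reason)
    {
        // This is where the "Reason for shutdown, Unknown" lines above come from.
        Console.WriteLine("Observer closed, {0}", context.PartitionKeyRangeId);
        Console.WriteLine("Reason for shutdown, {0}", reason);
        return Task.CompletedTask;
    }

    public Task ProcessChangesAsync(ChangeFeedObserverContext context, IReadOnlyList<Document> docs)
    {
        Console.WriteLine("Change feed: PartitionId {0} total {1} doc(s)", context.PartitionKeyRangeId, docs.Count);
        foreach (Document doc in docs)
        {
            Console.WriteLine(doc.Id);
        }
        return Task.CompletedTask;
    }
}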

ealsur commented 6 years ago

Very interesting @alohaninja - so the issue is not directly related to the Azure Function, but to the underlying library.

@mkolt Are there any extra logging settings that could be added to his sample app to unearth the real reason behind the Unknown? Could it be a Max Batch Size issue?

alohaninja commented 6 years ago

@ealsur here is the exception I captured from the trace output of the Change Feed Processor Library. This error occurs repeatedly in the change feed library until we reset the state of the partition's lease in Cosmos DB.

Lease Event Host Exception

The page size is too big to preserve items that belong to same transaction. Reduce page size and try again.

BadRequestException Stacktrace

[2018/01/22 10:43:19.398] [Error] Partition 7 exception: Microsoft.Azure.Documents.BadRequestException: Message: {"Errors":["The page size is too big to preserve items that belong to same transaction. Reduce page size and try again."]}
ActivityId: a9099787-db3f-405b-8abd-757239864554, Request URI: /apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, RequestStats: 
ResponseTime: 2018-01-22T16:43:19.3455416Z, StoreReadResult: StorePhysicalAddress: rntbd://bn6prdapp03-docdb-1.documents.azure.com:14154/apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, LSN: 306756, GlobalCommittedLsn: 306755, PartitionKeyRangeId: 7, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: False, RequestCharge: 1, ItemLSN: -1, ResourceType: Document, OperationType: ReadFeed
, SDK: Windows/10.0.15063 documentdb-netcore-sdk/1.7.1
   at Microsoft.Azure.Documents.StoreReadResult.ToResponse(RequestChargeTracker requestChargeTracker)
   at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__2`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Client.DocumentClient.<ReadFeedAsync>d__254.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<GetFeedResponseAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<ReadDocumentChangeFeedPrivateAsync>d__15`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass1_0.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass28_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 326
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass28_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 379
[2018/01/22 10:43:19.400] [Information] Releasing lease for partition 7 due to an error, reason: Unknown!
[2018/01/22 10:43:19.408] [Information] Host 'changedemo' successfully removed PartitionId '7' with lease token '"b400eba2-0000-0000-0000-5a66149c0000"' from currently owned partitions.
[2018/01/22 10:43:19.408] [Information] Host 'changedemo' closing event processor for PartitionId '7' and lease token '"b400eba2-0000-0000-0000-5a66149c0000"' with reason 'Unknown'
[2018/01/22 10:43:19.413] [Information] Host 'changedemo' releasing partition 7...
[2018/01/22 10:43:19.414] Observer closed, Unknown
[2018/01/22 10:43:19.414] Reason for shutdown, Unknown
[2018/01/22 10:43:19.415] [Information] Partition 7: worker finished!

mkolt commented 6 years ago

Which ChangeFeedOptions.MaxItemCount do you use?


alohaninja commented 6 years ago

@mkolt - the sample app does not assign any value to ChangeFeedOptions.MaxItemCount, so it's null in our demo app as well. Does that mean an unlimited number of items in the feed?

What is this value set to for Azure change feed triggers? Not being able to assign ChangeFeedOptions for Cosmos function triggers is an issue I previously reported.

In any event, when this error occurs the partition lease is frozen and processes no new events until you reset its state in the Cosmos lease db. This seems to be the root cause of the reported issue.
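
For the standalone demo, the value would presumably be set on the ChangeFeedOptions passed to the event host, something like this (a sketch; the value 25 is arbitrary and the DocumentCollectionInfo parameters come from the earlier bootstrap sketch):

using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents.Client;

static class PageSizeDemo
{
    // monitoredCollection / leaseCollection are the same DocumentCollectionInfo
    // objects built in the earlier bootstrap sketch.
    public static ChangeFeedEventHost BuildHost(
        DocumentCollectionInfo monitoredCollection, DocumentCollectionInfo leaseCollection)
    {
        // Reduce the page size used when reading the change feed (25 is an arbitrary value).
        var feedOptions = new ChangeFeedOptions { MaxItemCount = 25 };

        return new ChangeFeedEventHost(
            "changedemo", monitoredCollection, leaseCollection,
            feedOptions, new ChangeFeedHostOptions());
    }
}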

mkolt commented 6 years ago

@alohaninja - no, this means “use the default”, which is 100. Can you try reducing it and see which value works? A change for the CFP (on GitHub; the Functions integration will follow some time later) to take care of this is coming soon.

ealsur commented 6 years ago

@alohaninja Right now the setting is not configurable in Azure Functions, but that is coming shortly.

alohaninja commented 6 years ago

@mkolt - what do you mean by "Can you try to reduce and see which value?" - do you want me to figure out which value makes it run without crashing? I'm not clear on your intent.

With ChangeFeedOptions.MaxItemCount = 100 it consistently throws the same error: The page size is too big to preserve items that belong to same transaction. Reduce page size and try again.

It seems like the size of the documents in the batch of 100 is overflowing a buffer? Am I just trying to identify how far to reduce this threshold so that it doesn't crash? I'm not sure this will fix anything for Azure Functions, as we can't change the value there yet. Would setting the site config use32BitWorkerProcess=false increase the page size so that it doesn't overflow?

mkolt commented 6 years ago

Yes, that's what I meant. For the standalone Change Feed Processor this is fixed (sync to the current source). For Functions it will take some time, hopefully not very long.

alohaninja commented 6 years ago

@ealsur / @mkolt - I grabbed the latest CFP source and here is the new error output; it's still crashing the lease partition state. You can see where the changes made to ChangeFeedEventHost now generate new output (reducing maxItemCount) instead of the previous page size error.

InternalServerErrorException - An unknown error occurred while processing this request.

[2018/01/24 02:10:24.475] [Warning] Reducing maxItemCount, new value: 50.
[2018/01/24 02:10:29.583] [Warning] Reducing maxItemCount, new value: 25.
[2018/01/24 02:10:34.683] [Warning] Reducing maxItemCount, new value: 12.
[2018/01/24 02:10:39.778] [Warning] Reducing maxItemCount, new value: 6.
[2018/01/24 02:10:44.870] [Warning] Reducing maxItemCount, new value: 3.
[2018/01/24 02:10:49.964] [Warning] Reducing maxItemCount, new value: 1.
[2018/01/24 02:10:55.451] [Error] Partition 24 exception: Microsoft.Azure.Documents.InternalServerErrorException: Message: {"Errors":["An unknown error occurred while processing this request."]}
ActivityId: 0374c881-c656-4e33-9200-00410870eb5a, Request URI: /apps/6628b461-75d4-4e4a-9897-ada4076dc30c/services/1a1bc4d9-3744-42ea-beec-00c84a423ce7/partitions/dc2dfff7-29ae-4e22-a0ed-185dffe099ea/replicas/131576284970258222s/, RequestStats: 
ResponseTime: 2018-01-24T20:10:55.1040258Z, StoreReadResult: StorePhysicalAddress: rntbd://bn6prdddc05-docdb-1.documents.azure.com:18884/apps/6628b461-75d4-4e4a-9897-ada4076dc30c/services/1a1bc4d9-3744-42ea-beec-00c84a423ce7/partitions/dc2dfff7-29ae-4e22-a0ed-185dffe099ea/replicas/131576284970258222s/, LSN: 841525, GlobalCommittedLsn: 841524, PartitionKeyRangeId: 24, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: False, RequestCharge: 1, ItemLSN: -1, ResourceType: Document, OperationType: ReadFeed
, SDK: Windows/10.0.15063 documentdb-netcore-sdk/1.7.1
   at Microsoft.Azure.Documents.StoreReadResult.ToResponse(RequestChargeTracker requestChargeTracker)
   at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__2`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Client.DocumentClient.<ReadFeedAsync>d__254.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<GetFeedResponseAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<ReadDocumentChangeFeedPrivateAsync>d__15`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass1_0.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 327
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 393
[2018/01/24 02:10:55.468] [Information] Releasing lease for partition 24 due to an error, reason: Unknown!
[2018/01/24 02:10:55.489] [Information] Host 'changedemo' successfully removed PartitionId '24' with lease token '"11006070-0000-0000-0000-5a68e83f0000"' from currently owned partitions.
[2018/01/24 02:10:55.489] [Information] Host 'changedemo' closing event processor for PartitionId '24' and lease token '"11006070-0000-0000-0000-5a68e83f0000"' with reason 'Unknown'
[2018/01/24 02:10:55.515] [Information] Host 'changedemo' releasing partition 24...
[2018/01/24 02:10:55.517] Observer closed, 24
[2018/01/24 02:10:55.517] Reason for shutdown, Unknown

mkolt commented 6 years ago

Please try again. This is a slightly different issue: with maxItemCount=1 there is a long transaction that doesn't fit into the max response size. That must come from a script modifying lots of big documents as part of one transaction, and currently the continuation token cannot address a part of a transaction. Support for that is coming soon.
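
For context, the processor reads each partition key range's feed in pages of up to MaxItemCount documents using a per-range continuation token, roughly like this (a sketch against the DocumentDB .NET SDK, not the processor's actual code; names are placeholders):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

static class FeedReader
{
    // Reads one page of changes for a single partition key range. The continuation
    // token returned here is what gets checkpointed into the lease's ContinuationToken;
    // as noted above, it currently cannot point into the middle of a transaction, so a
    // transaction larger than one response cannot be split across pages.
    public static async Task<string> ReadOnePageAsync(
        DocumentClient client, string collectionLink, string partitionKeyRangeId, string continuation)
    {
        var query = client.CreateDocumentChangeFeedQuery(collectionLink, new ChangeFeedOptions
        {
            PartitionKeyRangeId = partitionKeyRangeId,     // e.g. "7"
            RequestContinuation = continuation,            // e.g. "245895" from the lease document
            StartFromBeginning = continuation == null,
            MaxItemCount = 100                             // the page size being discussed
        });

        FeedResponse<Document> page = await query.ExecuteNextAsync<Document>();
        Console.WriteLine("Read {0} doc(s) from partition key range {1}", page.Count, partitionKeyRangeId);
        return page.ResponseContinuation;
    }
}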

mkolt commented 6 years ago

Looks like it's fine now.

alohaninja commented 6 years ago

@mkolt it looks like I'm still seeing this error: An unknown error occurred while processing this request. We do use stored procedures to bulk-update documents - is that likely the cause? I'm not sure where else we would have a large transaction of documents being changed all at once.

mkolt commented 6 years ago

@alohaninja looks like the partition that hit that again was able to move on. Yes, that's coming from the script. Let us know if you hit this again.

alohaninja commented 6 years ago

@mkolt - the partition only moves on because we manually update the state and null the ContinuationToken in the lease db collection. It will stay stuck at the ContinuationToken because the CFP is unable to move forward.

mkolt commented 6 years ago

@alohaninja Can you go back to the problem state and try again?

alohaninja commented 6 years ago

@mkolt - I turned off our watchdog that auto-updates the state, and it looks like the CFP now continues from the previous reset point, updating state=1 (unhealthy) to state=2 (healthy). That looks like progress - previously, no docs flowed through the change feed once a lease hit state=1. The ContinuationToken also updates as the change feed processes past the state transition.

Lease Doc Reset

{
    "state": 1,
    "PartitionId": "24",
    "Owner": null,
    "ContinuationToken": "\"842525\"",
    "SequenceNumber": 1880
}

Current Lease Doc

{
    "state": 2,
    "PartitionId": "24",
    "Owner": "changedemo",
    "ContinuationToken": "\"843154\"",
    "SequenceNumber": 1889
}

I'll leave our watchdog ON and just reduce its polling frequency to see whether this is fixed. Nice job!

alohaninja commented 6 years ago

@mkolt I'm seeing this issue come back consistently today with one of our monitored collections. Something seems to be getting stuck again: our watchdog (now running every 15 min instead of every 1 min) is catching and correcting it. Is this related to a large transaction?

InternalServerErrorException - An unknown error occurred while processing this request

[2018/01/29 01:47:44.027] [Verbose] Worker task waiting for start signal: partition '7'
[2018/01/29 01:47:44.027] [Verbose] Worker task started: partition '7'
[2018/01/29 01:47:44.027] Observer opened for partition Key Range: 7
[2018/01/29 01:47:44.027] [Verbose] Worker start: partition '7', continuation '"309935"'
[2018/01/29 01:47:44.142] [Warning] Reducing maxItemCount, new value: 50.
[2018/01/29 01:47:49.257] [Warning] Reducing maxItemCount, new value: 25.
[2018/01/29 01:47:54.384] [Warning] Reducing maxItemCount, new value: 12.
[2018/01/29 01:47:59.497] [Warning] Reducing maxItemCount, new value: 6.
[2018/01/29 01:48:04.668] [Warning] Reducing maxItemCount, new value: 3.
[2018/01/29 01:48:09.782] [Warning] Reducing maxItemCount, new value: 1.
[2018/01/29 01:48:14.893] [Error] Partition 7 exception: Microsoft.Azure.Documents.InternalServerErrorException: Message: {"Errors":["An unknown error occurred while processing this request."]}
ActivityId: 9e5505d7-473f-4d85-8d4b-0aec95ebb01f, Request URI: /apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, RequestStats: 
ResponseTime: 2018-01-29T19:48:14.8828306Z, StoreReadResult: StorePhysicalAddress: rntbd://bn6prdapp03-docdb-1.documents.azure.com:14154/apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, LSN: 310020, GlobalCommittedLsn: 310019, PartitionKeyRangeId: 7, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: False, RequestCharge: 1, ItemLSN: -1, ResourceType: Document, OperationType: ReadFeed
, SDK: Windows/10.0.16299 documentdb-netcore-sdk/1.7.1
   at Microsoft.Azure.Documents.StoreReadResult.ToResponse(RequestChargeTracker requestChargeTracker)
   at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__2`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Client.DocumentClient.<ReadFeedAsync>d__254.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<GetFeedResponseAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<ReadDocumentChangeFeedPrivateAsync>d__15`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass1_0.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 327
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 327
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 393
[2018/01/29 01:48:14.893] [Information] Releasing lease for partition 7 due to an error, reason: Unknown!
[2018/01/29 01:48:14.893] [Information] Host 'changedemo' successfully removed PartitionId '7' with lease token '"3b002cdb-0000-0000-0000-5a6f7a700000"' from currently owned partitions.
[2018/01/29 01:48:14.893] [Information] Host 'changedemo' closing event processor for PartitionId '7' and lease token '"3b002cdb-0000-0000-0000-5a6f7a700000"' with reason 'Unknown'
[2018/01/29 01:48:14.893] [Information] Host 'changedemo' releasing partition 7...
[2018/01/29 01:48:14.893] Observer closed, 7
[2018/01/29 01:48:14.893] Reason for shutdown, Unknown
[2018/01/29 01:48:14.894] [Information] Partition 7: worker finished!
mkolt commented 6 years ago

@alohaninja Yes, large transaction, different account (now test). Should be mitigated now.

alohaninja commented 6 years ago

Still seeing the issue daily where the lease needs to be reset; our watchdog script (sketched below) is still resetting invalid lease states, likely due to a large transaction update. I will be refactoring the large transaction to run outside a stored procedure to see if this improves things until we get large-transaction support in CFP.

The stored procedure is exhausting server-side resources: it occasionally receives "the script with id 'xxxxx is blocked for execution because it has violated its allowed resource limit several times." It needs to be redesigned as a client-side script rather than being called repeatedly with continuation tokens, as it takes too long to execute within the server-side resource limits.
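
For reference, a minimal sketch of what such a lease watchdog could look like, assuming the @azure/cosmos JavaScript SDK, a lease collection partitioned on `/id`, and the "stuck" pattern of `state=1` with no `Owner`. The function name, environment variables, and database/collection names are hypothetical, not part of the original thread:

```js
// Hypothetical lease watchdog: find lease documents stuck in state=1 with no
// Owner and reset them so a host instance can re-acquire the partition.
const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT, // assumed env vars
  key: process.env.COSMOS_KEY
});

async function resetStuckLeases(databaseId, leaseContainerId) {
  const container = client.database(databaseId).container(leaseContainerId);

  // Query for lease documents matching the broken pattern.
  const { resources: stuckLeases } = await container.items
    .query("SELECT * FROM c WHERE c.state = 1 AND IS_NULL(c.Owner)")
    .fetchAll();

  for (const lease of stuckLeases) {
    // Mark the lease healthy and clear the continuation token; the CFP will
    // write a fresh token once it picks the partition back up.
    lease.state = 2;
    lease.ContinuationToken = null;
    // Assumes the lease collection is partitioned on /id (or non-partitioned).
    await container.item(lease.id, lease.id).replace(lease);
    console.log(`Reset lease for partition ${lease.PartitionId}`);
  }
}

resetStuckLeases("mydb", "leases").catch(console.error);
```

Run on a schedule (for example from a timer-triggered function) this mirrors the manual reset workaround, at the cost of possibly reprocessing changes for the affected partition.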

mkolt commented 6 years ago

There don't seem to be any signals of stuck partitions during last 3 days. We can take a quick look at your sproc, if you want. Share via email.

alohaninja commented 6 years ago

@mkolt I just shared the sproc via email; I hope you can identify any issues we have.

I tried refactoring the large transaction (a server-side sproc that performs bulk updates) to run client-side, but it takes 2-3x as long (updating 6k 100KB docs takes 5-6 min server-side versus 15-20 min client-side). Unfortunately there is no way to do partial document updates yet, so the client-side approach pays a penalty for retrieving and replacing the entire document each time, adding a proportionate amount of latency with each client/server round trip.
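
For illustration, a minimal sketch of the client-side read-modify-replace pattern being described, assuming the @azure/cosmos JavaScript SDK and a container partitioned on `/pk`. The query filter, field names, and `bulkUpdate` helper are hypothetical stand-ins for the real update logic:

```js
// Hypothetical client-side bulk update. With no partial document updates
// available, each document is read in full, mutated, and fully replaced,
// which is the source of the extra round-trip latency mentioned above.
const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT, // assumed env vars
  key: process.env.COSMOS_KEY
});

async function bulkUpdate(databaseId, containerId) {
  const container = client.database(databaseId).container(containerId);

  // Page through the documents that need updating instead of loading them all.
  const iterator = container.items.query(
    "SELECT * FROM c WHERE c.needsUpdate = true", // hypothetical filter
    { maxItemCount: 100 }
  );

  while (iterator.hasMoreResults()) {
    const { resources: page } = await iterator.fetchNext();
    for (const doc of page) {
      doc.needsUpdate = false; // the real mutation would go here
      // Full replace of the (potentially ~100KB) document; 429s are handled
      // by the SDK's default retry policy.
      await container.item(doc.id, doc.pk).replace(doc);
    }
  }
}

bulkUpdate("mydb", "docs").catch(console.error);
```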

I can consistently make the CFP fail when sprocs exceed their bounded execution or process a large batch of bulk updates, presumably related to larger document sizes. Until this is remedied, we have to keep our lease watchdog actively fixing broken partition leases.

syedhassaanahmed commented 6 years ago

@ealsur Not sure if it's related to this issue, but in my case the change feed doesn't get triggered the first time until I manually visit the Function App -> Change Feed Function -> function.json -> Logs. The change feed is being observed on a partitioned collection with 25k RU/s. The function is hosted on a Consumption plan with a single lease (leases collection: 400 RU/s). Once it gets triggered it works fine afterwards, but the first time I always have to navigate to the Logs as mentioned above.

The behavior is not tied to a specific collection or Cosmos DB account, but it is always reproducible by deploying this ARM template and running these load tests. The trigger is configured in code here.

ealsur commented 6 years ago

@syedhassaanahmed please open a different issue; this is not related to sprocs. The Consumption plan is not creating instances for your Function App (that is why it's not triggering). When you open the Functions Portal, it forces the creation of one instance if there are none, and that is when it works.

syedhassaanahmed commented 6 years ago

Thanks @ealsur #2549

pvmraghunandan commented 6 years ago

Any update? Also, is there any guidance for someone starting fresh on how to avoid these issues?

ealsur commented 6 years ago

There is a PR that updates the CFP Library to a version that can handle batch size better for these scenarios. https://github.com/Azure/azure-webjobs-sdk-extensions/pull/410

pvmraghunandan commented 6 years ago

Thanks. Any idea when the next version of the NuGet package will be released?

ealsur commented 6 years ago

Since the fix is targeting V2, I don't know the release schedule. Maybe @paulbatum can shed more light on what could be an estimated time frame.

paulbatum commented 6 years ago

Since this PR is still pending review, I would guess the ETA for it to be merged and then deployed to production is around mid-June.

ryancrawcour commented 6 years ago

@ealsur you say there is a fix targeting V2; what about customers running V1 Functions who are running into this problem? What's the mitigation for them?

ealsur commented 6 years ago

@ryancrawcour V1 dependencies cannot be upgraded further due to a couple of hard-locked dependencies on Functions V1 (Newtonsoft.Json & DocumentDB SDK).

ryancrawcour commented 6 years ago

What's the solution for customers running on V1?