Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License

Cosmos Change Feed Input Trigger - Lease Stops Sending Triggers #2259

Closed alohaninja closed 6 years ago

alohaninja commented 6 years ago

Repro steps

After our change feeds have been running for a while, they intermittently stop working. It may be related to RU/s being exhausted, but there is no recovery process. 429s were logged in the monitored DB while it was under occasional load.

Expected behavior

Even when the monitored DB is under load or returning 429s, the change feed lease should never get into a hung state; if a lease does stall, it should recover on its own. The input trigger should never stop firing.

Actual behavior

The input trigger stops firing and no new change feed docs are processed despite there still being traffic/activity in the lease db collection.

Known workarounds

The only workaround I've been able to identify is to delete the lease collection and restart the function host. Restarting the function host recreates the lease collection (assuming createLeaseCollectionIfNotExists=true in function.json) and input triggers begin flowing again. This is a weak solution, as we lose any change feed activity from the time the feed stopped responding.
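
For context, the trigger binding in function.json looks roughly like this (a sketch: the database, collection, and connection-setting names here are placeholders, and the exact set of binding properties may differ from what we actually have):

{
    "type": "cosmosDBTrigger",
    "name": "documents",
    "direction": "in",
    "connectionStringSetting": "CosmosDBConnection",
    "databaseName": "mydb",
    "collectionName": "monitored",
    "leaseCollectionName": "leases",
    "createLeaseCollectionIfNotExists": true
}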

Related information

Language: JavaScript

AzureWebJobsHostLogs: stops recording trigger events when the change feed is hung/expired, even though change feed events are occurring and the lease collection has activity. The function host simply stops checking for new events, or is hung. Restarting the function app has no effect.

Lease Collection Partition State

Upon further inspection of the lease collection, all other partition documents have state=2 and a valid Owner guid. One partition document has no Owner and has state=1.

Invalid Lease Collection Document
{
    "state": 1,
    "PartitionId": "7",
    "Owner": null,
    "ContinuationToken": "\"245895\"",
    "SequenceNumber": 4726
}

Other partitions are sending triggers to the function app; only this partition stuck in state=1 is not. The issue appears to be related to a partition's lease state being stuck.

alohaninja commented 6 years ago

Known workarounds (Option 2)

I found another workaround that doesn't require re-creating the entire lease collection. Identify the documents in your lease collection that have state=1 and Owner=null, and set ContinuationToken=null. Here are the before/after documents and the final state once the lease recovered from the errored state. I'm unsure why the ContinuationToken gets out of whack.

Lease Collection Document (Error State)

The partition's change feed trigger stops sending input triggers to the Azure Function; state=1 and Owner=null.

{
    "state": 1,
    "PartitionId": "7",
    "Owner": null,
    "ContinuationToken": "\"245895\"",
    "SequenceNumber": 4726
}

Lease Collection Document (Modified State)

Update the state=2 (healthy) and ContinuationToken=null.

{
    "state": 2,
    "PartitionId": "7",
    "Owner": null,
    "ContinuationToken": null,
    "SequenceNumber": 4726
}

Lease Collection Document (Healthy State)

ContinuationToken=<new_value>, Owner=<new_guid> and SequenceNumber increments.

{
    "state": 2,
    "PartitionId": "7",
        "Owner": "e40a9970-6f13-4fe3-9581-7e5306d982e5",
    "ContinuationToken": "\"246479\"",
    "SequenceNumber": 4727
}

After modifying the lease collection partition document, change feed events are no longer hung. Hoping someone can shed some light on why this happens; we've seen it with numerous change feed functions we are using.

The downside is that any events that occurred while the lease was hung are lost (since we reset the continuation token to get back to a healthy state). This is a major reliability hurdle for us (and the 2nd such critical issue we've identified with cosmos triggers).

We have numerous change feeds (10+) for various eventing, and ALL of them failed at the same ContinuationToken for the same partition. We had to perform the above workaround 10 times, once for each lease collection.
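
For anyone else hitting this, the reset amounts to something like the following (a rough sketch against the .NET DocumentDB SDK rather than our actual script; the LeaseDoc shape is inferred from the lease documents above and the database/collection names are placeholders):

using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Newtonsoft.Json;

// Shape inferred from the lease documents shown above.
class LeaseDoc : Resource
{
    [JsonProperty("state")] public int State { get; set; }
    [JsonProperty("PartitionId")] public string PartitionId { get; set; }
    [JsonProperty("Owner")] public string Owner { get; set; }
    [JsonProperty("ContinuationToken")] public string ContinuationToken { get; set; }
    [JsonProperty("SequenceNumber")] public long SequenceNumber { get; set; }
}

static class LeaseResetter
{
    // Find leases stuck in state=1 with no Owner and reset them to state=2 with a
    // null ContinuationToken (the manual workaround described above). Note that
    // nulling the token means any changes that occurred while hung are skipped.
    public static async Task ResetStuckLeasesAsync(DocumentClient client, string databaseId, string leaseCollectionId)
    {
        Uri leaseCollectionUri = UriFactory.CreateDocumentCollectionUri(databaseId, leaseCollectionId);

        var stuckLeases = client.CreateDocumentQuery<LeaseDoc>(leaseCollectionUri)
            .Where(l => l.State == 1 && l.Owner == null)
            .AsEnumerable()
            .ToList();

        foreach (LeaseDoc lease in stuckLeases)
        {
            Console.WriteLine($"Resetting lease for PartitionId {lease.PartitionId}");
            lease.State = 2;
            lease.ContinuationToken = null;
            await client.ReplaceDocumentAsync(lease.SelfLink, lease);
        }
    }
}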

alohaninja commented 6 years ago

From a discovery perspective, we noticed that when you disable the Azure Function (set disabled=true in function.json), all documents in the lease collection change to state=1 and Owner=null (the function host has stopped).

Wondering if this hang has something to do with the lease being paused and unable to resume after the function has been disabled and re-enabled. The odd thing is that, even after the function has been re-enabled, some partitions in the same lease collection remain stuck (state=1) while others do not (state=2).

christopheranderson commented 6 years ago

@ealsur - this probably needs to be in the extensions repo that has cosmos db binding in it. Could you investigate?

ealsur commented 6 years ago

@alohaninja thanks for the detailed information. A couple more questions:

  1. Is this behavior over 1 collection?
  2. Do you have more than 1 Function monitoring the same collection? If so, does each Function have its own lease collection, or are they sharing the lease collection as well?
  3. Does the hang always occur on the same PartitionId (the attribute in the lease)?
  4. Is it running in App Service mode or on the Consumption Plan?

The Trigger is implemented on top of the Change Feed Processor Library; according to its source code, state 1 means Available to be picked up. The Library will attempt to pick up available leases automatically.

The Library reads the Change Feed, which consumes RUs; if the collection is being rate limited, the Feed reads might be receiving 429s for a particular partition range (in the case of hot partitions).

alohaninja commented 6 years ago

@ealsur here are the answers to your questions. We are continuing to see business interruptions with this design: it appears that every time we exhaust RU/s, 1-2 partitions get stuck and stop processing change feed responses. At this point I'm not 100% certain it's related to RU/s, but I can't find anything else that would be causing invalid leases. It would be nice to be able to review the logs of the Change Feed Processor Library in use.

Change Feed Feedback

  1. Is this behavior over 1 collection? Yes - we have multiple change feeds monitoring the same collection. Each feed/function seems to fail on the same partition when it gets stuck at state=1.
  2. Do you have more than 1 Function monitoring the same collection? If so, does each Function have its own lease collection, or are they sharing the lease collection as well? We currently have (8) functions monitoring the same Cosmos collection; each function must have its own lease collection. This is also happening across multiple slots, which we use for testing in a separate function app and Cosmos instance.
  3. Does the hang always occur on the same PartitionId (the attribute in the lease)? Within a single outage, it does appear that the hang occurs on the same PartitionId across all change feeds. However, the ContinuationToken is not always the same; it's slightly different. Subsequent outages are not on the same PartitionId - it fluctuates between outage instances.
  4. Is it running in App Service mode or on the Consumption Plan? We are using App Service mode (S3). We see this issue more in our DEV slots, which have a much lower RU/s threshold and tend to go over capacity more frequently.

Workaround Status

Because this is happening almost daily (it happened once already this morning), we've written a script that searches all (8) lease collections (across the 3 Cosmos instances for our test environments) and updates the invalid lease state so that messages are processed again. I haven't scheduled it to run automatically yet, as I'm trying to correlate and track the outages, but I may have to if we can't get this fixed soon. We're spending a lot of time watching the change feed leases to make sure they are properly processing messages.

Perhaps this issue is related to an existing race condition in the Cosmos change feed library and isn't related to RU/s at all?

ealsur commented 6 years ago

@alohaninja Could you provide some account information, like the Subscription ID, account name, and the collection where you are experiencing this? The lease collection name too, if possible. I want to see if I can access any logs or metrics on our side.

In the Portal metrics, under the Throughput tab, do you see 429s in the collection being monitored?

It is weird that all your Function instances are stopping at the same time. If this were the case of the linked issue, changing the lease state would not work, since the Observer is stopped.

@mkolt Could the Processor be dropping the lease (the one for the partition that seems to be high on RUs), changing its state to 1 (Available), removing the Owner, and never picking it back up?

alohaninja commented 6 years ago

@ealsur There are usually 429s in the collection, but I'm not convinced that's the cause - I suspect it could be related to load, but I don't yet have an alert that fires when it happens to correlate the timing. The timestamp on the document in the lease database continues to be updated even when it is in the hung state (state=1), so I don't know exactly when it goes offline.

Do you have an email I can send you the subId/account/collection details? Prefer not to make it public.

ealsur commented 6 years ago

@alohaninja Sure thing: maquaran{at}microsoft.com

mkolt commented 6 years ago

The lease could be dropped (owner set to empty, state = available) for multiple reasons; the lease can also expire if one of the processors dies, etc. All of this is normal. Any such lease will be picked up the next time any processor does a lease acquire, which happens periodically per ChangeFeedOptions.LeaseAcquireInterval.
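
For reference, the related timing knobs can be configured when using the library directly, roughly as below (a sketch; the property names are from the v1 library as best I recall and the values are illustrative only):

using System;
using DocumentDB.ChangeFeedProcessor;

// Illustrative values only - tune for your workload.
var hostOptions = new ChangeFeedHostOptions
{
    LeaseAcquireInterval = TimeSpan.FromSeconds(13),    // how often a host looks for available (state=1) leases to pick up
    LeaseRenewInterval = TimeSpan.FromSeconds(17),      // how often owned leases are renewed
    LeaseExpirationInterval = TimeSpan.FromSeconds(60), // how long an un-renewed lease is kept before it is considered expired
    FeedPollDelay = TimeSpan.FromSeconds(5)             // delay between change feed polls when no changes are returned
};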


ealsur commented 6 years ago

Thanks @mkolt! @alohaninja Is your App Service using multiple instances? Is the number of instances changing over time (auto-scale or manual scale)?

alohaninja commented 6 years ago

@ealsur I emailed you the subscription details; please share with @mkolt.

Our App Service has (2) fixed S3 instances; we are NOT using auto-scale.

alohaninja commented 6 years ago

@mkolt - the lease partition document never resumes; it remains stuck indefinitely.

The statement "Any such lease will be picked up the next time any processor does a lease acquire, which happens periodically per ChangeFeedOptions.LeaseAcquireInterval" doesn't match what we see. The lease is never renewed/re-acquired.

mkolt commented 6 years ago

@alohaninja can you try using the Change Feed Processor directly (with a different lease collection)? The idea is to see whether the standalone Change Feed Processor hits the same issue.

ealsur commented 6 years ago

@alohaninja When the lease gets lost:

  1. Do the other leases have an Owner, and are the Owner values divided into 2 groups? The idea is that if you have an App Service with 2 instances, the leases should be divided between 2 owners. I want to see whether, when the issue is hit, each owner has the same # of leases or not.
  2. Do the rest of the partitions keep triggering, or does the Function stop completely?

alohaninja commented 6 years ago

@mkolt I'll try to set up the Change Feed Processor directly using a different lease collection; I have not tried that yet.

@ealsur response:

  1. Do the other leases have an Owner, and are the Owner values divided into 2 groups? The idea is that if you have an App Service with 2 instances, the leases should be divided between 2 owners. I want to see whether, when the issue is hit, each owner has the same # of leases or not. _The other leases have Owner values and are typically split across the 25 partitions. It happened again just now: 11 leases are assigned to one owner (1,11,13,14,15,17,18,20,21,23) and 13 to another owner (0,2,3,4,5,6,7,8,9,10,12,16,22). The PartitionId that fails is not always the same; this time it was 24, yesterday it was 2. After resetting the state and ContinuationToken manually, the owner with 11 leases picked up the new lease just fine._

  2. Do the rest of the partitions keep triggering, or does the Function stop completely? The other partitions continue triggering events just fine, so it's a partial failure of the lease for a specific partition (or partitions). This makes it especially hard to notice, because the function still processes events for the other partitions.

Note: when it failed this morning, no RU/s were exhausted, so it doesn't seem related to exceeding capacity - perhaps it's load or a race condition?

ealsur commented 6 years ago

If the other partitions keep working, then this is not the case of the linked issue; that issue halted the entire Observer (no more function executions).

It's a very weird case, since the Observer keeps processing the other PK Ranges.

alohaninja commented 6 years ago

@ealsur I finally got around to setting up a demo app with partition observers using the Change Feed Processor Library sample code, and am waiting to see whether they handle the lease interruptions the same way. They are running now and I hope to report back soon - I'm running just a single change feed event host for each monitored collection.
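
The demo app is wired up roughly like the library sample, i.e. (a sketch with placeholder endpoint/key/database/collection names; the actual sample code differs):

using System;
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents.Client;

static class ChangeFeedDemo
{
    // One change feed event host per monitored collection, as described above.
    public static async Task RunHostAsync()
    {
        var monitoredCollection = new DocumentCollectionInfo
        {
            Uri = new Uri("https://myaccount.documents.azure.com:443/"), // placeholder
            MasterKey = "<key>",                                          // placeholder
            DatabaseName = "mydb",                                        // placeholder
            CollectionName = "monitored"                                  // placeholder
        };

        var leaseCollection = new DocumentCollectionInfo
        {
            Uri = monitoredCollection.Uri,
            MasterKey = monitoredCollection.MasterKey,
            DatabaseName = "mydb",
            CollectionName = "changedemo-leases" // a separate lease collection just for the demo
        };

        var host = new ChangeFeedEventHost(
            "changedemo",                  // host name; shows up as the lease Owner in the demo lease docs
            monitoredCollection,
            leaseCollection,
            new ChangeFeedOptions(),       // MaxItemCount left at its default here
            new ChangeFeedHostOptions());

        await host.RegisterObserverAsync<DocumentFeedObserver>(); // DocumentFeedObserver is the sample observer
        Console.WriteLine("Host running; press Enter to stop.");
        Console.ReadLine();
        await host.UnregisterObserversAsync();
    }
}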

For now, I created another watchdog function app that auto-fixes the broken lease collection partition states, polling every minute using the script previously created. The issue is still happening regularly across multiple collections/function apps. The watchdog also logs to Application Insights so we can better track when the outages occur.

alohaninja commented 6 years ago

@ealsur The demo app is also seeing the lease partition failures - I was able to reproduce this with the Change Feed Processor Library. The error reason it reports, ChangeFeedObserverCloseReason.Unknown, is not useful though, so I am now monitoring the collections using trace output from the Change Feed Processor Library for more details.

Here is the log output the DocumentFeedObserver reported...

Change feed: PartitionId 22 total 699 doc(s)
72ac4b93bcb24e299262166a9be8c864
Observer closed, 22
Reason for shutdown, Unknown
Observer opened for partition Key Range: 22
Observer closed, 22
Reason for shutdown, Unknown

The partition observer closed mid-stream while processing docs for partition 22. It seems it opened and closed the partition like this 8 times until the state was reset. The only reason it resumed is that our watchdog, which runs every minute, reset the invalid state document in the Cosmos lease db.
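
For context, the "Observer opened/closed" lines above come from the sample's DocumentFeedObserver, which looks roughly like the following (a sketch; the real sample logs more detail):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents;

// Minimal observer in the spirit of the sample's DocumentFeedObserver.
class DocumentFeedObserver : IChangeFeedObserver
{
    public Task OpenAsync(ChangeFeedObserverContext context)
    {
        Console.WriteLine("Observer opened for partition Key Range: {0}", context.PartitionKeyRangeId);
        return Task.CompletedTask;
    }

    public Task CloseAsync(ChangeFeedObserverContext context, ChangeFeedObserverCloseReason reason)
    {
        // This is where the "Reason for shutdown, Unknown" lines above come from.
        Console.WriteLine("Observer closed, {0}", context.PartitionKeyRangeId);
        Console.WriteLine("Reason for shutdown, {0}", reason);
        return Task.CompletedTask;
    }

    public Task ProcessChangesAsync(ChangeFeedObserverContext context, IReadOnlyList<Document> docs)
    {
        Console.WriteLine("Change feed: PartitionId {0} total {1} doc(s)", context.PartitionKeyRangeId, docs.Count);
        foreach (Document doc in docs)
        {
            Console.WriteLine(doc.Id);
        }
        return Task.CompletedTask;
    }
}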

ealsur commented 6 years ago

Very interesting @alohaninja - so the issue is not directly related to the Azure Function, but to the underlying library.

@mkolt Are there any extra logging settings that could be added to his sample app to unearth the real reason behind the Unknown? Could it be a Max Batch Size issue?

alohaninja commented 6 years ago

@ealsur here is the exception I captured from the trace output of the Change Feed Processor Library. This error occurs repeatedly in the change feed library until we reset the state of the partition's lease in Cosmos DB.

Lease Event Host Exception

The page size is too big to preserve items that belong to same transaction. Reduce page size and try again.

BadRequestException Stacktrace

[2018/01/22 10:43:19.398] [Error] Partition 7 exception: Microsoft.Azure.Documents.BadRequestException: Message: {"Errors":["The page size is too big to preserve items that belong to same transaction. Reduce page size and try again."]}
ActivityId: a9099787-db3f-405b-8abd-757239864554, Request URI: /apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, RequestStats: 
ResponseTime: 2018-01-22T16:43:19.3455416Z, StoreReadResult: StorePhysicalAddress: rntbd://bn6prdapp03-docdb-1.documents.azure.com:14154/apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, LSN: 306756, GlobalCommittedLsn: 306755, PartitionKeyRangeId: 7, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: False, RequestCharge: 1, ItemLSN: -1, ResourceType: Document, OperationType: ReadFeed
, SDK: Windows/10.0.15063 documentdb-netcore-sdk/1.7.1
   at Microsoft.Azure.Documents.StoreReadResult.ToResponse(RequestChargeTracker requestChargeTracker)
   at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__2`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Client.DocumentClient.<ReadFeedAsync>d__254.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<GetFeedResponseAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<ReadDocumentChangeFeedPrivateAsync>d__15`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass1_0.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass28_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 326
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass28_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 379
[2018/01/22 10:43:19.400] [Information] Releasing lease for partition 7 due to an error, reason: Unknown!
[2018/01/22 10:43:19.408] [Information] Host 'changedemo' successfully removed PartitionId '7' with lease token '"b400eba2-0000-0000-0000-5a66149c0000"' from currently owned partitions.
[2018/01/22 10:43:19.408] [Information] Host 'changedemo' closing event processor for PartitionId '7' and lease token '"b400eba2-0000-0000-0000-5a66149c0000"' with reason 'Unknown'
[2018/01/22 10:43:19.413] [Information] Host 'changedemo' releasing partition 7...
[2018/01/22 10:43:19.414] Observer closed, Unknown
[2018/01/22 10:43:19.414] Reason for shutdown, Unknown
[2018/01/22 10:43:19.415] [Information] Partition 7: worker finished!

mkolt commented 6 years ago

Which ChangeFeedOptions.MaxItemCount do you use?


alohaninja commented 6 years ago

@mkolt - the sample app does not assign any value to ChangeFeedOptions.MaxItemCount, so it's null in our demo app as well. Does that mean an unlimited number of items in the feed?

What is this value set to for Azure change feed triggers? Not being able to assign ChangeFeedOptions for Cosmos function triggers is an issue I previously reported.

In any event, when this error occurs the partition lease is frozen and processes no new events until you reset its state in the Cosmos lease db. This seems to be the root cause of the reported issue.
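
For the standalone demo, the value would presumably be set on the ChangeFeedOptions passed to the event host, something like this (a sketch; the value 25 is arbitrary and the DocumentCollectionInfo parameters come from the earlier bootstrap sketch):

using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents.Client;

static class PageSizeDemo
{
    // monitoredCollection / leaseCollection are the same DocumentCollectionInfo
    // objects built in the earlier bootstrap sketch.
    public static ChangeFeedEventHost BuildHost(
        DocumentCollectionInfo monitoredCollection, DocumentCollectionInfo leaseCollection)
    {
        // Reduce the page size used when reading the change feed (25 is an arbitrary value).
        var feedOptions = new ChangeFeedOptions { MaxItemCount = 25 };

        return new ChangeFeedEventHost(
            "changedemo", monitoredCollection, leaseCollection,
            feedOptions, new ChangeFeedHostOptions());
    }
}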

mkolt commented 6 years ago

@alohaninja - no, this means “use the default”, which is 100. Can you try reducing it and see which value works? A change for the CFP (on GitHub; the Functions integration will follow some time later) to take care of this is coming soon.

ealsur commented 6 years ago

@alohaninja Right now the setting is not configurable in Azure Functions, but that is coming shortly.

alohaninja commented 6 years ago

@mkolt - what do you mean by "Can you try to reduce and see which value?" - do you want me to figure out which value makes it run without crashing? I'm not clear on your intent.

With ChangeFeedOptions.MaxItemCount = 100 it consistently throws the same error: The page size is too big to preserve items that belong to same transaction. Reduce page size and try again.

It seems like the size of the documents in the batch of 100 is overflowing a buffer? Am I just trying to identify how far to reduce this threshold so that it doesn't crash? I'm not sure this will fix anything for Azure Functions, as we can't change the value there yet. Would setting the site config use32BitWorkerProcess=false increase the page size so that it doesn't overflow?

mkolt commented 6 years ago

Yes, that's what I meant. For the standalone Change Feed Processor this is fixed (sync to the current source). For Functions it will take some time, hopefully not very long.

alohaninja commented 6 years ago

@ealsur / @mkolt - I grabbed the latest CFP source and here is the new error output; it's still crashing the lease partition state. You can see where the changes made to ChangeFeedEventHost now generate new output (reducing maxItemCount) instead of the previous page size error.

InternalServerErrorException - An unknown error occurred while processing this request.

[2018/01/24 02:10:24.475] [Warning] Reducing maxItemCount, new value: 50.
[2018/01/24 02:10:29.583] [Warning] Reducing maxItemCount, new value: 25.
[2018/01/24 02:10:34.683] [Warning] Reducing maxItemCount, new value: 12.
[2018/01/24 02:10:39.778] [Warning] Reducing maxItemCount, new value: 6.
[2018/01/24 02:10:44.870] [Warning] Reducing maxItemCount, new value: 3.
[2018/01/24 02:10:49.964] [Warning] Reducing maxItemCount, new value: 1.
[2018/01/24 02:10:55.451] [Error] Partition 24 exception: Microsoft.Azure.Documents.InternalServerErrorException: Message: {"Errors":["An unknown error occurred while processing this request."]}
ActivityId: 0374c881-c656-4e33-9200-00410870eb5a, Request URI: /apps/6628b461-75d4-4e4a-9897-ada4076dc30c/services/1a1bc4d9-3744-42ea-beec-00c84a423ce7/partitions/dc2dfff7-29ae-4e22-a0ed-185dffe099ea/replicas/131576284970258222s/, RequestStats: 
ResponseTime: 2018-01-24T20:10:55.1040258Z, StoreReadResult: StorePhysicalAddress: rntbd://bn6prdddc05-docdb-1.documents.azure.com:18884/apps/6628b461-75d4-4e4a-9897-ada4076dc30c/services/1a1bc4d9-3744-42ea-beec-00c84a423ce7/partitions/dc2dfff7-29ae-4e22-a0ed-185dffe099ea/replicas/131576284970258222s/, LSN: 841525, GlobalCommittedLsn: 841524, PartitionKeyRangeId: 24, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: False, RequestCharge: 1, ItemLSN: -1, ResourceType: Document, OperationType: ReadFeed
, SDK: Windows/10.0.15063 documentdb-netcore-sdk/1.7.1
   at Microsoft.Azure.Documents.StoreReadResult.ToResponse(RequestChargeTracker requestChargeTracker)
   at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__2`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Client.DocumentClient.<ReadFeedAsync>d__254.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<GetFeedResponseAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<ReadDocumentChangeFeedPrivateAsync>d__15`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass1_0.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 327
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 393
[2018/01/24 02:10:55.468] [Information] Releasing lease for partition 24 due to an error, reason: Unknown!
[2018/01/24 02:10:55.489] [Information] Host 'changedemo' successfully removed PartitionId '24' with lease token '"11006070-0000-0000-0000-5a68e83f0000"' from currently owned partitions.
[2018/01/24 02:10:55.489] [Information] Host 'changedemo' closing event processor for PartitionId '24' and lease token '"11006070-0000-0000-0000-5a68e83f0000"' with reason 'Unknown'
[2018/01/24 02:10:55.515] [Information] Host 'changedemo' releasing partition 24...
[2018/01/24 02:10:55.517] Observer closed, 24
[2018/01/24 02:10:55.517] Reason for shutdown, Unknown

mkolt commented 6 years ago

Please try again. This is a slightly different issue: with maxItemCount=1 there is a long transaction that doesn't fit into the max response size. That must come from a script modifying lots of big documents as part of one transaction, and currently the continuation token cannot address a part of a transaction. Support for that is coming soon.
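
For context, the processor reads each partition key range's feed in pages of up to MaxItemCount documents using a per-range continuation token, roughly like this (a sketch against the DocumentDB .NET SDK, not the processor's actual code; names are placeholders):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

static class FeedReader
{
    // Reads one page of changes for a single partition key range. The continuation
    // token returned here is what gets checkpointed into the lease's ContinuationToken;
    // as noted above, it currently cannot point into the middle of a transaction, so a
    // transaction larger than one response cannot be split across pages.
    public static async Task<string> ReadOnePageAsync(
        DocumentClient client, string collectionLink, string partitionKeyRangeId, string continuation)
    {
        var query = client.CreateDocumentChangeFeedQuery(collectionLink, new ChangeFeedOptions
        {
            PartitionKeyRangeId = partitionKeyRangeId,     // e.g. "7"
            RequestContinuation = continuation,            // e.g. "245895" from the lease document
            StartFromBeginning = continuation == null,
            MaxItemCount = 100                             // the page size being discussed
        });

        FeedResponse<Document> page = await query.ExecuteNextAsync<Document>();
        Console.WriteLine("Read {0} doc(s) from partition key range {1}", page.Count, partitionKeyRangeId);
        return page.ResponseContinuation;
    }
}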

mkolt commented 6 years ago

Looks like it's fine now.

alohaninja commented 6 years ago

@mkolt it looks like I'm still seeing this error: An unknown error occurred while processing this request. We do use stored procedures to bulk-update documents - is that likely the cause? I'm not sure where else we would have a large transaction of documents being changed all at once.

mkolt commented 6 years ago

@alohaninja looks like the partition that hit that again was able to move on. Yes, that's coming from the script. Let us know if you hit this again.

alohaninja commented 6 years ago

@mkolt - the partition only moves on because we manually update the state and null the ContinuationToken in the lease db collection. It will stay stuck at the ContinuationToken because the CFP is unable to move forward.

mkolt commented 6 years ago

@alohaninja Can you go back to the problem state and try again?

alohaninja commented 6 years ago

@mkolt - I turned off our watchdog that auto-updates the state, and it looks like the CFP now continues from the previous reset point, updating state=1 (unhealthy) to state=2 (healthy). That looks like progress - previously, no docs flowed through the change feed once a lease hit state=1. The ContinuationToken also updates as the change feed processes past the state transition.

Lease Doc Reset

{
    "state": 1,
    "PartitionId": "24",
    "Owner": null,
    "ContinuationToken": "\"842525\"",
    "SequenceNumber": 1880
}

Current Lease Doc

{
    "state": 2,
    "PartitionId": "24",
    "Owner": "changedemo",
    "ContinuationToken": "\"843154\"",
    "SequenceNumber": 1889
}

I'll leave our watchdog ON and just reduce its polling frequency to see whether this is fixed. Nice job!

alohaninja commented 6 years ago

@mkolt I'm seeing this issue come back consistently today with one of our monitored collections. Something seems to be getting stuck again: our watchdog (now running every 15 min instead of every 1 min) is catching and correcting it. Is this related to a large transaction?

InternalServerErrorException - An unknown error occurred while processing this request

[2018/01/29 01:47:44.027] [Verbose] Worker task waiting for start signal: partition '7'
[2018/01/29 01:47:44.027] [Verbose] Worker task started: partition '7'
[2018/01/29 01:47:44.027] Observer opened for partition Key Range: 7
[2018/01/29 01:47:44.027] [Verbose] Worker start: partition '7', continuation '"309935"'
[2018/01/29 01:47:44.142] [Warning] Reducing maxItemCount, new value: 50.
[2018/01/29 01:47:49.257] [Warning] Reducing maxItemCount, new value: 25.
[2018/01/29 01:47:54.384] [Warning] Reducing maxItemCount, new value: 12.
[2018/01/29 01:47:59.497] [Warning] Reducing maxItemCount, new value: 6.
[2018/01/29 01:48:04.668] [Warning] Reducing maxItemCount, new value: 3.
[2018/01/29 01:48:09.782] [Warning] Reducing maxItemCount, new value: 1.
[2018/01/29 01:48:14.893] [Error] Partition 7 exception: Microsoft.Azure.Documents.InternalServerErrorException: Message: {"Errors":["An unknown error occurred while processing this request."]}
ActivityId: 9e5505d7-473f-4d85-8d4b-0aec95ebb01f, Request URI: /apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, RequestStats: 
ResponseTime: 2018-01-29T19:48:14.8828306Z, StoreReadResult: StorePhysicalAddress: rntbd://bn6prdapp03-docdb-1.documents.azure.com:14154/apps/6e6ae12d-b3be-4f96-8fe8-7853c6dfbd3b/services/e609b252-8550-4410-9403-55ba80cce3b8/partitions/60c27b83-d242-4619-9da3-f6a16f244416/replicas/131565695538142790s/, LSN: 310020, GlobalCommittedLsn: 310019, PartitionKeyRangeId: 7, IsValid: True, StatusCode: 0, IsGone: False, IsNotFound: False, RequestCharge: 1, ItemLSN: -1, ResourceType: Document, OperationType: ReadFeed
, SDK: Windows/10.0.16299 documentdb-netcore-sdk/1.7.1
   at Microsoft.Azure.Documents.StoreReadResult.ToResponse(RequestChargeTracker requestChargeTracker)
   at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass18_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass2_0`1.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__2`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<InvokeAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Client.DocumentClient.<ReadFeedAsync>d__254.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<GetFeedResponseAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.Linq.ChangeFeedQuery`1.<ReadDocumentChangeFeedPrivateAsync>d__15`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<>c__DisplayClass1_0.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetry>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 327
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 327
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at DocumentDB.ChangeFeedProcessor.ChangeFeedEventHost.<>c__DisplayClass29_0.<<DocumentDB-ChangeFeedProcessor-IPartitionObserver<DocumentDB-ChangeFeedProcessor-DocumentLeaseStore-DocumentServiceLease>-OnPartitionAcquiredAsync>b__0>d.MoveNext() in D:\temp\ChangeFeedDemo\DocumentDB.ChangeFeedProcessor\ChangeFeedEventHost.cs:line 393
[2018/01/29 01:48:14.893] [Information] Releasing lease for partition 7 due to an error, reason: Unknown!
[2018/01/29 01:48:14.893] [Information] Host 'changedemo' successfully removed PartitionId '7' with lease token '"3b002cdb-0000-0000-0000-5a6f7a700000"' from currently owned partitions.
[2018/01/29 01:48:14.893] [Information] Host 'changedemo' closing event processor for PartitionId '7' and lease token '"3b002cdb-0000-0000-0000-5a6f7a700000"' with reason 'Unknown'
[2018/01/29 01:48:14.893] [Information] Host 'changedemo' releasing partition 7...
[2018/01/29 01:48:14.893] Observer closed, 7
[2018/01/29 01:48:14.893] Reason for shutdown, Unknown
[2018/01/29 01:48:14.894] [Information] Partition 7: worker finished!
mkolt commented 6 years ago

@alohaninja Yes, large transaction, different account (now test). Should be mitigated now.

alohaninja commented 6 years ago

Still seeing the issue daily where the lease needs to be reset; our watchdog script (sketched below) is still resetting invalid lease states, likely due to a large transaction update. I will be refactoring the large transaction to run outside a stored procedure to see if this improves things until we get large-transaction support in CFP.

The stored procedure is exhausting server-side resources: it occasionally receives "the script with id 'xxxxx is blocked for execution because it has violated its allowed resource limit several times." It needs to be redesigned as a client-side script rather than being called repeatedly with continuation tokens, as it takes too long to execute within the server-side resource limits.
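
For reference, a minimal sketch of what such a lease watchdog could look like, assuming the @azure/cosmos JavaScript SDK, a lease collection partitioned on `/id`, and the "stuck" pattern of `state=1` with no `Owner`. The function name, environment variables, and database/collection names are hypothetical, not part of the original thread:

```js
// Hypothetical lease watchdog: find lease documents stuck in state=1 with no
// Owner and reset them so a host instance can re-acquire the partition.
const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT, // assumed env vars
  key: process.env.COSMOS_KEY
});

async function resetStuckLeases(databaseId, leaseContainerId) {
  const container = client.database(databaseId).container(leaseContainerId);

  // Query for lease documents matching the broken pattern.
  const { resources: stuckLeases } = await container.items
    .query("SELECT * FROM c WHERE c.state = 1 AND IS_NULL(c.Owner)")
    .fetchAll();

  for (const lease of stuckLeases) {
    // Mark the lease healthy and clear the continuation token; the CFP will
    // write a fresh token once it picks the partition back up.
    lease.state = 2;
    lease.ContinuationToken = null;
    // Assumes the lease collection is partitioned on /id (or non-partitioned).
    await container.item(lease.id, lease.id).replace(lease);
    console.log(`Reset lease for partition ${lease.PartitionId}`);
  }
}

resetStuckLeases("mydb", "leases").catch(console.error);
```

Run on a schedule (for example from a timer-triggered function) this mirrors the manual reset workaround, at the cost of possibly reprocessing changes for the affected partition.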

mkolt commented 6 years ago

There don't seem to be any signals of stuck partitions during last 3 days. We can take a quick look at your sproc, if you want. Share via email.

alohaninja commented 6 years ago

@mkolt I just shared the sproc via email; I hope you can identify any issues we have.

I tried refactoring the large transaction (a server-side sproc that performs bulk updates) to run client-side, but it takes 2-3x as long (updating 6k 100KB docs takes 5-6 min server-side versus 15-20 min client-side). Unfortunately there is no way to do partial document updates yet, so the client-side approach pays a penalty for retrieving and replacing the entire document each time, adding a proportionate amount of latency with each client/server round trip.
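
For illustration, a minimal sketch of the client-side read-modify-replace pattern being described, assuming the @azure/cosmos JavaScript SDK and a container partitioned on `/pk`. The query filter, field names, and `bulkUpdate` helper are hypothetical stand-ins for the real update logic:

```js
// Hypothetical client-side bulk update. With no partial document updates
// available, each document is read in full, mutated, and fully replaced,
// which is the source of the extra round-trip latency mentioned above.
const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT, // assumed env vars
  key: process.env.COSMOS_KEY
});

async function bulkUpdate(databaseId, containerId) {
  const container = client.database(databaseId).container(containerId);

  // Page through the documents that need updating instead of loading them all.
  const iterator = container.items.query(
    "SELECT * FROM c WHERE c.needsUpdate = true", // hypothetical filter
    { maxItemCount: 100 }
  );

  while (iterator.hasMoreResults()) {
    const { resources: page } = await iterator.fetchNext();
    for (const doc of page) {
      doc.needsUpdate = false; // the real mutation would go here
      // Full replace of the (potentially ~100KB) document; 429s are handled
      // by the SDK's default retry policy.
      await container.item(doc.id, doc.pk).replace(doc);
    }
  }
}

bulkUpdate("mydb", "docs").catch(console.error);
```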

I can consistently make the CFP fail when sprocs exceed their bounded execution or process a large batch of bulk updates, presumably related to larger document sizes. Until this is remedied, we have to keep our lease watchdog actively fixing broken partition leases.

syedhassaanahmed commented 6 years ago

@ealsur Not sure if it's related to this issue, but in my case the change feed doesn't get triggered the first time until I manually visit the Function App -> Change Feed Function -> function.json -> Logs. The change feed is being observed on a partitioned collection with 25k RU/s. The function is hosted on a Consumption plan with a single lease (leases collection: 400 RU/s). Once it gets triggered it works fine afterwards, but the first time I always have to navigate to the Logs as mentioned above.

The behavior is not tied to a specific collection or Cosmos DB account, but it is always reproducible by deploying this ARM template and running these load tests. The trigger is configured in code here.

ealsur commented 6 years ago

@syedhassaanahmed please open a different issue; this is not related to sprocs. The Consumption plan is not creating instances for your Function App (that is why it's not triggering). When you open the Functions Portal, it forces the creation of one instance if there are none, and that is when it works.

syedhassaanahmed commented 6 years ago

Thanks @ealsur #2549

pvmraghunandan commented 6 years ago

Any update? Also, is there any guidance for someone starting fresh on how to avoid these issues?

ealsur commented 6 years ago

There is a PR that updates the CFP Library to a version that can handle batch size better for these scenarios. https://github.com/Azure/azure-webjobs-sdk-extensions/pull/410

pvmraghunandan commented 6 years ago

Thanks. Any idea when the next version of the NuGet package will be released?

ealsur commented 6 years ago

Since the fix is targeting V2, I don't know the release schedule. Maybe @paulbatum can shed more light on what could be an estimated time frame.

paulbatum commented 6 years ago

Since this PR is still pending review, I would guess the ETA for it to be merged and then deployed to production is around mid-June.

ryancrawcour commented 6 years ago

@ealsur you say there is a fix targeting V2; what about customers running V1 Functions who are running into this problem? What's the mitigation for them?

ealsur commented 6 years ago

@ryancrawcour V1 dependencies cannot be upgraded further due to a couple of hard-locked dependencies on Functions V1 (Newtonsoft.Json & DocumentDB SDK).

ryancrawcour commented 6 years ago

What's the solution for customers running on V1?