ghost opened this issue 7 years ago · Status: Open
So as a side effect of this, the IChangeFeedObserver gets closed (CloseAsync) but still continues to receive ProcessChangesAsync calls.
IMHO there should not be any calls to ProcessChangesAsync after a CloseAsync call. At some point (after the CloseAsync) the factory creates a new IChangeFeedObserver, but in the meantime the "closed" observer keeps processing data. I was able to reproduce what you describe.
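For reference, here is a minimal defensive sketch of the situation being discussed, assuming the IChangeFeedObserver shape used by the linked sample (OpenAsync / CloseAsync / ProcessChangesAsync); exact namespaces and signatures vary between versions of the Change Feed Processor library. It only remembers that CloseAsync ran and drops later batches, which works around the symptom but not the underlying race:

```csharp
// Hedged sketch, not the library's recommended fix: ignore ProcessChangesAsync
// calls that arrive after CloseAsync. Type and namespace names follow the v1
// Change Feed Processor library and may differ in your version.
using System.Collections.Generic;
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents;

public class GuardedFeedObserver : IChangeFeedObserver
{
    // Flipped by CloseAsync and checked before doing any work.
    private volatile bool closed;

    public Task OpenAsync(ChangeFeedObserverContext context)
    {
        closed = false;
        return Task.CompletedTask;
    }

    public Task CloseAsync(ChangeFeedObserverContext context, ChangeFeedObserverCloseReason reason)
    {
        // After this point the host may still call ProcessChangesAsync
        // (the behavior reported in this issue), so remember we are closed.
        closed = true;
        return Task.CompletedTask;
    }

    public Task ProcessChangesAsync(ChangeFeedObserverContext context, IReadOnlyList<Document> docs)
    {
        if (closed)
        {
            // Drop the batch instead of acting on a "dead" observer.
            return Task.CompletedTask;
        }

        // Application-specific processing of docs goes here.
        return Task.CompletedTask;
    }
}
```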
@apollak Are you using the Change Feed Processor NuGet?
@ealsur Yes, I am using the Change Feed Processor NuGet. I uploaded demo code to my GitHub repo yesterday (to avoid confusion: the code uses both the direct approach and the ChangeFeedProcessor for demonstration). See: https://github.com/SpectoLogic/CosmosDBVienna2017/blob/master/CompareAPI/CompareAPI/ChangeFeedDemo/DocumentFeedObserver.cs (CloseAsync method).
@ealsur are you able to repro it?
@kirankumarkolli Not really. I tested the library hooked to a collection with incoming sensor data and did not see it fail, but maybe I'm not hitting "heavy load"? @skram-softheme Is the lease collection located in the same database and service as the collection being monitored? If not, is it geographically close to the process working with the Change Feed?
@ealsur Yes, both collections are in the same service/database. It reproduces very easily. You just need a collection with a large change feed (ideally as few documents as possible per change feed batch) and high RU/s provisioned, so the thread that processes the feed checkpoints a few times per second. The LeaseRenewer thread then fails to update the lease because on every try the etag does not match, so it gives up and closes the observer, while the worker thread continues to call ProcessChangesAsync.
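To illustrate the etag race being described (a hypothetical sketch, not the library's internal code): both the checkpoint path and the LeaseRenewer replace the same lease document with an If-Match access condition, so whichever writer read the older etag gets a 412 PreconditionFailed and gives up:

```csharp
// Hedged sketch of optimistic concurrency on a lease document. A checkpoint that
// replaced the lease first bumps its etag, so a renewer holding the old etag
// fails with PreconditionFailed on every retry.
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class LeaseUpdateSketch
{
    public static async Task<bool> TryReplaceLeaseAsync(DocumentClient client, Document lease)
    {
        try
        {
            await client.ReplaceDocumentAsync(
                lease.SelfLink,
                lease,
                new RequestOptions
                {
                    AccessCondition = new AccessCondition
                    {
                        Type = AccessConditionType.IfMatch,
                        Condition = lease.ETag // etag as read before this update attempt
                    }
                });
            return true; // our etag was still current
        }
        catch (DocumentClientException ex) when (ex.StatusCode == HttpStatusCode.PreconditionFailed)
        {
            // Another writer (e.g. a checkpoint) replaced the lease first,
            // which is exactly what makes the LeaseRenewer give up here.
            return false;
        }
    }
}
```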
@skram-softheme Your lease collection is not partitioned, right? Is your monitored collection partitioned? Could it be creating too many partition splits?
@ealsur The lease collection is not partitioned; the monitored collection is partitioned, but in fact it has only one physical partition.
@ealsur You can take my ready-to-run sample to reproduce it. It just requires creating a Cosmos DB instance and populating Account_DemoBuild_Mongo and Account_DemoBuild_Mongo_Key in App.config. Run the sample, and after 30-60 seconds you will hit a breakpoint in DocumentFeedObserver.cs (CloseAsync method). The sample creates a database "demodb" and two collections.
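For anyone reproducing this by hand, a hypothetical snippet of how those two App.config values might be read (the key names come from the comment above; the actual sample code may differ):

```csharp
// Hypothetical helper; key names taken from the comment above.
using System.Configuration;

static class DemoConfig
{
    public static string AccountEndpoint =>
        ConfigurationManager.AppSettings["Account_DemoBuild_Mongo"];

    public static string AccountKey =>
        ConfigurationManager.AppSettings["Account_DemoBuild_Mongo_Key"];
}
```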
@apollak Thanks! I finally saw it. It seems like a race condition of some sort. I'll share it internally.
@ealsur Any update on this?
The Change Feed Processor host fails to renew its own lease under heavy load, e.g. when there are a lot of documents in the feed. Imagine a scenario where you have a lot of documents in the feed and only one host running. One thread receives the change feed and checkpoints very frequently (updating the lease for the host). The LeaseRenewer thread tries to renew the lease for the same host every 17 seconds, but fails after 5 retries because on every try the lease has already been overwritten by a checkpoint. So the LeaseRenewer fails to renew the lease and closes the observer with the reason "lease lost", which is undesirable and does not make much sense for our use cases.
Perhaps an even more serious issue is when the LeaseRenewer succeeds in updating the lease and then the checkpoint fails to update it.
Is such behavior expected? Could you please suggest the recommended way to resolve this? (See the configuration sketch below.)
See https://github.com/Azure/azure-documentdb-dotnet/tree/master/samples/ChangeFeedProcessor
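For completeness, a hedged sketch of wiring up the host with relaxed lease and checkpoint settings, assuming the ChangeFeedEventHost / ChangeFeedHostOptions types from the linked sample repository (property names, defaults, and namespaces vary between library versions, and the collection names below are placeholders):

```csharp
// Hedged sketch: stretch the lease renew/expiration intervals and checkpoint less
// aggressively, which are the knobs involved in the race described in this issue.
// Verify type and property names against the library version you actually use.
using System;
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents.Client;

class ChangeFeedHostSketch
{
    static async Task RunAsync()
    {
        var monitoredCollection = new DocumentCollectionInfo
        {
            Uri = new Uri("https://YOUR_ACCOUNT.documents.azure.com:443/"),
            MasterKey = "YOUR_KEY",
            DatabaseName = "demodb",
            CollectionName = "monitored"   // placeholder name
        };

        var leaseCollection = new DocumentCollectionInfo
        {
            Uri = new Uri("https://YOUR_ACCOUNT.documents.azure.com:443/"),
            MasterKey = "YOUR_KEY",
            DatabaseName = "demodb",
            CollectionName = "leases"      // placeholder name
        };

        var hostOptions = new ChangeFeedHostOptions
        {
            // Renew less often than the ~17s default and let leases live longer,
            // so frequent checkpoints are less likely to collide with the renewer.
            LeaseRenewInterval = TimeSpan.FromSeconds(30),
            LeaseExpirationInterval = TimeSpan.FromSeconds(120),
            // Checkpoint by document count / time instead of after every batch.
            CheckpointFrequency = new CheckpointFrequency
            {
                ProcessedDocumentCount = 1000,
                TimeInterval = TimeSpan.FromSeconds(30)
            }
        };

        var host = new ChangeFeedEventHost(
            "demo-host",
            monitoredCollection,
            leaseCollection,
            new ChangeFeedOptions(),
            hostOptions);

        // DocumentFeedObserver is the observer type from the linked sample
        // (or substitute the GuardedFeedObserver sketched earlier in this thread).
        await host.RegisterObserverAsync<DocumentFeedObserver>();
    }
}
```

Checkpointing less often reduces etag churn on the lease document, and therefore the chance of the renew/checkpoint collision, at the cost of reprocessing more documents after a restart or failover.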