ghost opened this issue 7 years ago · Status: Open
So as a side effect of this, the IChangeFeedObserver gets closed (CloseAsync) but still continues to receive ProcessChangesAsync calls.
IMHO there should not be any calls to ProcessChangesAsync after a CloseAsync call. At some point (after the CloseAsync) the factory creates a new IChangeFeedObserver, but in the meantime the "closed" observer keeps processing data. I was able to reproduce what you describe.
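For reference, here is a minimal defensive sketch of the situation being discussed, assuming the IChangeFeedObserver shape used by the linked sample (OpenAsync / CloseAsync / ProcessChangesAsync); exact namespaces and signatures vary between versions of the Change Feed Processor library. It only remembers that CloseAsync ran and drops later batches, which works around the symptom but not the underlying race:

```csharp
// Hedged sketch, not the library's recommended fix: ignore ProcessChangesAsync
// calls that arrive after CloseAsync. Type and namespace names follow the v1
// Change Feed Processor library and may differ in your version.
using System.Collections.Generic;
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents;

public class GuardedFeedObserver : IChangeFeedObserver
{
    // Flipped by CloseAsync and checked before doing any work.
    private volatile bool closed;

    public Task OpenAsync(ChangeFeedObserverContext context)
    {
        closed = false;
        return Task.CompletedTask;
    }

    public Task CloseAsync(ChangeFeedObserverContext context, ChangeFeedObserverCloseReason reason)
    {
        // After this point the host may still call ProcessChangesAsync
        // (the behavior reported in this issue), so remember we are closed.
        closed = true;
        return Task.CompletedTask;
    }

    public Task ProcessChangesAsync(ChangeFeedObserverContext context, IReadOnlyList<Document> docs)
    {
        if (closed)
        {
            // Drop the batch instead of acting on a "dead" observer.
            return Task.CompletedTask;
        }

        // Application-specific processing of docs goes here.
        return Task.CompletedTask;
    }
}
```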
@apollak Are you using the Change Feed Processor NuGet?
@ealsur Yes, I am using the Change Feed Processor NuGet. I uploaded demo code to my GitHub repo yesterday (to avoid confusion: the code uses both the direct approach and the ChangeFeedProcessor for demonstration). See: https://github.com/SpectoLogic/CosmosDBVienna2017/blob/master/CompareAPI/CompareAPI/ChangeFeedDemo/DocumentFeedObserver.cs (CloseAsync method).
@ealsur are you able to repro it?
@kirankumarkolli Not really. I tested the library hooked to a collection with incoming sensor data and did not see it fail, but maybe I'm not hitting "heavy load"? @skram-softheme Is the lease collection located in the same database and service as the collection being monitored? If not, is it geographically close to the process working with the Change Feed?
@ealsur Yes, both collections are in the same service/database. It reproduces very easily. You just need a collection with a large change feed (ideally as few documents as possible per change feed batch) and high RU/s provisioned, so the thread that processes the feed checkpoints a few times per second. The LeaseRenewer thread then fails to update the lease because on every try the etag does not match, so it gives up and closes the observer, while the worker thread continues to call ProcessChangesAsync.
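To illustrate the etag race being described (a hypothetical sketch, not the library's internal code): both the checkpoint path and the LeaseRenewer replace the same lease document with an If-Match access condition, so whichever writer read the older etag gets a 412 PreconditionFailed and gives up:

```csharp
// Hedged sketch of optimistic concurrency on a lease document. A checkpoint that
// replaced the lease first bumps its etag, so a renewer holding the old etag
// fails with PreconditionFailed on every retry.
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class LeaseUpdateSketch
{
    public static async Task<bool> TryReplaceLeaseAsync(DocumentClient client, Document lease)
    {
        try
        {
            await client.ReplaceDocumentAsync(
                lease.SelfLink,
                lease,
                new RequestOptions
                {
                    AccessCondition = new AccessCondition
                    {
                        Type = AccessConditionType.IfMatch,
                        Condition = lease.ETag // etag as read before this update attempt
                    }
                });
            return true; // our etag was still current
        }
        catch (DocumentClientException ex) when (ex.StatusCode == HttpStatusCode.PreconditionFailed)
        {
            // Another writer (e.g. a checkpoint) replaced the lease first,
            // which is exactly what makes the LeaseRenewer give up here.
            return false;
        }
    }
}
```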
@skram-softheme Your lease collection is not partitioned, right? Is your monitored collection partitioned? Could it be creating too many partition splits?
@ealsur The lease collection is not partitioned; the monitored collection is partitioned, but in fact it has only one physical partition.
@ealsur You can take my ready-to-run sample to reproduce it. It just requires creating a Cosmos DB instance and populating Account_DemoBuild_Mongo and Account_DemoBuild_Mongo_Key in App.config. Run the sample, and after 30-60 seconds you will hit a breakpoint in DocumentFeedObserver.cs (CloseAsync method). The sample creates a database "demodb" and two collections.
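For anyone reproducing this by hand, a hypothetical snippet of how those two App.config values might be read (the key names come from the comment above; the actual sample code may differ):

```csharp
// Hypothetical helper; key names taken from the comment above.
using System.Configuration;

static class DemoConfig
{
    public static string AccountEndpoint =>
        ConfigurationManager.AppSettings["Account_DemoBuild_Mongo"];

    public static string AccountKey =>
        ConfigurationManager.AppSettings["Account_DemoBuild_Mongo_Key"];
}
```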
@apollak Thanks! I finally saw it. It seems like a race condition of some sort. I'll share it internally.
@ealsur Any update on this?
The Change Feed Processor host fails to renew its own lease under heavy load, e.g. when there are a lot of documents in the feed. Imagine a scenario where you have a lot of documents in the feed and only one host running. One thread receives the change feed and checkpoints very frequently (updating the lease for the host). The LeaseRenewer thread tries to renew the lease for the same host every 17 seconds, but fails after 5 retries because on every try the lease has already been overwritten by a checkpoint. So the LeaseRenewer fails to renew the lease and closes the observer with the reason "lease lost", which is undesirable and does not make much sense for our use cases.
Perhaps an even more serious issue is when the LeaseRenewer succeeds in updating the lease and then the checkpoint fails to update it.
Is such behavior expected? Could you please suggest the recommended way to resolve this? (See the configuration sketch below.)
See https://github.com/Azure/azure-documentdb-dotnet/tree/master/samples/ChangeFeedProcessor
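For completeness, a hedged sketch of wiring up the host with relaxed lease and checkpoint settings, assuming the ChangeFeedEventHost / ChangeFeedHostOptions types from the linked sample repository (property names, defaults, and namespaces vary between library versions, and the collection names below are placeholders):

```csharp
// Hedged sketch: stretch the lease renew/expiration intervals and checkpoint less
// aggressively, which are the knobs involved in the race described in this issue.
// Verify type and property names against the library version you actually use.
using System;
using System.Threading.Tasks;
using DocumentDB.ChangeFeedProcessor;
using Microsoft.Azure.Documents.Client;

class ChangeFeedHostSketch
{
    static async Task RunAsync()
    {
        var monitoredCollection = new DocumentCollectionInfo
        {
            Uri = new Uri("https://YOUR_ACCOUNT.documents.azure.com:443/"),
            MasterKey = "YOUR_KEY",
            DatabaseName = "demodb",
            CollectionName = "monitored"   // placeholder name
        };

        var leaseCollection = new DocumentCollectionInfo
        {
            Uri = new Uri("https://YOUR_ACCOUNT.documents.azure.com:443/"),
            MasterKey = "YOUR_KEY",
            DatabaseName = "demodb",
            CollectionName = "leases"      // placeholder name
        };

        var hostOptions = new ChangeFeedHostOptions
        {
            // Renew less often than the ~17s default and let leases live longer,
            // so frequent checkpoints are less likely to collide with the renewer.
            LeaseRenewInterval = TimeSpan.FromSeconds(30),
            LeaseExpirationInterval = TimeSpan.FromSeconds(120),
            // Checkpoint by document count / time instead of after every batch.
            CheckpointFrequency = new CheckpointFrequency
            {
                ProcessedDocumentCount = 1000,
                TimeInterval = TimeSpan.FromSeconds(30)
            }
        };

        var host = new ChangeFeedEventHost(
            "demo-host",
            monitoredCollection,
            leaseCollection,
            new ChangeFeedOptions(),
            hostOptions);

        // DocumentFeedObserver is the observer type from the linked sample
        // (or substitute the GuardedFeedObserver sketched earlier in this thread).
        await host.RegisterObserverAsync<DocumentFeedObserver>();
    }
}
```

Checkpointing less often reduces etag churn on the lease document, and therefore the chance of the renew/checkpoint collision, at the cost of reprocessing more documents after a restart or failover.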