dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Backoff for retries when Azure Cosmos DB Table Storage calls fail because of RU limits #9071

Open pdehne-steidle opened 3 months ago

pdehne-steidle commented 3 months ago

We had an incident where queries to the Azure Cosmos DB Table Storage API were throttled because RU limits were configured.

This made Orleans queries fail. Orleans then retried these queries as fast as it could, without any backoff; it looked like an endless loop. The Silos were up and running at this point, so the retried table storage queries may have been part of Orleans' internal clustering queries to Table Storage.

Would it be a good idea for Orleans to add some backoff for these retries? I imagine that once grains have been activated and reminders created (maybe even throttling internal clustering calls during Silo startup to stay within the RU limits), the Silos could leave the startup spike behind them and resume standard operations within the configured RU limits.
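For illustration, here is a minimal sketch of how some backoff could be applied today at the Azure SDK level, by handing Orleans a TableServiceClient configured with exponential retries. This assumes the ConfigureTableServiceClient callback overload exposed by the Azure Storage clustering options in Orleans 8.x, and the connection string is a placeholder. It only throttles the SDK's own retries of throttled calls and does not add backoff to Orleans' higher-level retry loop, so it is a partial mitigation at best, not the feature requested here.

using System;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Data.Tables;
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        silo.UseAzureStorageClustering(options =>
        {
            // Placeholder; substitute the real Cosmos DB Table API connection string.
            var connectionString = "<cosmos-table-connection-string>";

            // Azure.Data.Tables inherits Azure.Core's RetryOptions, which support
            // exponential backoff at the SDK level for throttled (429) responses.
            var tableOptions = new TableClientOptions();
            tableOptions.Retry.Mode = RetryMode.Exponential;
            tableOptions.Retry.Delay = TimeSpan.FromSeconds(2);
            tableOptions.Retry.MaxDelay = TimeSpan.FromSeconds(30);
            tableOptions.Retry.MaxRetries = 5;

            // Assumed overload: let Orleans use a client we construct ourselves.
            options.ConfigureTableServiceClient(
                () => Task.FromResult(new TableServiceClient(connectionString, tableOptions)));
        });
    })
    .Build();

await host.RunAsync();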

Here is a screenshot from Azure Log Analytics where the failed calls were logged: [image]

This happened with Orleans 8.1.

pdehne-steidle commented 2 months ago

We had a related issue today. The Azure Kubernetes cluster performed an update. Orleans restarted and, again, the ASP.NET Core endpoint, which is an Orleans Client Pod, started to query OrleansSiloInstances very fast.

This time there was no RU limit issue; it was just the Kubernetes update and the Client / Silo restarts.

Is there any way to work around this?

[image]

Looking into the Table Storage logs, I see a huge number of queries (more than 1 million in a few hours) like this:

{
  "time": "2024-08-04T07:21:03.5300448Z",
  "operationName": "Query",
  "tableName": "OrleansSiloInstances",
  "address": "10.240.0.5"
}
{
  "time": "2024-08-04T07:21:03.6249458Z",
  "operationName": "Query",
  "tableName": "OrleansSiloInstances",
  "address": "10.240.0.5"
}
{
  "time": "2024-08-04T07:21:03.6253174Z",
  "operationName": "Query",
  "tableName": "OrleansSiloInstances",
  "address": "10.240.0.5"
}

Restarting the ASP.NET Core Endpoint Pod fixed the endless loop.
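Not a fix for the restart loop itself, but as a possible way to reduce the client's steady-state pressure on OrleansSiloInstances, here is a hedged sketch of client-side configuration: stretching GatewayOptions.GatewayListRefreshPeriod (which controls how often the client re-reads the membership table for the gateway list) and applying the same SDK-level exponential retries as above. The connection string is a placeholder and the ConfigureTableServiceClient overload is assumed from Orleans 8.x; this would not stop a genuine failure loop like the one shown here.

using System;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Data.Tables;
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleansClient(client =>
    {
        // Placeholder; substitute the real Cosmos DB Table API connection string.
        var connectionString = "<cosmos-table-connection-string>";

        client.UseAzureStorageClustering(options =>
        {
            // Exponential backoff for the underlying Azure Tables SDK calls.
            var tableOptions = new TableClientOptions();
            tableOptions.Retry.Mode = RetryMode.Exponential;
            tableOptions.Retry.Delay = TimeSpan.FromSeconds(2);
            tableOptions.Retry.MaxDelay = TimeSpan.FromSeconds(30);
            tableOptions.Retry.MaxRetries = 5;

            // Assumed overload: let Orleans use a client we construct ourselves.
            options.ConfigureTableServiceClient(
                () => Task.FromResult(new TableServiceClient(connectionString, tableOptions)));
        });

        // Query OrleansSiloInstances for the gateway list less frequently
        // (the default refresh period is on the order of a minute).
        client.Configure<GatewayOptions>(o =>
            o.GatewayListRefreshPeriod = TimeSpan.FromMinutes(5));
    })
    .Build();

await host.RunAsync();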