Open Mortana89 opened 4 years ago
If you are getting throttled on Get Collection there is something that points to you doing a lot of metadata (Collection/Database) requests. It's not a normal situation, and might indicate an issue with the application logic. When you provision RU, you do it for item operations within a collection or database, but metadata requests have a very small RU budget of their own.
Why are you calling Get Collection so much?
On the logging side, we are currently sending information and the diagnostics to the Azure Core pipeline (https://docs.microsoft.com/en-us/dotnet/azure/sdk/logging), do you have logging enabled?
Good point, wasn't aware that this would be metadata requests! Upon checking in the portal I can indeed see a few hundred meta data requests being throttled;
The only thing I can think of why this is happening, is because we are firing many parallel http requests from our front-end, which are all seperate threads in the end... We use a wrapper around the cosmosclient, nothing fancy, but it's registered as a singleton, and is used to get 'containers' from in all our 'query handlers':
`
public class CosmosClientFactory : ICosmosClientFactory { private static ConcurrentDictionary<ICosmosDatabaseSettings, CosmosDatabase> _databases = new ConcurrentDictionary<ICosmosDatabaseSettings, CosmosDatabase>();
private static ConcurrentDictionary<CosmosDatabase, ConcurrentDictionary<string, CosmosContainer>> _containers =
new ConcurrentDictionary<CosmosDatabase, ConcurrentDictionary<string, CosmosContainer>>();
private List<ICosmosDatabaseSettings> _dataSources;
private CosmosClient _client;
private readonly string _applicationName;
private readonly string _accountEndpoint;
private readonly string _accountAuthKey;
public CosmosClientFactory(List<ICosmosDatabaseSettings> dataSources, IOptions<JsonSerializerOptions> jsonSerializerOptions,
string applicationName, string endpoint, string authKey)
{
_dataSources = dataSources ?? throw new ArgumentNullException(nameof(dataSources));
_applicationName = applicationName;
_accountEndpoint = endpoint;
_accountAuthKey = authKey;
_client = new CosmosClientBuilder(_accountEndpoint, _accountAuthKey)
.WithCustomSerializer(new CustomCosmosSerializer(jsonSerializerOptions))
.WithApplicationName(_applicationName)
.WithConnectionModeDirect()
.Build();
}
public CosmosDatabase GetDatabase<TDataSource>()
where TDataSource : ICosmosDatabaseSettings
{
return GetDatabase(typeof(TDataSource));
}
public CosmosDatabase GetDatabase(Type dataSourceType)
{
var dataSource = _dataSources.SingleOrDefault(x => x.GetType() == dataSourceType);
if (dataSource == null)
throw new ArgumentException("Could not find the questioned datasource: " + dataSourceType.AssemblyQualifiedName);
return _databases.GetOrAdd(dataSource, x => InitializeDatabase(dataSource).GetAwaiter().GetResult());
}
public CosmosContainer GetContainer<TDataSource>(string containerName)
where TDataSource : ICosmosDatabaseSettings
{
return GetContainer(typeof(TDataSource), containerName);
}
public CosmosContainer GetContainer(Type dataSourceType, string containerName)
{
var database = GetDatabase(dataSourceType);
var dataSource = _dataSources.SingleOrDefault(x => x.GetType() == dataSourceType);
var containersInDatabase = _containers.GetOrAdd(database, x =>
{
return new ConcurrentDictionary<string, CosmosContainer>();
});
return containersInDatabase.GetOrAdd(containerName, x => InitializeContainer(dataSource, database, containerName).GetAwaiter().GetResult());
}
private async Task<CosmosDatabase> InitializeDatabase(ICosmosDatabaseSettings databaseSettings)
{
var database = await _client.CreateDatabaseIfNotExistsAsync(databaseSettings.DatabaseName, 400);
return database.Database;
}
private async Task<CosmosContainer> InitializeContainer(ICosmosDatabaseSettings databaseSettings, CosmosDatabase database, string containerName)
{
var container = database.DefineContainer(containerName, "/Discriminator");
if (databaseSettings.DefaultTimeToLive != null)
container.WithDefaultTimeToLive(TimeSpan.FromSeconds(databaseSettings.DefaultTimeToLive.Value));
return await container.WithIndexingPolicy()
.WithIndexingMode(IndexingMode.Consistent)
.WithIncludedPaths()
.Path("/*")
.Attach()
.Attach()
.CreateIfNotExistsAsync();
}
}`
We are using collection-per-tenant setup, having multiple seperate databases (functional domains), so: DB1
The problem here is that you might be calling GetOrAdd concurrently from multiple threads, and they would all be doing CreateIfNotExistsAsync, it's quite a lot of requests.
Why not keep a cache of initialized containers on a ConcurrentDictionary and make sure you are only initializing once and then returning the CosmosContainer instance afterwards? All those Get Collection are adding latency to all your operations.
We're doing it concurrently indeed, hence we already used a ConcurrentDictionary which should be threadsafe, or am I misunderstanding you?
Mmmm then why are you seeing all these Get Collections? Unless you have hundreds of collections? Otherwise there should be only one request per collection.
Unless for some reason, multiple instances of the Factory are being created?
Currently we have like 30 collections or something, and one request can involve a few (max 3-4) collections. We also have multiple microservices, but we have a singleton for the factory :/ Seems like something is not behaving thread-safe then...
Then you should not see more than 30 requests (max 60 if those containers did not exist due to te Get/Create). But from the charts, it looks like you had > 400 throttles. Maybe each microservice has its own factory and they are all trying to initialize the containers at the same time (so 30 x AmountOfMicroservices?).
Could be indeed, this will mostly occur right after a production release, when the microservices are booting up again and are processing requests... We ported this class from back when we were using Cosmonaut, to have a local cache of containers/databases. Would it still be required though with v4?
The key here is if you expect those Containers to be deleted. Otherwise a call to the proxy GetContainer
does no network calls.
No, if we delete a container we'll do it explicitly, which will also call the corresponding apis (delete and createifnotexists) methods.
Then could you maintain the result of GetContainer instead of the CreateIfNotExists ? That would all the network calls.
database.GetContainer(containerName); ? Will this throw an error if it doesn't exist?
client.GetContainer or database.GetContainer. They both return a proxy instance that can be used to execute operations.
If the container does not exist, the operations (for example, ReadItem) will return a 404.
Using the v4-preview3 SDK; We noticed we had a few hundred 429 dependency failures logged to application insights;
Upon further investigation, it looks as if the SDK doesn't use the retry policy on CreateContainerIfNotExistsAsync (and probably on the database equivalent as well)
Is this supposed to log these 429s? Other 429s occurring as part of a query are not logged to AI, hence the question. We use our own Polly policies to retry upon CosmosExceptions holding 429, but I don't see this being thrown in this path, could be that I'm not looking straight though.
Next to that the dependency logs, we see many 'Azure.Cosmos-Query-Document' events tohether with Http.Request, but they don't hold much. Was hoping it would hold the actual paths + RU charges;
Are we doing something wrong or is this supposed to look like that?