Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License
731 stars 489 forks source link

[vNext] Dependency failure logged in case of 429? #1346

Open Mortana89 opened 4 years ago

Mortana89 commented 4 years ago

Using the v4-preview3 SDK; We noticed we had a few hundred 429 dependency failures logged to application insights;

image

Upon further investigation, it looks as if the SDK doesn't use the retry policy on CreateContainerIfNotExistsAsync (and probably on the database equivalent as well)

Is this supposed to log these 429s? Other 429s occurring as part of a query are not logged to AI, hence the question. We use our own Polly policies to retry upon CosmosExceptions holding 429, but I don't see this being thrown in this path, could be that I'm not looking straight though.

Next to that the dependency logs, we see many 'Azure.Cosmos-Query-Document' events tohether with Http.Request, but they don't hold much. Was hoping it would hold the actual paths + RU charges; image

Are we doing something wrong or is this supposed to look like that?

ealsur commented 4 years ago

If you are getting throttled on Get Collection there is something that points to you doing a lot of metadata (Collection/Database) requests. It's not a normal situation, and might indicate an issue with the application logic. When you provision RU, you do it for item operations within a collection or database, but metadata requests have a very small RU budget of their own.

Why are you calling Get Collection so much?

On the logging side, we are currently sending information and the diagnostics to the Azure Core pipeline (https://docs.microsoft.com/en-us/dotnet/azure/sdk/logging), do you have logging enabled?

Mortana89 commented 4 years ago

Good point, wasn't aware that this would be metadata requests! Upon checking in the portal I can indeed see a few hundred meta data requests being throttled; image

The only thing I can think of why this is happening, is because we are firing many parallel http requests from our front-end, which are all seperate threads in the end... We use a wrapper around the cosmosclient, nothing fancy, but it's registered as a singleton, and is used to get 'containers' from in all our 'query handlers':

`

public class CosmosClientFactory : ICosmosClientFactory { private static ConcurrentDictionary<ICosmosDatabaseSettings, CosmosDatabase> _databases = new ConcurrentDictionary<ICosmosDatabaseSettings, CosmosDatabase>();

    private static ConcurrentDictionary<CosmosDatabase, ConcurrentDictionary<string, CosmosContainer>> _containers =
       new ConcurrentDictionary<CosmosDatabase, ConcurrentDictionary<string, CosmosContainer>>();

    private List<ICosmosDatabaseSettings> _dataSources;
    private CosmosClient _client;

    private readonly string _applicationName;
    private readonly string _accountEndpoint;
    private readonly string _accountAuthKey;

    public CosmosClientFactory(List<ICosmosDatabaseSettings> dataSources, IOptions<JsonSerializerOptions> jsonSerializerOptions,
        string applicationName, string endpoint, string authKey)
    {
        _dataSources = dataSources ?? throw new ArgumentNullException(nameof(dataSources));
        _applicationName = applicationName;
        _accountEndpoint = endpoint;
        _accountAuthKey = authKey;

        _client = new CosmosClientBuilder(_accountEndpoint, _accountAuthKey)
            .WithCustomSerializer(new CustomCosmosSerializer(jsonSerializerOptions))
            .WithApplicationName(_applicationName)
            .WithConnectionModeDirect()
            .Build();
    }

    public CosmosDatabase GetDatabase<TDataSource>()
        where TDataSource : ICosmosDatabaseSettings
    {
        return GetDatabase(typeof(TDataSource));
    }

    public CosmosDatabase GetDatabase(Type dataSourceType)
    {
        var dataSource = _dataSources.SingleOrDefault(x => x.GetType() == dataSourceType);

        if (dataSource == null)
            throw new ArgumentException("Could not find the questioned datasource: " + dataSourceType.AssemblyQualifiedName);

        return _databases.GetOrAdd(dataSource, x => InitializeDatabase(dataSource).GetAwaiter().GetResult());
    }

    public CosmosContainer GetContainer<TDataSource>(string containerName)
        where TDataSource : ICosmosDatabaseSettings
    {
        return GetContainer(typeof(TDataSource), containerName);
    }

    public CosmosContainer GetContainer(Type dataSourceType, string containerName)
    {
        var database = GetDatabase(dataSourceType);
        var dataSource = _dataSources.SingleOrDefault(x => x.GetType() == dataSourceType);

        var containersInDatabase = _containers.GetOrAdd(database, x =>
        {
            return new ConcurrentDictionary<string, CosmosContainer>();
        });

        return containersInDatabase.GetOrAdd(containerName, x => InitializeContainer(dataSource, database, containerName).GetAwaiter().GetResult());

    }
    private async Task<CosmosDatabase> InitializeDatabase(ICosmosDatabaseSettings databaseSettings)
    {
        var database = await _client.CreateDatabaseIfNotExistsAsync(databaseSettings.DatabaseName, 400);

        return database.Database;
    }

    private async Task<CosmosContainer> InitializeContainer(ICosmosDatabaseSettings databaseSettings, CosmosDatabase database, string containerName)
    {
        var container = database.DefineContainer(containerName, "/Discriminator");

        if (databaseSettings.DefaultTimeToLive != null)
            container.WithDefaultTimeToLive(TimeSpan.FromSeconds(databaseSettings.DefaultTimeToLive.Value));

        return await container.WithIndexingPolicy()
            .WithIndexingMode(IndexingMode.Consistent)
            .WithIncludedPaths()
                .Path("/*")
                .Attach()
            .Attach()
        .CreateIfNotExistsAsync();
    }
}`

We are using collection-per-tenant setup, having multiple seperate databases (functional domains), so: DB1

ealsur commented 4 years ago

The problem here is that you might be calling GetOrAdd concurrently from multiple threads, and they would all be doing CreateIfNotExistsAsync, it's quite a lot of requests.

Why not keep a cache of initialized containers on a ConcurrentDictionary and make sure you are only initializing once and then returning the CosmosContainer instance afterwards? All those Get Collection are adding latency to all your operations.

Mortana89 commented 4 years ago

We're doing it concurrently indeed, hence we already used a ConcurrentDictionary which should be threadsafe, or am I misunderstanding you?

ealsur commented 4 years ago

Mmmm then why are you seeing all these Get Collections? Unless you have hundreds of collections? Otherwise there should be only one request per collection.

Unless for some reason, multiple instances of the Factory are being created?

Mortana89 commented 4 years ago

Currently we have like 30 collections or something, and one request can involve a few (max 3-4) collections. We also have multiple microservices, but we have a singleton for the factory :/ Seems like something is not behaving thread-safe then...

ealsur commented 4 years ago

Then you should not see more than 30 requests (max 60 if those containers did not exist due to te Get/Create). But from the charts, it looks like you had > 400 throttles. Maybe each microservice has its own factory and they are all trying to initialize the containers at the same time (so 30 x AmountOfMicroservices?).

Mortana89 commented 4 years ago

Could be indeed, this will mostly occur right after a production release, when the microservices are booting up again and are processing requests... We ported this class from back when we were using Cosmonaut, to have a local cache of containers/databases. Would it still be required though with v4?

ealsur commented 4 years ago

The key here is if you expect those Containers to be deleted. Otherwise a call to the proxy GetContainer does no network calls.

Mortana89 commented 4 years ago

No, if we delete a container we'll do it explicitly, which will also call the corresponding apis (delete and createifnotexists) methods.

ealsur commented 4 years ago

Then could you maintain the result of GetContainer instead of the CreateIfNotExists ? That would all the network calls.

Mortana89 commented 4 years ago

database.GetContainer(containerName); ? Will this throw an error if it doesn't exist?

ealsur commented 4 years ago

client.GetContainer or database.GetContainer. They both return a proxy instance that can be used to execute operations.

If the container does not exist, the operations (for example, ReadItem) will return a 404.