DapperLib / Dapper

Dapper - a simple object mapper for .Net
https://www.learndapper.com/

Possible concurrency issues with caches? #2117

Open JamalQasem opened 1 month ago

JamalQasem commented 1 month ago

We are using SqlKata/Dapper (2.1.35) to implement a Web API for a Postgres database with around 20 tables. The API contains methods to query and manipulate data and does not use stored procedures. We test the repository methods of the API with around 200 unit tests, which run transactions against a test database and roll them back instead of committing.
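A rough sketch of the rollback-per-test pattern described above, assuming NUnit and Npgsql (the connection string and class names are illustrative, not the actual test infrastructure):

    // using NUnit.Framework; using Npgsql; using System.Threading.Tasks;
    [TestFixture]
    public class RepositoryTests
    {
        private NpgsqlConnection _connection = null!;
        private NpgsqlTransaction _transaction = null!;

        [SetUp]
        public async Task SetUpAsync()
        {
            _connection = new NpgsqlConnection("Host=localhost;Database=testdb;Username=test;Password=test");
            await _connection.OpenAsync();
            // Every test runs inside this transaction.
            _transaction = await _connection.BeginTransactionAsync();
        }

        [TearDown]
        public async Task TearDownAsync()
        {
            // Roll back instead of committing so the test database is left unchanged.
            await _transaction.RollbackAsync();
            await _connection.DisposeAsync();
        }
    }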

Recently one of our tests started failing with the following error message: Npgsql.PostgresException : 23503: insert or update on table "___" violates foreign key constraint "___"

We then annotated the offending test case with

[TestFixture, RequiresThread(ApartmentState.STA)]

and the problem vanished. All our repository methods rely on functions like

    protected async Task<IEnumerable<TEntity>> QueryAsync(Query query, IDbTransaction transaction, CancellationToken token)
    {
        var command = CompileQuery(query, transaction, token);
        return await transaction.Connection!.QueryAsync<TEntity>(command).ConfigureAwait(false);
    }

    private async Task<int> ExecuteAsync(Query query, IDbTransaction transaction, CancellationToken token)
    {
        var command = CompileQuery(query, transaction, token);
        return await transaction.Connection!.ExecuteAsync(command).ConfigureAwait(false);        
    }

We first checked that the queries use different connections from the connection pool and experimented with the command flag NoCache, to no avail. We finally decided to serialize access using a semaphore. This, together with a call to SqlMapper.PurgeQueryCache(), made the failing test pass again under multithreaded execution.

The current implementation looks as follows:

    protected async Task<IEnumerable<TEntity>> QueryAsync(Query query, IDbTransaction transaction, CancellationToken token)
    {
        await _semaphore.WaitAsync().ConfigureAwait(false);
        try
        {
            var command = CompileQuery(query, transaction, token);
            return await transaction.Connection!.QueryAsync<TEntity>(command).ConfigureAwait(false);
        }
        finally
        {
            SqlMapper.PurgeQueryCache();
            _semaphore.Release();
        }
    }

We suspect that we somehow manage to use different statements that map to the same cache Identity, so that concurrently running queries may pick up the wrong cache entry. This leaves us with two unanswered questions:

mgravell commented 1 month ago

Hmm; this is odd. There are ways to bypass all the ref-emit caching; the one I'd say "try first", however, is the DapperAOT work. This is a little hard to opine on in this specific case because of your Query type, which is an unknown to me. The other option is the NoCache flag, which you say you have already been using.
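For reference, NoCache is a member of Dapper's CommandFlags enum and is applied per command via a CommandDefinition; a minimal usage sketch, where sql, parameters, connection, transaction, token and TEntity are placeholders:

    var command = new CommandDefinition(sql, parameters,
        transaction: transaction, cancellationToken: token, flags: CommandFlags.NoCache);
    var rows = await connection.QueryAsync<TEntity>(command).ConfigureAwait(false);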

However, I wonder whether we've ruled out simpler causes, such as the data genuinely violating FK rules for data reasons, unrelated to Dapper.

JamalQasem commented 1 month ago

Thank you very much for your swift reply.

Hmm; this is odd. There are ways to bypass all the ref-emit caching; the one I'd say "try first", however, is the DapperAOT work.

As far as we know, DapperAOT does not support all of the functionality that Dapper provides.

This is a little hard to opine on in this specific case because of your Query type, which is an unknown to me.

The Query type we use originates from the SqlKata query builder; it generates a SQL command string, which is injected into the Dapper command definition like so:

    var statement = _compiler.Compile(query).ToString();
    return new CommandDefinition(statement, transaction: transaction,
        cancellationToken: token, flags: flags);

The other option is the NoCache flag, which you say you have already been using.

NoCache seems to work for us with PR #2118.

However, I wonder whether we've ruled out simpler causes, such as the data genuinely violating FK rules for data reasons, unrelated to Dapper.

The FK violation is the result of the previous call to MultiMapAsync silently returning wrong data.

goerch commented 1 month ago

We suspect that we somehow manage to use different statements which map to the same cache Identity so that concurrently running queries may use a wrong cache entry.

@mgravell: we were able to reduce the failing scenario to two test cases running the same statement concurrently. Unfortunately these seem to have identical Dapper Identity hash codes although the associated DbDataReaders are different. It looks to me as if SqlMapper's GetColumnHash() computation doesn't incorporate the actual reader? We would be grateful for advice on how to proceed.

JamalQasem commented 1 month ago

Unfortunately these seem to have identical Dapper Identity hash codes although the associated DbDataReaders are different. It looks to me as if SqlMapper's GetColumnHash() computation doesn't incorporate the actual reader?
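If the only change below is the added thread component, the unmodified version being replaced would look roughly like this (paraphrased for comparison; check the actual SqlMapper source):

        private static int GetColumnHash(DbDataReader reader, int startBound = 0, int length = -1)
        {
            unchecked
            {
                int max = length < 0 ? reader.FieldCount : startBound + length;
                int hash = (-37 * startBound) + max;   // no per-thread component
                for (int i = startBound; i < max; i++)
                {
                    object tmp = reader.GetName(i);
                    hash = (-79 * ((hash * 31) + (tmp?.GetHashCode() ?? 0))) + (reader.GetFieldType(i)?.GetHashCode() ?? 0);
                }
                return hash;
            }
        }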

A change like

        private static int GetColumnHash(DbDataReader reader, int startBound = 0, int length = -1)
        {
            unchecked
            {
                int max = length < 0 ? reader.FieldCount : startBound + length;
                int hash = -37 * (Thread.CurrentThread.GetHashCode() + startBound) + max; // changed: mix the current thread's hash code into the seed
                for (int i = startBound; i < max; i++)
                {
                    object tmp = reader.GetName(i);
                    hash = (-79 * ((hash * 31) + (tmp?.GetHashCode() ?? 0))) + (reader.GetFieldType(i)?.GetHashCode() ?? 0);
                }
                return hash;
            }
        } 

passes the Dapper test suite and seems to solve our original problem.

mgravell commented 1 month ago

@JamalQasem incorporating a thread identifier into the hash simply masks the problem - it artificially forces more lookup misses by only allowing requests on the same thread to ever equate - however, if there is an equality-collision problem, it does nothing to prevent the actual problem when it occurs on the same thread. So: the worst of every outcome, even if it passes tests.

However: I am very open to us needing to radically rework the check here. Identity already incorporates a number of high-level checks including the SQL, parameter type, result type, and even connection-string metadata (in case of different databases having different schemas). The column-hash here is intended as a final smoke test along the lines of "did somebody run a DDL command on the database and change the schema in between two commands?". But: if that isn't working well in a scenario: let's explore that and fix it.

So: can I get more info about the scenario here? In particular, because Identity already includes those aforementioned checks, the only way I can imagine this happening is with a multi-purpose stored procedure that returns radically different shapes for different inputs, i.e. where MyProc @Mode=1, @Name='abc', @Id=123 returns radically different columns to MyProc @Mode=2, @Name='abc', @Id=123, and where, by some complete fluke, the column hash comes back as identical.

The "real" solution here is to radically rework the way the column lookup works: we've already done that work over in AOT, which is why I say that if possible, "move to AOT" may be a quick fix. It is stupendously hard to fully rework that into Dapper "vanilla" (see https://github.com/DapperLib/Dapper/issues/1909), however, we might be able to do some compromise that fixes the underlying problem. For example, instead of a column hash, we could store a string of the column type data for "new" records, compare against that full string (ideally without actually allocating a string each test time, but that's an implementation detail - either via forwards crawl, or "generate to a buffer, compare buffers").

So: @goerch - I want to explore this, but to do that: do you have a concrete example I can look at that fails? In particular, is my "SP that returns different things in different contexts" guess even remotely correct? This would also seem to be consistent with your mention of "the same statement concurrently"; again, one would normally expect that the same statement, with the same input and output types, is going to have the same fundamental shape. If that expectation is wrong, we can fix it.

goerch commented 1 month ago

@mgravell: many thanks for your support!

do you have a concrete example I can look at, that fails?

We are currently lucky(?) enough that our problem occurs fairly deterministically, which is not a given in multithreaded environments. We'll next try to reduce and anonymize the tests, which will take a bit of time because some infrastructure components are involved.

goerch commented 1 month ago

@mgravell: we have published a reduced test case which hopefully helps to reproduce the issue.

goerch commented 1 month ago

Unfortunately our analysis so far seems to be incomplete or wrong: even with full serialization and cache purging we started to see the problem in our production code. I'm out of my depth trying to investigate the details of Dapper's deserialization. Our workaround for now is a simple implementation of the needed Dapper interfaces using ADO.NET and basic reflection.
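The fallback described is essentially a hand-rolled reader-to-entity mapper. A minimal sketch, assuming a TEntity with writable properties whose names match the result columns (illustrative only, not the actual production code):

    // using System; using System.Collections.Generic; using System.Data.Common;
    // using System.Linq; using System.Threading; using System.Threading.Tasks;
    // Map each row onto a new TEntity by matching column names to writable properties (case-insensitive).
    private static async Task<List<TEntity>> QueryAsync<TEntity>(DbCommand command, CancellationToken token)
        where TEntity : new()
    {
        var properties = typeof(TEntity).GetProperties()
            .Where(p => p.CanWrite)
            .ToDictionary(p => p.Name, StringComparer.OrdinalIgnoreCase);

        var results = new List<TEntity>();
        await using var reader = await command.ExecuteReaderAsync(token).ConfigureAwait(false);
        while (await reader.ReadAsync(token).ConfigureAwait(false))
        {
            var entity = new TEntity();
            for (int i = 0; i < reader.FieldCount; i++)
            {
                if (properties.TryGetValue(reader.GetName(i), out var property)
                    && !await reader.IsDBNullAsync(i, token).ConfigureAwait(false))
                {
                    property.SetValue(entity, reader.GetValue(i));
                }
            }
            results.Add(entity);
        }
        return results;
    }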