bitnine-oss / agensgraph

AgensGraph, a transactional graph database based on PostgreSQL
http://www.agensgraph.org

BUG - could not serialize access due to concurrent update #605

Closed aviNrich closed 1 year ago

aviNrich commented 1 year ago

Hi, I'm running my POC on branch https://github.com/bitnine-oss/agensgraph/tree/v2.14. When running this query from Node.js:

MERGE (n:entities { text: 'The Text', label: 'PERSON' })
  ON MATCH SET
    n.originId = '43', n.type = 'null', n.text = 'The Text', n.label = 'PERSON'
  ON CREATE SET
    n.originId = '43', n.type = 'null', n.text = 'The Text', n.label = 'PERSON'
RETURN n

Error:

{
  "length": 123,
  "name": "error",
  "severity": "ERROR",
  "code": "40001",
  "file": "execCypherSet.c",
  "line": "609",
  "routine": "LegacyUpdateElemProp"
}
error: could not serialize access due to concurrent update
    at Parser.parseErrorMessage (/home/avi/work/backend/node_modules/pg-protocol/src/parser.ts:369:69)
    at Parser.handlePacket (/home/avi/work/backend/node_modules/pg-protocol/src/parser.ts:188:21)
    at Parser.parse (/home/avi/work/backend/node_modules/pg-protocol/src/parser.ts:103:30)
    at Socket.<anonymous> (/home/avi/work/backend/node_modules/pg-protocol/src/index.ts:7:48)
    at Socket.emit (node:events:527:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at Socket.Readable.push (node:internal/streams/readable:228:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)
    at TCP.callbackTrampoline (node:internal/async_hooks:130:17)

I'm using https://github.com/bitnine-oss/agensgraph-nodejs. This error is also generated when using the pg npm package directly.

I should mention that the query runs inside a transaction with SET TRANSACTION ISOLATION LEVEL SERIALIZABLE.
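
Roughly, the flow looks like this (a minimal sketch using pg directly; the connection settings are placeholders and graph_path is assumed to be configured already):

import { Pool } from 'pg';

const pool = new Pool({ connectionString: 'postgres://localhost/agens' }); // placeholder

async function upsertEntity(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');
    await client.query(`
      MERGE (n:entities { text: 'The Text', label: 'PERSON' })
      ON MATCH SET n.originId = '43', n.type = 'null', n.text = 'The Text', n.label = 'PERSON'
      ON CREATE SET n.originId = '43', n.type = 'null', n.text = 'The Text', n.label = 'PERSON'
      RETURN n
    `);
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err; // the 40001 error above surfaces here
  } finally {
    client.release();
  }
}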

I assume it's something related to the isolation level...

Any help would be appreciated. thanks

samoscyberallenh commented 1 year ago

I've been able to reproduce this using two simple transactions. First, create a VLABEL named "test" and a single vertex with that label:

create vlabel test;
create (:test {id: 1});

Then in two separate sessions, run these commands:

Transaction 1                                       Transaction 2
-------------                                       -------------
begin;
match (t:test {id: 1}) set t.description = 'test';
                                                    begin;
                                                    match (t:test {id: 1}) set t.description = 'other';
                                                    <transaction blocks here as expected>
commit;
                                                    ERROR:  could not serialize access due to concurrent update

You can get a similar but different message if you use DELETE instead of SET:

Transaction 1                                       Transaction 2
-------------                                       -------------
begin;
match (t:test {id: 1}) delete t;
                                                    begin;
                                                    match (t:test {id: 1}) delete t;
                                                    <transaction blocks here as expected>
commit;
                                                    ERROR:  unrecognized heap_update status: 4

These may be separate issues, but the messages are emitted from similar parts of the code. Note that in both cases there are TODO comments (in src/backend/executor/nodeModifyGraph.c).

The first one is likely coming from the function updateElemProp:

        case TM_Updated:
            /* TODO: A solution to concurrent update is needed. */
            ereport(ERROR,
                    (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                     errmsg("could not serialize access due to concurrent update")));
            break;
        default:
            elog(ERROR, "unrecognized heap_update status: %u", result);

The second one is likely coming from a similar switch statement in the deleteElem function.

I'm not sure how to fix this problem in the core AgensGraph. For now, we're implementing an approach using PostgreSQL advisory locks to get around the problem.

Something like this:

Transaction 1                                       Transaction 2
-------------                                       -------------
begin;
select pg_advisory_xact_lock(1);
match (t:test {id: 1}) set t.description = 'first';
                                                    begin;
                                                    select pg_advisory_xact_lock(1);
                                                    <transaction blocks here>
commit;
                                                    match (t:test {id: 1}) set t.description = 'second';
                                                    commit;

After the above sequence, the value of the description property is "second" and there are no errors.

I'm not proposing the above as a generic workaround, just that it might be something that works for you depending on your application. It may mean that you have to run your queries twice for this to work: once to find the IDs to lock and once to do the update. Obviously that's not ideal. You also probably need to acquire your locks in a prescribed order to reduce the chance of deadlocks (i.e., you probably want to sort the ID values you're using so the locks are always acquired in order), as in the sketch below.
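
For illustration, a rough sketch of that pattern from node (the numeric ids used as lock keys and the query text are placeholders; pg_advisory_xact_lock holds the lock until the transaction commits or rolls back):

import { PoolClient } from 'pg';

// Sketch only: ids are whatever numeric keys identify the vertices to update.
async function updateWithAdvisoryLocks(client: PoolClient, ids: number[]): Promise<void> {
  await client.query('BEGIN');
  try {
    // Acquire locks in sorted order to reduce the chance of deadlocks.
    for (const id of [...ids].sort((a, b) => a - b)) {
      await client.query('SELECT pg_advisory_xact_lock($1)', [id]);
    }
    for (const id of ids) {
      // Values inlined for the sketch; validate/escape them in real code.
      await client.query(`MATCH (t:test {id: ${id}}) SET t.description = 'updated'`);
    }
    await client.query('COMMIT'); // advisory locks are released here
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}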

Hope this helps.

samoscyberallenh commented 1 year ago

Upon further reflection and investigation, I think the right application behavior here is to retry the transaction. For the case this bug was logged about (could not serialize access due to concurrent update), AgensGraph returns SQLSTATE 40001 (serialization_failure) in the exception. The recommended way of handling this error in PostgreSQL generally is to retry the transaction. The following quote is from https://www.postgresql.org/docs/15/mvcc-serialization-failure-handling.html:

Both Repeatable Read and Serializable isolation levels can produce errors that are designed to prevent serialization anomalies. As previously stated, applications using these levels must be prepared to retry transactions that fail due to serialization errors. Such an error's message text will vary according to the precise circumstances, but it will always have the SQLSTATE code 40001 (serialization_failure).

One could argue that AgensGraph could do more to mimic PostgreSQL's behavior in similar situations, but I'm not sure the current behavior is a bug given that it does return serialization_failure.

Now, the delete behavior described in my comment above should probably be considered a bug because it returns a SQLSTATE of XX000 (internal_error). It's pretty easy to work around, though: just retry on that specific error as you would on a serialization_failure.

So I retract my possible workaround using advisory locks.

aviNrich commented 1 year ago

@samoscyberallenh thank you. Just to see that I understood your possible solution: you are suggesting to retry the transaction whenever I see code 40001? If so, what happens if it fails again? Should I retry again? Thanks

samoscyberallenh commented 1 year ago

Just to see that I understood your possible solution: you are suggesting to retry the transaction whenever I see code 40001? If so, what happens if it fails again? Should I retry again?

I poked around a little looking for a good reference for this and found this presentation:

https://www.postgresql.org/files/developer/concurrency.pdf

There's good guidance and examples in there.

If you have to have multiple concurrent writer transactions in your application, it's a good idea to define a retry strategy. That means answering questions like:

When should you retry transactions?

For PostgreSQL generally, you want to retry these:

40001 (serialization_failure)
40P01 (deadlock_detected)

The page I mentioned above contains the justification for that (https://www.postgresql.org/docs/15/mvcc-serialization-failure-handling.html).

You can see all PostgreSQL-defined SQLSTATE values in this file (if you're interested): https://github.com/bitnine-oss/agensgraph/blob/main/src/backend/utils/errcodes.txt

For AgensGraph specifically, I'd also suggest retrying exceptions with a SQLSTATE of "XX000" (internal_error) and a message that contains the text "unrecognized heap_update status:". This is something that could be improved in AgensGraph; in my opinion, the case I showed above that returns this value should actually return serialization_failure/40001 (but I could be oversimplifying because I'm no PostgreSQL internals expert).

It is important that you don't retry everything; some errors are just going to keep happening, so retrying them isn't helpful.

It's also important that you roll back after the failure, although PG may make you do that anyway. See the classification sketch below.
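
As a sketch, that classification might look like this in node (err.code is the SQLSTATE on errors thrown by pg):

// Minimal sketch: decide whether a failed transaction is worth retrying.
function isRetryable(err: { code?: string; message?: string }): boolean {
  if (err.code === '40001') return true; // serialization_failure
  if (err.code === '40P01') return true; // deadlock_detected
  // The AgensGraph delete case above: internal_error with a telltale message.
  return err.code === 'XX000' &&
    (err.message ?? '').includes('unrecognized heap_update status');
}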

How many times to retry?

A lot of applications retry the "retryable" errors until they stop happening (i.e., forever). I've always put some sort of a limit on it, but it probably should be a pretty high number because some applications have a lot of concurrent access.

How long to wait between tries?

When a DB is under heavy write load, it may benefit from having failed transactions wait a bit before retrying. A lot of guidance I've seen doesn't actually do that, though, so I'd say make it configurable so your Ops team (or you!) can adjust it to work around problems.
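
Putting that together, a retry wrapper might look like this sketch (maxAttempts and delayMs are the knobs I'd make configurable; isRetryable is the helper sketched above):

import { Pool, PoolClient } from 'pg';

// Sketch: run `work` in its own transaction, retrying retryable failures.
async function withRetry<T>(
  pool: Pool,
  work: (client: PoolClient) => Promise<T>,
  maxAttempts = 10,
  delayMs = 100,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    const client = await pool.connect();
    try {
      await client.query('BEGIN');
      const result = await work(client);
      await client.query('COMMIT');
      return result;
    } catch (err) {
      await client.query('ROLLBACK'); // always roll back before retrying
      if (attempt >= maxAttempts ||
          !isRetryable(err as { code?: string; message?: string })) {
        throw err;
      }
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    } finally {
      client.release();
    }
  }
}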

How do you avoid concurrency issues?

There are some things you may be able to do to avoid concurrency issues altogether, for example structuring your application so it doesn't have multiple concurrent writers in the first place.

This is a fairly complicated topic, and in most cases I've seen, applications simply let the serialization/deadlock errors happen and handle them via retry. If you get a lot of deadlocks, you can sometimes make things better by defining some sort of update ordering.

This seems like a good place to get started in understanding the topic: https://shusson.info/post/dealing-with-deadlocks-in-postgres