azgs / azlibrary_database

1 stars 1 forks source link

pg-promise has trouble with parallel processing #77

Closed NoisyFlowers closed 6 months ago

NoisyFlowers commented 6 months ago

While working on azgs/azlibrary_api#98, I encountered the following error (JSON from api server):

{
    "code": 500,
    "msg": "Problem while importing. Please try again later",
    "error": [
        [
            {
                "data": [
                    {
                        "data": [
                            null,
                            null
                        ],
                        "status": "resolved"
                    }
                ],
                "status": "resolved"
            },
            {
                "data": [],
                "status": "resolved"
            },
            {
                "data": [
                    null,
                    null,
                    null,
                    null
                ],
                "status": "resolved"
            },
            {
                "error": {
                    "code": "3B001",
                    "file": "xact.c",
                    "line": "4409",
                    "name": "error",
                    "query": "release savepoint sp_1_2",
                    "stack": "error: savepoint \"sp_1_2\" does not exist\n    at Parser.parseErrorMessage (/home/ec2-user/repos/azlibrary_database/azlibAdd/node_modules/pg-protocol/dist/parser.js:287:98)\n    at Parser.handlePacket (/home/ec2-user/repos/azlibrary_database/azlibAdd/node_modules/pg-protocol/dist/parser.js:126:29)\n    at Parser.parse (/home/ec2-user/repos/azlibrary_database/azlibAdd/node_modules/pg-protocol/dist/parser.js:39:38)\n    at TLSSocket.<anonymous> (/home/ec2-user/repos/azlibrary_database/azlibAdd/node_modules/pg-protocol/dist/index.js:11:42)\n    at TLSSocket.emit (node:events:513:28)\n    at addChunk (node:internal/streams/readable:315:12)\n    at readableAddChunk (node:internal/streams/readable:289:9)\n    at TLSSocket.Readable.push (node:internal/streams/readable:228:10)\n    at TLSWrap.onStreamRead (node:internal/stream_base_commons:190:23)",
                    "length": 93,
                    "message": "savepoint \"sp_1_2\" does not exist",
                    "routine": "ReleaseSavepoint",
                    "severity": "ERROR"
                },
                "status": "rejected"
            },
            {
                "status": "resolved"
            }
        ]
    ]
}

From this, it was clear that the problem was in azlibAdd and had to do with postgres access, which is handled by pg-promise (https://github.com/vitaly-t/pg-promise).

To debug, I had to run azlibAdd from command line on same dataset in order to get the full stdout. I added the pg-monitor package (https://github.com/vitaly-t/pg-monitor) to gain insight into what pg-promise was doing. From the output generated by that, I could see that it was creating savepoints labeled sp_1_1 and sp_1_2. But when it tried to release sp_1_2, it apparently did not exist.

Some background: savepoints are a postgres concept for emulating nested transactions, which postgres does not have. Because everything in azlibAdd happens inside a transaction and I use parallel promises via Promise.all, I had a suspicion that pg-promise was losing track of its labels across execution threads.

I rewrote the top level Promise.all logic to use chained promises instead, and the problem went away. Ok, great!

Some more background: parallel execution doesn't really exist inside a transaction. Postgres queues up db actions in the order they arrive. So, it's debatable whether I gained much in writing this code the way I did. But db activity is only part of the action in those threads. There is a fair amount of file system interaction as well, plus I thought it was cool, so I'm no apologizing. And I had never had trouble with pg-promise before. With the move to an AL2 environment, I updated the version of pg-promise to the latest. Apparently, something changed.

No problem, I'll just rip out all the parallel stuff. Except that that will have major structural code impacts throughout azlibAdd and be a nightmare for testing. Besides, the parallelism shouldn't be a problem for postgres. It's pg-promise that is stumbling over it.

After further research and experimentation, I found a better (e.g. minimal impact) way. The subroutines used by azlibAdd were written to take either a root db object or a transaction object. The reasons for this are lost in the sands of time and not important now. What is important is that some of these routines created a transaction to handle multiple database actions. Since, within azlibAdd, these routines are passed a transaction, this means they were trying to create a nested transaction, which, if you'll recall, postgres emulates with savepoints. These savepoints were labeled by pg-promise and pg-promise is losing track of these labels. So, each place a subroutine tried to create a transaction, I changed it to create a task. A task is a pg-promise construct used to force multiple db actions to use the same connection. From the pg-promise docs (https://vitaly-t.github.io/pg-promise/Task.html):

When executed on the root Database object, the connection is allocated from the pool, and once the method's callback has finished, the connection is released back to the pool. However, when invoked inside another task or transaction, the method reuses the parent connection.

By substituting task for transaction, there is no change to the fundamental logic of the subroutine and pg-promise creates no savepoint. I've tested it and the problem is resolved.

NoisyFlowers commented 6 months ago

This is implemented in b75f303055c0bdb4c154f441bdc0e29addd270c8 e47bc99c8c7e832659e8312436fde621a2e6f611

Running on prod