Open jrwashburn opened 8 months ago
This issue is a showstopper for being able to run the replayer - I thought I would be able to control batch size by setting a combination of "target_epoch_ledgers_state_hash" to target a lower height and "start_slot_since_genesis" to use the checkpoint files; however, it looks like the replayer loads all user commands first and I cannot get above ~70k blocks without hitting this error.
What seems to be a simple solution suggested by @psteckler in Discord https://discord.com/channels/484437221055922177/975822561113935892/1211555323383382016 would be to change the queries to look back to the start slot provided with a chain state of canonical, instead of going back to genesis.
Because you were able to run the queries successfully outside the context of replayer, I am suspicious that this may be addressed by using a build from https://github.com/MinaProtocol/mina/pull/15143- the older version of the SQL library we are using has some kind of psql-connection-scoped resource leakage that gets addressed by periodically resetting the DB connection.
Tested with tag: mina-mainnet=1.4.0beta2-fix-compatible-caqti-pool-337b4d7
from #15143 but still fails.
{"timestamp":"2024-02-27 03:26:40.751997Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1032, characters 6-17"},"message":"Successfully created Caqti pool for Postgresql","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:30.736920Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1061, characters 12-23"},"message":"Retrieving fork block state_hash","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:30.781937Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1079, characters 6-17"},"message":"Loading block information using target state hash and start slot","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:31.414247Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1121, characters 8-19"},"message":"Block chain from non-genesis start slot to target state hash has length = 9918","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:31.414330Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1138, characters 6-17"},"message":"Loading internal command ids","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:31.771227Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1142, characters 6-17"},"message":"Loading user command ids","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:32.801583Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1146, characters 6-17"},"message":"Obtained 684719 user command ids and 45190 internal command ids","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:32.801674Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1149, characters 6-17"},"message":"Loading internal commands","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:29:13.364743Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1174, characters 6-17"},"message":"Will replay 31930 internal commands","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:29:13.385940Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1201, characters 6-17"},"message":"Populating fee transfer via coinbase cache","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:29:13.419752Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1206, characters 6-17"},"message":"Loading user commands","metadata":{"pid":2519623}}
(monitor.ml.Error
(Failure
"Error querying for user commands with id 4650525, error Request to <postgresql://> failed: ERROR: out of shared memory\
\nHINT: You might need to increase max_pred_locks_per_transaction.\
\n. Query: \" SELECT type,fee_payer_id, source_id,receiver_id,fee,fee_token,token,amount,valid_until,memo,nonce,\\n blocks.id,blocks.height,blocks.global_slot_since_genesis,parent.global_slot_since_genesis,\\n sequence_no,status,created_token,\\n fee_payer_balance, source_balance, receiver_balance\\n\\n FROM (SELECT * FROM user_commands WHERE id = $1) AS uc\\n\\n INNER JOIN blocks_user_commands AS buc\\n\\n ON uc.id = buc.user_command_id\\n\\n INNER JOIN blocks\\n\\n ON blocks.id = buc.block_id\\n\\n INNER JOIN blocks as parent\\n\\n ON parent.id = blocks.parent_id\\n\\n \".")
("Raised at Stdlib.failwith in file \"stdlib.ml\", line 29, characters 17-33"
"Called from Async_kernel__Deferred1.M.map.(fun) in file \"src/deferred1.ml\", line 17, characters 40-45"
"Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5"
"Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"))
Forgot to update mina-archive; but still fails after upgrade. Gets a bit farther (user command it dies at is a bit higher)
{"timestamp":"2024-02-27 03:41:45.862124Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 971, characters 6-17"},"message":"Successfully created Caqti pool for Postgresql","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:31.732667Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1000, characters 12-23"},"message":"Retrieving fork block state_hash","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:31.763931Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1018, characters 6-17"},"message":"Loading block information using target state hash and start slot","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:32.558952Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1060, characters 8-19"},"message":"Block chain from non-genesis start slot to target state hash has length = 9918","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:32.559024Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1077, characters 6-17"},"message":"Loading internal command ids","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:32.904157Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1081, characters 6-17"},"message":"Loading user command ids","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:34.015175Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1085, characters 6-17"},"message":"Obtained 684719 user command ids and 45190 internal command ids","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:34.015266Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1088, characters 6-17"},"message":"Loading internal commands","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:44:28.543191Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1113, characters 6-17"},"message":"Will replay 31930 internal commands","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:44:28.560107Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1140, characters 6-17"},"message":"Populating fee transfer via coinbase cache","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:44:28.607310Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1145, characters 6-17"},"message":"Loading user commands","metadata":{"pid":2520940}}
(monitor.ml.Error
(Failure
"Error querying for user commands with id 4944449, error Request to <> failed: ERROR: out of shared memory\
\nHINT: You might need to increase max_pred_locks_per_transaction.\
\n Query: \" SELECT type,fee_payer_id, source_id,receiver_id,fee,fee_token,token,amount,valid_until,memo,nonce,\\n blocks.id,blocks.height,blocks.global_slot_since_genesis,parent.global_slot_since_genesis,\\n sequence_no,status,created_token,\\n fee_payer_balance, source_balance, receiver_balance\\n\\n FROM (SELECT * FROM user_commands WHERE id = $1) AS uc\\n\\n INNER JOIN blocks_user_commands AS buc\\n\\n ON uc.id = buc.user_command_id\\n\\n INNER JOIN blocks\\n\\n ON blocks.id = buc.block_id\\n\\n INNER JOIN blocks as parent\\n\\n ON parent.id = blocks.parent_id\\n\\n \".")
("Raised at Stdlib.failwith in file \"stdlib.ml\", line 29, characters 17-33"
"Called from Async_kernel__Deferred1.M.map.(fun) in file \"src/deferred1.ml\", line 17, characters 40-45"
"Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5"
"Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"))
@jrwashburn I experienced the same issue when running replayer couple time on different postgres schemas while interrupting it couple of times. What i did was restart postgres service. I don't say that this is a valid solution, but curious if that helped?
@jrwashburn Can you say more about your environment? Is it isolated just for migration or also is handling some load from mainnet ?
Client - Intel Xeon Plat 8280 64GB RAM DB - Digital Ocean 8cpu/32GB RAM
I have run with and without an active connection to a mina-archive node - both fail.
While digital ocean doesn't support setting max_pred_locks_per_transaction via their ui, sql, or doctl, it does work via the rest api. (ht Ducca).
I was able to update the setting and am running the replayer now.
Preliminary Checks
Description
This seems to be a well-known issue, and has been noted in passing in the comments in #10308, and some workarounds #10988, #14904 ; however, there doesn't seem to be an issue open to actually resolve the problem instead of throwing insane resources at the database. I think this is just the first of several resource-related errors that prevent validating an archive.
When running replayer for a sufficiently long chain of blocks, the replayer fails when Loading user commands. Since I use a hosted postgres instance, increasing the max_pred_locks_per_transaction is not an option.
Steps to Reproduce
(This generates the input file stopping at block 325,000)
/usr/local/bin/mina-replayer --checkpoint-interval 5000 --archive-uri postgresql://user:pass@host:port/db --input-file ./replayer_input.json --output-file replayer-output.json --repair-nonces --set-nonces --continue-on-error > replayer.log 2>&1
Fails with error out of shared memory.
Expected Result
Replayer will complete successfully.
Actual Result
Fails with error
Daemon version
Commit c980ba87c3417f40a7081225dfe7478c5ee70fd7 on branch master
How frequently do you see this issue?
Always
What is the impact of this issue on your ability to run a node?
Medium
Status
Status is not relevant. Full log of run attempt:
Additional information
No response