MinaProtocol / mina

Mina is a cryptocurrency protocol with a constant size blockchain, improving scaling while maintaining decentralization and security.
https://minaprotocol.com
Apache License 2.0
1.99k stars 525 forks source link

mina-replayer fails with ERROR: out of shared memory #15211

Open jrwashburn opened 6 months ago

jrwashburn commented 6 months ago

Preliminary Checks

Description

This seems to be a well-known issue, and has been noted in passing in the comments in #10308, and some workarounds #10988, #14904 ; however, there doesn't seem to be an issue open to actually resolve the problem instead of throwing insane resources at the database. I think this is just the first of several resource-related errors that prevent validating an archive.

When running replayer for a sufficiently long chain of blocks, the replayer fails when Loading user commands. Since I use a hosted postgres instance, increasing the max_pred_locks_per_transaction is not an option.

(monitor.ml.Error (Failure "Error querying for user commands with id 1686617, error Request to postgresql://user:pwd@host:port/db failed: ERROR: out of shared memory
\nHINT: You might need to increase max_pred_locks_per_transaction.
\n. Query: " SELECT type,fee_payer_id, source_id,receiver_id,fee,fee_token,token,amount,valid_until,memo,nonce,\n blocks.id,blocks.height,blocks.global_slot_since_genesis,parent.global_slot_since_genesis,\n sequence_no,status,created_token,\n fee_payer_balance, source_balance, receiver_balance\n\n FROM (SELECT * FROM user_commands WHERE id = $1) AS uc\n\n INNER JOIN blocks_user_commands AS buc\n\n ON uc.id = buc.user_command_id\n\n INNER JOIN blocks\n\n ON blocks.id = buc.block_id\n\n INNER JOIN blocks as parent\n\n ON parent.id = blocks.parent_id\n\n ".") ("Raised at Stdlib.failwith in file "stdlib.ml", line 29, characters 17-33" "Called from Async_kernel__Deferred1.M.map.(fun) in file "src/deferred1.ml", line 17, characters 40-45" "Called from Async_kernel__Job_queue.run_job in file "src/job_queue.ml" (inlined), line 128, characters 2-5" "Called from Async_kernel__Job_queue.run_jobs in file "src/job_queue.ml", line 169, characters 6-47"))

Steps to Reproduce

  1. (This generates the input file stopping at block 325,000)

    cat /var/lib/coda/mainnet.json | jq '{target_epoch_ledgers_state_hash:"3NLo4bhk6CSp7jW994S25YRHhTs8rLJ1xNLBknAz73Bb5oc2ufYa",genesis_ledger:{add_genesis_winner: true, accounts: .ledger.accounts}}' > replayer_input.json
  2. /usr/local/bin/mina-replayer --checkpoint-interval 5000 --archive-uri postgresql://user:pass@host:port/db --input-file ./replayer_input.json --output-file replayer-output.json --repair-nonces --set-nonces --continue-on-error > replayer.log 2>&1

  3. Fails with error out of shared memory.

Expected Result

Replayer will complete successfully.

Actual Result

Fails with error

Daemon version

Commit c980ba87c3417f40a7081225dfe7478c5ee70fd7 on branch master

How frequently do you see this issue?

Always

What is the impact of this issue on your ability to run a node?

Medium

Status

Status is not relevant. Full log of run attempt:

{"timestamp":"2024-02-25 17:18:56.086893Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1032, characters 6-17"},"message":"Successfully created Caqti pool for Postgresql","metadata":{"pid":2368339}}
{"timestamp":"2024-02-25 17:18:58.693723Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1061, characters 12-23"},"message":"Retrieving fork block state_hash","metadata":{"pid":2368339}}
{"timestamp":"2024-02-25 17:18:58.719921Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1079, characters 6-17"},"message":"Loading block information using target state hash and start slot","metadata":{"pid":2368339}}
{"timestamp":"2024-02-25 17:19:15.387501Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1111, characters 10-21"},"message":"Block chain leading to target state hash includes genesis block, length = 324999","metadata":{"pid":2368339}}
{"timestamp":"2024-02-25 17:19:15.387562Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1138, characters 6-17"},"message":"Loading internal command ids","metadata":{"pid":2368339}}
{"timestamp":"2024-02-25 17:19:17.224033Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1142, characters 6-17"},"message":"Loading user command ids","metadata":{"pid":2368339}}
{"timestamp":"2024-02-25 17:19:25.419784Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1146, characters 6-17"},"message":"Obtained 5270130 user command ids and 288137 internal command ids","metadata":{"pid":2368339}}
{"timestamp":"2024-02-25 17:19:25.419860Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1149, characters 6-17"},"message":"Loading internal commands","metadata":{"pid":2368339}}
(monitor.ml.Error
 (Failure
   "Error querying for internal commands with id 111327, error Request to <postgresql://user:pass@host:port/db> failed: ERROR:  out of shared memory\
  \nHINT:  You might need to increase max_pred_locks_per_transaction.\
  \n. Query: \" SELECT type,receiver_id,receiver_balance,fee,token,\\n              b.id,b.height,b.global_slot_since_genesis,parent.global_slot_since_genesis,\\n              receiver_account_creation_fee_paid,\\n              sequence_no,secondary_sequence_no\\n\\n            FROM (SELECT * FROM internal_commands WHERE id = $2) AS ic\\n\\n            INNER JOIN blocks_internal_commands AS bic\\n\\n            ON ic.id = bic.internal_command_id\\n\\n            INNER JOIN blocks AS b\\n\\n            ON b.id = bic.block_id\\n\\n            INNER JOIN blocks as parent\\n\\n            ON parent.id = b.parent_id\\n\\n            WHERE b.global_slot_since_genesis >= $1\\n\\n       \".")
 ("Raised at Stdlib.failwith in file \"stdlib.ml\", line 29, characters 17-33"
  "Called from Async_kernel__Deferred1.M.map.(fun) in file \"src/deferred1.ml\", line 17, characters 40-45"
  "Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"))```

Additional information

No response

jrwashburn commented 6 months ago

This issue is a showstopper for being able to run the replayer - I thought I would be able to control batch size by setting a combination of "target_epoch_ledgers_state_hash" to target a lower height and "start_slot_since_genesis" to use the checkpoint files; however, it looks like the replayer loads all user commands first and I cannot get above ~70k blocks without hitting this error.

jrwashburn commented 6 months ago

What seems to be a simple solution suggested by @psteckler in Discord https://discord.com/channels/484437221055922177/975822561113935892/1211555323383382016 would be to change the queries to look back to the start slot provided with a chain state of canonical, instead of going back to genesis.

emberian commented 6 months ago

Because you were able to run the queries successfully outside the context of replayer, I am suspicious that this may be addressed by using a build from https://github.com/MinaProtocol/mina/pull/15143- the older version of the SQL library we are using has some kind of psql-connection-scoped resource leakage that gets addressed by periodically resetting the DB connection.

jrwashburn commented 6 months ago

Tested with tag: mina-mainnet=1.4.0beta2-fix-compatible-caqti-pool-337b4d7 from #15143 but still fails.

{"timestamp":"2024-02-27 03:26:40.751997Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1032, characters 6-17"},"message":"Successfully created Caqti pool for Postgresql","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:30.736920Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1061, characters 12-23"},"message":"Retrieving fork block state_hash","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:30.781937Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1079, characters 6-17"},"message":"Loading block information using target state hash and start slot","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:31.414247Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1121, characters 8-19"},"message":"Block chain from non-genesis start slot to target state hash has length = 9918","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:31.414330Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1138, characters 6-17"},"message":"Loading internal command ids","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:31.771227Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1142, characters 6-17"},"message":"Loading user command ids","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:32.801583Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1146, characters 6-17"},"message":"Obtained 684719 user command ids and 45190 internal command ids","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:28:32.801674Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1149, characters 6-17"},"message":"Loading internal commands","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:29:13.364743Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1174, characters 6-17"},"message":"Will replay 31930 internal commands","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:29:13.385940Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1201, characters 6-17"},"message":"Populating fee transfer via coinbase cache","metadata":{"pid":2519623}}
{"timestamp":"2024-02-27 03:29:13.419752Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1206, characters 6-17"},"message":"Loading user commands","metadata":{"pid":2519623}}
(monitor.ml.Error
 (Failure
   "Error querying for user commands with id 4650525, error Request to <postgresql://> failed: ERROR:  out of shared memory\
  \nHINT:  You might need to increase max_pred_locks_per_transaction.\
  \n. Query: \" SELECT type,fee_payer_id, source_id,receiver_id,fee,fee_token,token,amount,valid_until,memo,nonce,\\n                   blocks.id,blocks.height,blocks.global_slot_since_genesis,parent.global_slot_since_genesis,\\n                   sequence_no,status,created_token,\\n                   fee_payer_balance, source_balance, receiver_balance\\n\\n            FROM (SELECT * FROM user_commands WHERE id = $1) AS uc\\n\\n            INNER JOIN blocks_user_commands AS buc\\n\\n            ON uc.id = buc.user_command_id\\n\\n            INNER JOIN blocks\\n\\n            ON blocks.id = buc.block_id\\n\\n            INNER JOIN blocks as parent\\n\\n            ON parent.id = blocks.parent_id\\n\\n       \".")
 ("Raised at Stdlib.failwith in file \"stdlib.ml\", line 29, characters 17-33"
  "Called from Async_kernel__Deferred1.M.map.(fun) in file \"src/deferred1.ml\", line 17, characters 40-45"
  "Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"))
jrwashburn commented 6 months ago

Forgot to update mina-archive; but still fails after upgrade. Gets a bit farther (user command it dies at is a bit higher)

{"timestamp":"2024-02-27 03:41:45.862124Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 971, characters 6-17"},"message":"Successfully created Caqti pool for Postgresql","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:31.732667Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1000, characters 12-23"},"message":"Retrieving fork block state_hash","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:31.763931Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1018, characters 6-17"},"message":"Loading block information using target state hash and start slot","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:32.558952Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1060, characters 8-19"},"message":"Block chain from non-genesis start slot to target state hash has length = 9918","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:32.559024Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1077, characters 6-17"},"message":"Loading internal command ids","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:32.904157Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1081, characters 6-17"},"message":"Loading user command ids","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:34.015175Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1085, characters 6-17"},"message":"Obtained 684719 user command ids and 45190 internal command ids","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:43:34.015266Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1088, characters 6-17"},"message":"Loading internal commands","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:44:28.543191Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1113, characters 6-17"},"message":"Will replay 31930 internal commands","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:44:28.560107Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1140, characters 6-17"},"message":"Populating fee transfer via coinbase cache","metadata":{"pid":2520940}}
{"timestamp":"2024-02-27 03:44:28.607310Z","level":"Info","source":{"module":"Dune__exe__Replayer","location":"File \"src/app/replayer/replayer.ml\", line 1145, characters 6-17"},"message":"Loading user commands","metadata":{"pid":2520940}}
(monitor.ml.Error
 (Failure
   "Error querying for user commands with id 4944449, error Request to <> failed: ERROR:  out of shared memory\
  \nHINT:  You might need to increase max_pred_locks_per_transaction.\
  \n Query: \" SELECT type,fee_payer_id, source_id,receiver_id,fee,fee_token,token,amount,valid_until,memo,nonce,\\n                   blocks.id,blocks.height,blocks.global_slot_since_genesis,parent.global_slot_since_genesis,\\n                   sequence_no,status,created_token,\\n                   fee_payer_balance, source_balance, receiver_balance\\n\\n            FROM (SELECT * FROM user_commands WHERE id = $1) AS uc\\n\\n            INNER JOIN blocks_user_commands AS buc\\n\\n            ON uc.id = buc.user_command_id\\n\\n            INNER JOIN blocks\\n\\n            ON blocks.id = buc.block_id\\n\\n            INNER JOIN blocks as parent\\n\\n            ON parent.id = blocks.parent_id\\n\\n       \".")
 ("Raised at Stdlib.failwith in file \"stdlib.ml\", line 29, characters 17-33"
  "Called from Async_kernel__Deferred1.M.map.(fun) in file \"src/deferred1.ml\", line 17, characters 40-45"
  "Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"))
dkijania commented 6 months ago

@jrwashburn I experienced the same issue when running replayer couple time on different postgres schemas while interrupting it couple of times. What i did was restart postgres service. I don't say that this is a valid solution, but curious if that helped?

dkijania commented 6 months ago

@jrwashburn Can you say more about your environment? Is it isolated just for migration or also is handling some load from mainnet ?

jrwashburn commented 6 months ago

Client - Intel Xeon Plat 8280 64GB RAM DB - Digital Ocean 8cpu/32GB RAM

I have run with and without an active connection to a mina-archive node - both fail.

jrwashburn commented 5 months ago

While digital ocean doesn't support setting max_pred_locks_per_transaction via their ui, sql, or doctl, it does work via the rest api. (ht Ducca).

I was able to update the setting and am running the replayer now.