MinaProtocol / mina

Mina is a cryptocurrency protocol with a constant size blockchain, improving scaling while maintaining decentralization and security.
https://minaprotocol.com
Apache License 2.0
1.98k stars 525 forks source link

UMT-upgrade-mechanism-testing archive node does not finish loading genesis ledger #15207

Open jrwashburn opened 6 months ago

jrwashburn commented 6 months ago

Preliminary Checks

Description

Started archive node - several hours go by with no additional log messages after this:

Feb 23 15:12:22 nym3 systemd[735]: Started Mina Archive Service.
Feb 23 15:12:28 nym3 mina-archive[1308272]: {"timestamp":"2024-02-23 20:12:28.155412Z","level":"Info","source":{"module":"Genesis_ledger_helper","location":"File \"src/lib/genesis_ledger_helper/genesis_ledger_helper.ml\", line 820, characters 2-13"},"message":"Initializing with runtime configuration. Ledger name: $name","metadata":{"name":"UMT","pid":1308272}}
Feb 23 15:12:28 nym3 mina-archive[1308272]: {"timestamp":"2024-02-23 20:12:28.155521Z","level":"Info","source":{"module":"Genesis_ledger_helper","location":"File \"src/lib/genesis_ledger_helper/genesis_ledger_helper.ml\", line 842, characters 8-19"},"message":"Using the compiled constraint constants","metadata":{"pid":1308272}}
Feb 23 15:12:28 nym3 mina-archive[1308272]: {"timestamp":"2024-02-23 20:12:28.909294Z","level":"Info","source":{"module":"Genesis_ledger_helper","location":"File \"src/lib/genesis_ledger_helper/genesis_ledger_helper.ml\", line 890, characters 2-13"},"message":"Loaded genesis ledger from $ledger_file","metadata":{"ledger_file":"/tmp/coda_cache_dir/genesis_ledger_accounts_81d2056807c18d45d06f542530fcfd5113f313405827f6f6ea3e3c0025574a82.tar.gz","pid":1308272}}

During this time, mina-archive is not listening for connections. Database has a block 1, and daemon is complaining that it cannot connect to the archive.

I restarted the archive process after 2 hours, eventually removing --config-file parameter so that it would listen for the daemon and start getting blocks.

Later, I started a second instance of the archive process with the --config-file parameter and a different server port and let it run overnight. 6 hours in, it threw a deadlock error trying to insert public_keys for an account that is present.

2024-02-24 05:49:46 UTC [Info] Initializing with runtime configuration. Ledger name: "UMT"
2024-02-24 05:49:46 UTC [Info] Using the compiled constraint constants
2024-02-24 05:49:47 UTC [Info] Loaded genesis ledger from $ledger_file
        ledger_file: "/tmp/coda_cache_dir/genesis_ledger_accounts_81d2056807c18d45d06f542530fcfd5113f313405827f6f6ea3e3c0025574a82.tar.gz"
2024-02-24 10:48:23 UTC [Error] Failed to add genesis account: $account, see $error
        account: {
  "public_key": "B62qmATfKHQWCNoGafsmURNmDbaQd2prUNLQeF3G9J1jz2EbAwp2AeA",
  "token_id": "1",
  "token_permissions": [ "Not_owned", { "account_disabled": false } ],
  "balance": "1000000000000",
  "nonce": "0",
  "receipt_chain_hash":
    "2mzbV7WevxLuchs2dAMY4vQBS6XttnCUF8Hvks4XNBQ5qiSGGBQe",
  "delegate": "B62qmATfKHQWCNoGafsmURNmDbaQd2prUNLQeF3G9J1jz2EbAwp2AeA",
  "voting_for": "3NK2tkzqqK5spR2sZ7tujjqPksL45M3UUrcA4WhCkeiPtnugyE2x",
  "timing": [ "Untimed" ],
  "permissions": {
    "stake": true,
    "edit_state": [ "Signature" ],
    "send": [ "Signature" ],
    "receive": [ "None" ],
    "set_delegate": [ "Signature" ],
    "set_permissions": [ "Signature" ],
    "set_verification_key": [ "Signature" ]
  },
  "snapp": null
}
        error: "Request to <postgresql://user:db:port/umt-mainnet?sslmode=require> failed: ERROR:  deadlock detected\nDETAIL:  Process 1679555 waits for ShareLock on transaction 7504992; blocked by process 1529659.\nProcess 1529659 waits for ShareLock on transaction 7503864; blocked by process 1679555.\nHINT:  See server log for query details.\nCONTEXT:  while inserting index tuple (2294,3) in relation \"public_keys_value_key\"\n. Query: \"INSERT INTO public_keys (value) VALUES ($1) RETURNING id\"."
2024-02-24 10:48:23 UTC [Warn] Error in add_genesis_accounts : $error. Retrying...
        error: "Request to <postgresql://user:host:port/umt-mainnet?sslmode=require> failed: ERROR:  deadlock detected\nDETAIL:  Process 1679555 waits for ShareLock on transaction 7504992; blocked by process 1529659.\nProcess 1529659 waits for ShareLock on transaction 7503864; blocked by process 1679555.\nHINT:  See server log for query details.\nCONTEXT:  while inserting index tuple (2294,3) in relation \"public_keys_value_key\"\n. Query: \"INSERT INTO public_keys (value) VALUES ($1) RETURNING id\"."
2024-02-24 10:48:25 UTC [Info] Initializing with runtime configuration. Ledger name: "UMT"
2024-02-24 10:48:25 UTC [Info] Using the compiled constraint constants
2024-02-24 10:48:26 UTC [Info] Loaded genesis ledger from $ledger_file
        ledger_file: "/tmp/coda_cache_dir/genesis_ledger_accounts_81d2056807c18d45d06f542530fcfd5113f313405827f6f6ea3e3c0025574a82.tar.gz"

There are only 58 rows in the public_keys table currently, and there are several large id gaps on the public_keys table e.g. no keys with id after 1 until 110667, and several other large (and small) gaps.

Steps to Reproduce

  1. Create new database
  2. Run schema script
  3. Start archive node with
    ExecStart=/usr/local/bin/mina-archive run \
    --postgres-uri $POSTGRES_URI \
    --server-port 3086 \
    --metrics-port 3099 \
    --log-json \
    --log-level DEBUG \
    --config-file /var/lib/coda/config_2025a732.json
  4. Stop archive node after two hours, restart without --config-file
  5. Separately, restart archive node with (so both are running)
    mina-archive run --postgres-uri postgresql://user:pass@db:port/umt-mainnet?sslmode=require --config-file /var/lib/coda/config_2025a732.json --server-port 4086
  6. second instance died with deadlock reported after 5 hours; never starts listening for connections (so far.)

Expected Result

Archive will load genesis accounts and start listening for blocks

Actual Result

Archive never finishes loading genesis blocks.

Daemon version

Commit [DIRTY]2025a732df43c6735a576aff907252f74cf2ef35 on branch testing/hard-fork-internal

How frequently do you see this issue?

Always

What is the impact of this issue on your ability to run a node?

Blocker

Status

mina client status
Mina daemon status
-----------------------------------

Global number of accounts:                     201806
Block height:                                  59
Max observed block height:                     81
Max observed unvalidated block height:         81
Local uptime:                                  8h1m2s
Ledger Merkle root:                            jy1ZvdzScUkWJDnMKnhispuohRMViiX4M3NncS4LB6Ed7XH7jeT
Protocol state hash:                           3NKpZWHWGxq1bHC47ZX6h8HGoZE5YQaNpHGNv5jX5W3zZe2xawW2
Chain id:                                      100ef4660dd2f1dc184e5730cc21f5ecfbe2f7b24af237fe049ca425cea6633e
Git SHA-1:                                     [DIRTY]2025a732df43c6735a576aff907252f74cf2ef35
Configuration directory:                       /home/minar/.mina-config
Peers:                                         41
User_commands sent:                            0
SNARK worker:                                  None
SNARK work fee:                                100000000
Sync status:                                   Catchup
Catchup status:                                
        To build breadcrumb:           0
        To initial validate:           0
        Finished:                      70
        To download:                   0
        Waiting for parent to finish:  0
        To verify:                     0

Block producers running:                       0
Coinbase receiver:                             Block producer
Best tip consensus time:                       epoch=0, slot=489
Best tip global slot (across all hard-forks):  489
Consensus time now:                            epoch=0, slot=547
Consensus mechanism:                           proof_of_stake
Consensus configuration:                       
        Delta:                     0
        k:                         290
        Slots per epoch:           7140
        Slot duration:             3m
        Epoch duration:            14d21h
        Chain start timestamp:     2024-02-23 10:00:00.000000Z
        Acceptable network delay:  3m

Addresses and ports:                           
        External IP:    209.134.35.154
        Bind IP:        0.0.0.0
        Libp2p PeerID:  12D3KooWRkA6kt61EZhoTuKvM7rEWzzwRUMbNoSs9BtfYUpLqb8B
        Libp2p port:    8304
        Client port:    8301

Metrics:                                       
        block_production_delay:             7 (0 0 0 0 0 0 0)
        transaction_pool_diff_received:     132
        transaction_pool_diff_broadcasted:  0
        transactions_added_to_pool:         13
        transaction_pool_size:              3

Additional information

There are no log messages after the messages in this post, so not sharing logs. mina-archive seems to just be slowly busy doing something after it reports Loaded genesis ledger from $ledger_file.

jrwashburn commented 6 months ago

Next run took another ~5 hours but did eventually complete:

2024-02-24 10:48:25 UTC [Info] Initializing with runtime configuration. Ledger name: "UMT"
2024-02-24 10:48:25 UTC [Info] Using the compiled constraint constants
2024-02-24 10:48:26 UTC [Info] Loaded genesis ledger from $ledger_file        ledger_file: "/tmp/coda_cache_dir/genesis_ledger_accounts_81d2056807c18d45d06f542530fcfd5113f313405827f6f6ea3e3c0025574a82.tar.gz"
2024-02-24 15:46:59 UTC [Info] Archive process ready. Clients can now connect
jrwashburn commented 6 months ago

I am not closing this issue because I assume this is an unacceptably long duration for the actual hard fork. If it can be run well in advance of the cutover window, it may be closed. (And of course, improving performance would be better.)

jrwashburn commented 6 months ago

I started up a new archive database because the replayer is reporting nonce discrepancies. That's giving me an opportunity to experience this really long / slow startup again.

I'm running this time on a 16c/64GB Intel Xeon Platinum 8358, and I'm watching it run. I've turned off the mina daemon so the only thing running on this dedicated box is mina-archive. It was sometimes using 100% of one core early on - I think when doing some file processing of the input json perhaps. Since the some single core utilization in the beginning the box is almost always idle. It has a database connection that has been established, and is slowly doing one thing at a time (for now, lookup a key id, and insert its timing info.

I suspect this slowness is all network related waiting on lots of tiny queries, all wrapped up in a client-side transaction so the client is a bit slow. I'm just watching it slowly, 1 by 1 select id from public keys, then insert timing info for the public key.

The queries themselves are basically instantaneous, the connections is always sitting in idle waiting on the client.

This really needs to be reconsidered before a mainnet upgrade. It may be fine when the archive node and db are colocated on a single machine, but over the network, this is painfully slow.

jrwashburn commented 6 months ago

I was able to confirm that this is just network. The node I started yesterday is still trying to process the config file. Spinning up a new node with a local archive db finished in 6 minutes. The queries need to be redesigned to support processing on a remote database within a reasonable amount of time. I don't think it's reasonable to expect the process to be running on the same hardware as the database.

snaitmouloud-mf commented 5 months ago

I was able to confirm that this is just network. The node I started yesterday is still trying to process the config file. Spinning up a new node with a local archive db finished in 6 minutes. The queries need to be redesigned to support processing on a remote database within a reasonable amount of time. I don't think it's reasonable to expect the process to be running on the same hardware as the database.

Thanks @jrwashburn for the follow-up and update. We will keep this PR open for future improvements. Let me know if this is ok

jrwashburn commented 5 months ago

I don't think it's reasonable to expect the process to be running on the same hardware as the database.

If this issue won't be resolved before the fork, then the trust-less process needs to be changed to include a local Postgres installation, create the new database locally, start the archive with --config-file / pointed to local db, then backup the local db and restore remotely to the "production" server, and change archive to point there. That needs to happen before the network restart because of #15261.