MinaProtocol / mina

Mina is a cryptocurrency protocol with a constant size blockchain, improving scaling while maintaining decentralization and security.
https://minaprotocol.com
Apache License 2.0

Archive Node stopped (Mina Berkeley 2.0rampup4) #14299

Open olton opened 10 months ago

olton commented 10 months ago

Preliminary Checks

Description

Archive process stopped with error

2023-10-08 06:41:55 UTC [Warn] Error when adding block data to the database, rolling back transaction: $error
error: "Request to <postgres://mina:_@localhost:5432/mina_archive> failed: Connection failure: server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n Query: \"BEGIN\"."
(monitor.ml.Error
 ("Async was unable to add a file descriptor to its table of open file descriptors"
  (file_descr 8)
  (error
   "Attempt to register a file descriptor with Async that Async believes it is already managing.")
  (backtrace
   ("Raised by primitive operation at Base__Backtrace.get in file \"src/backtrace.ml\", line 10, characters 2-48"
    "Called from Async_unix__Raw_scheduler.create_fd in file \"src/raw_scheduler.ml\", line 109, characters 66-84"
    "Called from Async_unix__Fd.create in file \"src/fd.ml\" (inlined), line 40, characters 2-112"
    "Called from Caqti_async.System.Unix.wrap_fd in file \"lib-async/caqti_async.ml\", line 44, characters 15-60"
    "Called from Caqti_driver_postgresql.Connect_functor.Make_connection_base.reset.(fun) in file \"lib-driver/caqti_driver_postgresql.ml\", line 603, characters 10-56"
    "Called from Async_kernel__Deferred0.bind.(fun) in file \"src/deferred0.ml\", line 54, characters 64-69"
    "Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5"
    "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"
    "Called from Async_kernel__Scheduler1.run_jobs in file \"src/scheduler1.ml\", line 335, characters 8-51"
    "Called from Async_kernel__Scheduler.run_cycle.run_jobs in file \"src/scheduler.ml\", line 173, characters 10-30"
    "Called from Async_kernel__Scheduler.run_cycle in file \"src/scheduler.ml\", line 181, characters 2-12"
    "Called from Async_unix__Raw_scheduler.have_lock_do_cycle in file \"src/raw_scheduler.ml\", line 631, characters 4-49"
    "Called from Async_unix__Raw_scheduler.be_the_scheduler.loop in file \"src/raw_scheduler.ml\", line 875, characters 6-16"
    "Called from Async_unix__Raw_scheduler.be_the_scheduler in file \"src/raw_scheduler.ml\", line 879, characters 24-31"
    "Called from Async_command.in_async.(fun) in file \"async_command/src/async_command.ml\", line 75, characters 21-38"
    "Called from Core_kernel__Command.For_unix.run.(fun) in file \"src/command.ml\", line 2453, characters 8-238"
    "Called from Base__Exn.handle_uncaught_aux in file \"src/exn.ml\", line 111, characters 6-10"
    "Called from Dune__exe__Archive in file \"src/app/archive/archive.ml\", line 15, characters 6-95"))
...
("Raised at Base__Error.raise in file \"src/error.ml\" (inlined), line 8, characters 14-30"
  "Called from Base__Error.raise_s in file \"src/error.ml\", line 9, characters 19-40"
  "Called from Async_unix__Fd.create in file \"src/fd.ml\" (inlined), line 40, characters 2-112"
  "Called from Caqti_async.System.Unix.wrap_fd in file \"lib-async/caqti_async.ml\", line 44, characters 15-60"
  "Called from Caqti_driver_postgresql.Connect_functor.Make_connection_base.reset.(fun) in file \"lib-driver/caqti_driver_postgresql.ml\", line 603, characters 10-56"
  "Called from Async_kernel__Deferred0.bind.(fun) in file \"src/deferred0.ml\", line 54, characters 64-69"
  "Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"))

Steps to Reproduce

No steps

Expected Result

null

Actual Result

null

How frequently do you see this issue?

Rarely

What is the impact of this issue on your ability to run a node?

Low

Status

Mina daemon status
-----------------------------------

Max observed block height:              6528
Max observed unvalidated block height:  0
Local uptime:                           12m3s
Chain id:                               3c41383994b87449625df91769dff7b507825c064287d30fada9286f3f1cb15e
Git SHA-1:                              14047c55517cf3587fc9a6ac55c8f7e80a419571
Configuration directory:                /home/olton/.mina-config
Peers:                                  33
User_commands sent:                     0
SNARK worker:                           None
SNARK work fee:                         100000000
Sync status:                            Bootstrap
Block producers running:                0
Coinbase receiver:                      Block producer
Consensus time now:                     epoch=1, slot=4747
Consensus mechanism:                    proof_of_stake
Consensus configuration:
        Delta:                     0
        k:                         290
        Slots per epoch:           7140
        Slot duration:             3m
        Epoch duration:            14d21h
        Chain start timestamp:     2023-09-13 13:01:01.000000Z
        Acceptable network delay:  3m

Addresses and ports:
        External IP:    
        Bind IP:        0.0.0.0
        Libp2p PeerID:  12D3KooWDLNXPq28An4s2QaPZX5ftem1AfaCWuxHHJq97opeWxLy
        Libp2p port:    8302
        Client port:    8301

Metrics:
        block_production_delay:             7 (0 0 0 0 0 0 0)
        transaction_pool_diff_received:     124
        transaction_pool_diff_broadcasted:  0
        transactions_added_to_pool:         0
        transaction_pool_size:              0

Additional information

No response

psteckler commented 10 months ago

That error is coming from Postgres. Do you have Postgres logs to share?

olton commented 10 months ago

I will see if there is a log, but the archive process should not stop with an error because of a temporary loss of connection to the database.

psteckler commented 10 months ago

> the archive process should not stop with an error due to temporary loss of connection to the database

Is there a log from the archive process indicating that it stopped? The poster states that the "archive process stopped", but the log shown above does not indicate that.

psteckler commented 10 months ago

@olton Did the archive process actually halt in this case? That wasn't clear to me.

olton commented 10 months ago

I don't know at the moment. I ran the archive process under pm2, so it would have been restarted after a crash.

psteckler commented 10 months ago

I'm checking what happens on a local network if the Postgres process is terminated.

psteckler commented 10 months ago

On a local network, the archive process does crash if the PostgreSQL process is terminated.

The archive process crash shows up as:

(monitor.ml.Error
 ("Async was unable to add a file descriptor to its table of open file descriptors"
  (file_descr 21)
  (error
   "Attempt to register a file descriptor with Async that Async believes it is already managing.")
  ...

To remedy this, we'd have to catch the failure and have Caqti establish a new connection once PostgreSQL is available again.
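
A minimal sketch of that catch-and-reconnect idea, in plain OCaml with no Caqti or Async calls. Everything in it is an assumption for illustration: `with_reconnect`, `connect`, `sleep`, `max_attempts` and `initial_delay` are hypothetical names, every error is treated as retryable, and the loop is synchronous, whereas the real archive process would need an Async-based version wrapped around its Caqti connection.

```ocaml
(* Hypothetical sketch: none of these names come from the Mina or Caqti code.

   [connect ()] tries to open a fresh database connection.
   [query conn] runs the database work on that connection.
   [sleep secs] waits between attempts (injected so the sketch needs no Unix).

   Each attempt opens a new connection instead of reusing the broken one.
   On failure we wait, double the delay, and try again, giving up after
   [max_attempts] attempts and returning the last error. *)
let with_reconnect ~connect ~sleep ?(max_attempts = 5) ?(initial_delay = 1.0)
    query =
  let rec attempt n delay =
    let result =
      match connect () with
      | Error msg -> Error msg (* database still unreachable *)
      | Ok conn -> query conn (* may itself fail if the connection drops *)
    in
    match result with
    | Ok v -> Ok v
    | Error msg when n >= max_attempts -> Error msg (* out of attempts *)
    | Error _ ->
        sleep delay;
        attempt (n + 1) (delay *. 2.0)
  in
  attempt 1 initial_delay
```

With something like this, a caller would pass the block-insertion step as `query`, so that a terminated Postgres process leads to a bounded series of reconnection attempts and an `Error` value rather than an uncaught exception. A real fix would also need to distinguish connection failures from other database errors, which this sketch does not attempt.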