MinaProtocol / mina

Mina is a cryptocurrency protocol with a constant size blockchain, improving scaling while maintaining decentralization and security.
https://minaprotocol.com
Apache License 2.0

Testworld-2-0 archive node causes Postgres to act up #14414

Open EmrePiconbello opened 11 months ago

EmrePiconbello commented 11 months ago

Preliminary Checks

Description

The archive process fails with the error below. The problem is that after this happens, Postgres starts taking up all the resources on the server.

quest.\n Query: \"BEGIN\"."
(monitor.ml.Error
 ("Async was unable to add a file descriptor to its table of open file descriptors"
  (file_descr 9)
  (error
   "Attempt to register a file descriptor with Async that Async believes it is already managing.")
  (backtrace
   ("Raised by primitive operation at Base__Backtrace.get in file \"src/backtrace.ml\", line 10, characters 2-48"
    "Called from Async_unix__Raw_scheduler.create_fd in file \"src/raw_scheduler.ml\", line 109, characters 66-84"
    "Called from Async_unix__Fd.create in file \"src/fd.ml\" (inlined), line 40, characters 2-112"
    "Called from Caqti_async.System.Unix.wrap_fd in file \"lib-async/caqti_async.ml\", line 44, characters 15-60"
    "Called from Caqti_driver_postgresql.Connect_functor.Make_connection_base.reset.(fun) in file \"lib-driver/caqti_driver_postgresql.ml\", line 603, characters 10-56"
    "Called from Async_kernel__Deferred0.bind.(fun) in file \"src/deferred0.ml\", line 54, characters 64-69"
    "Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5"
    "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"
    "Called from Async_kernel__Scheduler1.run_jobs in file \"src/scheduler1.ml\", line 335, characters 8-51"
    "Called from Async_kernel__Scheduler.run_cycle.run_jobs in file \"src/scheduler.ml\", line 173, characters 10-30"
    "Called from Async_kernel__Scheduler.run_cycle in file \"src/scheduler.ml\", line 181, characters 2-12"
    "Called from Async_unix__Raw_scheduler.have_lock_do_cycle in file \"src/raw_scheduler.ml\", line 631, characters 4-49"
    "Called from Async_unix__Raw_scheduler.be_the_scheduler.loop in file \"src/raw_scheduler.ml\", line 875, characters 6-16"
    "Called from Async_unix__Raw_scheduler.be_the_scheduler in file \"src/raw_scheduler.ml\", line 879, characters 24-31"
    "Called from Async_command.in_async.(fun) in file \"async_command/src/async_command.ml\", line 75, characters 21-38"
    "Called from Core_kernel__Command.For_unix.run.(fun) in file \"src/command.ml\", line 2453, characters 8-238"
    "Called from Base__Exn.handle_uncaught_aux in file \"src/exn.ml\", line 111, characters 6-10"
    "Called from Dune__exe__Archive in file \"src/app/archive/archive.ml\", line 15, characters 6-95"))
  (scheduler
   ((mutex
     ((id_of_thread_holding_lock 0) (num_using_blocker 0) (blocker ())))
    (is_running true) (have_called_go true)
    (fds_whose_watching_has_changed
     (((file_descr 10)
       (info
        (socket
         ((listening_on
           ((file_descr 7) (info ((type_ tcp) (listening_on 0.0.0.0:3086)))
            (kind (Socket Passive)) (supports_nonblock true)
            (have_set_nonblock true) (state (Open Empty))
            (watching ((read (Watch_once Empty)) (write Not_watching)))
            (watching_has_changed false) (num_active_syscalls 1)
            (close_finished Empty)))
          (client 172.17.0.4:34797))))
       (kind (Socket Active)) (supports_nonblock true)
       (have_set_nonblock true) (state (Open Empty))
       (watching ((read (Watch_once Empty)) (write Not_watching)))
       (watching_has_changed true) (num_active_syscalls 1)
       (close_finished Empty))))
    (file_descr_watcher
     (Epoll
      ((timerfd 5)
       (epoll
        (In_use
         ((epollfd 6)
          (flags_by_fd
           ((num_keys 65536)
            (alist ((10 (in)) (7 (in)) (3 (in)) (5 (in et))))))
          (max_ready_events 256) (num_ready_events 0) (ready_events ()))))
       (handle_fd_read_ready <fun>) (handle_fd_write_ready <fun>))))
    (time_spent_waiting_for_io 41725028794736)
    (fd_by_descr
     ((3
       ((file_descr 3) (info interruptor_pipe_read) (kind Fifo)
        (supports_nonblock true) (have_set_nonblock true)
        (state (Open Empty))
        (watching
         ((read (Watch_repeatedly "<Pool.Pointer.t: 0x380000039>" Empty))
          (write Not_watching)))
        (watching_has_changed false) (num_active_syscalls 1)
        (close_finished Empty)))
      (4
       ((file_descr 4) (info interruptor_pipe_write) (kind Fifo)
        (supports_nonblock true) (have_set_nonblock true)
        (state (Open Empty))
        (watching ((read Not_watching) (write Not_watching)))
        (watching_has_changed false) (num_active_syscalls 0)
        (close_finished Empty)))
      (7
       ((file_descr 7) (info ((type_ tcp) (listening_on 0.0.0.0:3086)))
        (kind (Socket Passive)) (supports_nonblock true)
        (have_set_nonblock true) (state (Open Empty))
        (watching ((read (Watch_once Empty)) (write Not_watching)))
        (watching_has_changed false) (num_active_syscalls 1)
        (close_finished Empty)))
      (9
       ((file_descr 9) (info "Caqti_async file descriptor")
        (kind (Socket Active)) (supports_nonblock true)
        (have_set_nonblock false) (state (Open Empty))
        (watching ((read Not_watching) (write Not_watching)))
        (watching_has_changed false) (num_active_syscalls 0)
        (close_finished Empty)))
      (10
       ((file_descr 10)
        (info
         (socket
          ((listening_on
            ((file_descr 7) (info ((type_ tcp) (listening_on 0.0.0.0:3086)))
             (kind (Socket Passive)) (supports_nonblock true)
             (have_set_nonblock true) (state (Open Empty))
             (watching ((read (Watch_once Empty)) (write Not_watching)))
             (watching_has_changed false) (num_active_syscalls 1)
             (close_finished Empty)))
           (client 172.17.0.4:34797))))
        (kind (Socket Active)) (supports_nonblock true)
        (have_set_nonblock true) (state (Open Empty))
        (watching ((read (Watch_once Empty)) (write Not_watching)))
        (watching_has_changed true) (num_active_syscalls 1)
        (close_finished Empty)))))
    (timerfd (5)) (timerfd_set_at (2023-10-17 21:40:19.970007040Z))
    (scheduler_thread_id 0)
    (interruptor
     ((pipe
       ((read
         ((file_descr 3) (info interruptor_pipe_read) (kind Fifo)
          (supports_nonblock true) (have_set_nonblock true)
          (state (Open Empty))
          (watching
           ((read (Watch_repeatedly "<Pool.Pointer.t: 0x380000039>" Empty))
            (write Not_watching)))
          (watching_has_changed false) (num_active_syscalls 1)
          (close_finished Empty)))
        (write
         ((file_descr 4) (info interruptor_pipe_write) (kind Fifo)
          (supports_nonblock true) (have_set_nonblock true)
          (state (Open Empty))
          (watching ((read Not_watching) (write Not_watching)))
          (watching_has_changed false) (num_active_syscalls 0)
          (close_finished Empty)))))
      (already_interrupted false) (clearbuffer <opaque>)))
    (signal_manager
     ((handlers_by_signal ((-8 ((bag ()))))) (delivered <opaque>)
      (thread_safe_notify_signal_delivered <fun>)))
    (thread_pool
     ((id 0) (state In_use)
      (finished
       ((value ()) (num_waiting 0) (mutex <opaque>) (full <opaque>)))
      (mutex
       ((id_of_thread_holding_lock -1) (num_using_blocker 0) (blocker ())))
      (default_priority 0) (max_num_threads 50) (cpu_affinity Inherit)
      (num_threads 2) (thread_creation_failure_lockout 1s)
      (last_thread_creation_failure ())
      (thread_by_id
       ((1
         ((name fcntl_getfl) (thread_id (1)) (priority 0) (state Available)
          (unfinished_work 0)
          (work_queue
           ((ev_q ()) (maxsize 1) (mutex <opaque>) (not_empty <opaque>)
            (not_full <opaque>) (finally <fun>)))))
        (2
         ((name fcntl_getfl) (thread_id (2)) (priority 0) (state Available)
          (unfinished_work 0)
          (work_queue
          ((ev_q ()) (maxsize 1) (mutex <opaque>) (not_empty <opaque>)
           (not_full <opaque>) (finally <fun>))))))
      (work_queue ()) (unfinished_work 0) (num_work_completed 403)))
    (handle_thread_pool_stuck <fun>) (next_tsc_calibration 739305383405037)
    (kernel_scheduler
     ((check_access ())
      (job_pool
       ((slots_per_tuple 3) (capacity 32) (length 17) (next_id 16853)
        (first_free (Free 93)) (dummy (<opaque>))))
      (normal_priority_jobs
       ((num_jobs_run 216891) (jobs_left_this_cycle 467) (jobs <opaque>)
        (mask 15) (front 5) (length 0)))
      (low_priority_jobs
       ((num_jobs_run 776) (jobs_left_this_cycle 500) (jobs <opaque>)
        (mask 0) (front 0) (length 1)))
      (very_low_priority_workers ())
      (main_execution_context
       ((monitor
         (((name main) (here ()) (id 1) (has_seen_error false)
           (is_detached false))))
        (priority Normal) (local_storage ()) (tid 0) (backtrace_history ())))
      (current_execution_context
       ((monitor
         (((name "") (here ()) (id 2) (has_seen_error false)
           (is_detached true))))
        (priority Normal) (local_storage ()) (tid 0) (backtrace_history ())))
      (uncaught_exn ()) (cycle_count 316310)
      (cycle_start "2023-10-17 21:40:19.313943906Z") (in_cycle true)
      (run_every_cycle_start (<fun>)) (run_every_cycle_end ())
      (last_cycle_time 21.23us) (last_cycle_num_jobs 2)
      (total_cycle_time 5.811219969s)
      (time_source (wall_clock (now 2023-10-17T21:40:19+0000)))
      (external_jobs
       ((length 0) (front ((value None) (next <opaque>)))
        (back ((value None) (next <opaque>)))
        (unused_elts (Some ((value None) (next <opaque>))))))
      (thread_safe_external_job_hook <fun>) (job_queued_hook ())
      (event_added_hook ()) (yield <opaque>)
      (yield_until_no_jobs_remain <opaque>) (check_invariants false)
      (max_num_jobs_per_priority_per_cycle 500) (record_backtraces false)
      (on_start_of_cycle <fun>) (on_end_of_cycle <fun>)
      (long_jobs_last_cycle ()) (cycle_started false)))
    (have_lock_do_cycle ()) (max_inter_cycle_timeout 50ms)
    (min_inter_cycle_timeout 0s) (initialized_at <opaque>))))
 ("Raised at Base__Error.raise in file \"src/error.ml\" (inlined), line 8, characters 14-30"
  "Called from Base__Error.raise_s in file \"src/error.ml\", line 9, characters 19-40"
  "Called from Async_unix__Fd.create in file \"src/fd.ml\" (inlined), line 40, characters 2-112"
  "Called from Caqti_async.System.Unix.wrap_fd in file \"lib-async/caqti_async.ml\", line 44, characters 15-60"
  "Called from Caqti_driver_postgresql.Connect_functor.Make_connection_base.reset.(fun) in file \"lib-driver/caqti_driver_postgresql.ml\", line 603, characters 10-56"
  "Called from Async_kernel__Deferred0.bind.(fun) in file \"src/deferred0.ml\", line 54, characters 64-69"
  "Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"))

Steps to Reproduce

1. Run the archive node and wait until the error occurs.
2. Postgres starts pulling all the resources from the server, making the whole server mostly unresponsive (see the diagnostic sketch after this list).
3. It does not resolve without killing Postgres. While it might be a Postgres-related problem, I have never encountered this kind of issue on the Postgres instances we utilize for various tests ...
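
One way to see what Postgres is stuck on when this happens is to list backends left "idle in transaction", since the trace above dies around a `BEGIN`. A hedged diagnostic sketch, not part of Mina's tooling: it uses the same Caqti_async driver as the archive, assumes the Caqti 1.x request API, and the connection URI is a placeholder for your archive database.

```ocaml
(* Hypothetical diagnostic (not Mina code): list Postgres backends left
   "idle in transaction". Assumes Caqti 1.x; the URI is a placeholder. *)
open Core
open Async

let stuck_sessions =
  Caqti_request.collect Caqti_type.unit
    Caqti_type.(tup2 int string)
    "SELECT pid, coalesce(query, '') FROM pg_stat_activity \
     WHERE state = 'idle in transaction'"

let main () =
  Caqti_async.connect (Uri.of_string "postgresql://localhost:5432/archive")
  >>= function
  | Error e -> eprintf "connect failed: %s\n" (Caqti_error.show e); exit 1
  | Ok (module Db : Caqti_async.CONNECTION) ->
    Db.collect_list stuck_sessions () >>= function
    | Error e -> eprintf "query failed: %s\n" (Caqti_error.show e); exit 1
    | Ok rows ->
      (* Each row is a backend holding a transaction open; these may be
         what keeps pinning the server after the archive crashes. *)
      List.iter rows ~f:(fun (pid, query) ->
          printf "pid=%d query=%s\n" pid query);
      exit 0

let () =
  don't_wait_for (main ());
  never_returns (Scheduler.go ())
```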

Expected Result

The archive process should crash and recover on its own, or at the very least not cause Postgres to clog up the server.

Actual Result

No matter what specs we throw at it, this issue happens after some time frame, and then Postgres takes all of the server's resources.

How frequently do you see this issue?

Frequently

What is the impact of this issue on your ability to run a node?

High

Status

Global number of accounts:                     200716
Block height:                                  921
Max observed block height:                     921
Max observed unvalidated block height:         921
Local uptime:                                  2d17h50m34s
Ledger Merkle root:                            jxPEXnnaDKoE9xuGTbQcyUAcHsCjERtCqvAk7wxRafuyRdKt3Ei
Protocol state hash:                           3NKMdU2rkK62KyXdzj3ddHvLG4VRfNXcnpi13NGoit1ve4w9ZT7D
Chain id:                                      332c8cc05ba8de9efc23a011f57015d8c9ec96fac81d5d3f7a06969faf4bce92
Git SHA-1:                                     55b78189c46e1811b8bdb78864cfa95409aeb96a
Configuration directory:                       /root/.mina-config
Peers:                                         158
User_commands sent:                            0
SNARK worker:                                  None
SNARK work fee:                                100000000
Sync status:                                   Synced
Catchup status:
        Finished:  559
Block producers running:                       1 (B62qm35gEuL3F4dLEUpnXTW3tyCbmrBAyY5bmrMggFFdpNLLEbE8khc)
Coinbase receiver:                             Block producer
Best tip consensus time:                       epoch=0, slot=1314
Best tip global slot (across all hard-forks):  1314
Next block will be produced in:                in 2.323d for slot: 2430 slot-since-genesis: 2430 (Generated from consensus at slot: 104 slot-since-genesis: 104)
Consensus time now:                            epoch=0, slot=1315
Consensus mechanism:                           proof_of_stake
Consensus configuration:
        Delta:                     0
        k:                         290
        Slots per epoch:           7140
        Slot duration:             3m
        Epoch duration:            14d21h
        Chain start timestamp:     2023-10-17 16:01:01.000000Z
        Acceptable network delay:  3m
Addresses and ports:
        External IP:
        Bind IP:        0.0.0.0
        Libp2p PeerID:  12D3KooWMPxTu24mCpi3TwmkU4fJk7a8TQ4agFZeTHQRi8KCc3nj
        Libp2p port:    8302
        Client port:    8301

Metrics:
        block_production_delay:             7 (1 0 0 0 0 0 0)
        transaction_pool_diff_received:     2
        transaction_pool_diff_broadcasted:  0
        transactions_added_to_pool:         4554
        transaction_pool_size:              2
        snark_pool_diff_received:           315
        snark_pool_diff_broadcasted:        0
        pending_snark_work:                 0
        snark_pool_size:                    5605

Additional information

No response

psteckler commented 11 months ago

Just checked O(1)'s testworld archive dbs, both have no missing blocks
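
For anyone who wants to run the same check against their own archive DB, a hedged sketch (not O(1)'s tooling): it assumes the standard archive schema, where the `blocks` table has one row per block with a `height` column, plus Caqti 1.x and a placeholder connection URI.

```ocaml
(* Hypothetical gap check (not O(1)'s tooling): print every height up to
   the current maximum that has no row in the archive's blocks table.
   Assumes the standard archive schema; the URI is a placeholder. *)
open Core
open Async

let missing_heights =
  Caqti_request.collect Caqti_type.unit Caqti_type.int
    "SELECT h::int FROM generate_series(1, (SELECT max(height) FROM blocks)) h \
     WHERE NOT EXISTS (SELECT 1 FROM blocks b WHERE b.height = h)"

let () =
  don't_wait_for
    (Caqti_async.connect (Uri.of_string "postgresql://localhost:5432/archive")
     >>= function
     | Error e -> eprintf "%s\n" (Caqti_error.show e); exit 1
     | Ok (module Db : Caqti_async.CONNECTION) ->
       Db.collect_list missing_heights () >>= function
       | Error e -> eprintf "%s\n" (Caqti_error.show e); exit 1
       | Ok [] -> printf "no missing blocks\n"; exit 0
       | Ok hs ->
         List.iter hs ~f:(printf "missing height %d\n");
         exit 2);
  never_returns (Scheduler.go ())
```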

dkijania commented 9 months ago

@EmrePiconbello you wrote in #14414:

[...] because of other weird behavior with postgre we move the postgre to remote isolated instance. Since we move it to remote postgre other issue we reported didn't happen.

Am I correct that you are no longer suffering from this issue after your infra changes?

EmrePiconbello commented 9 months ago

This issue caused us to migrate to a remote Postgres, but having a remote Postgres has other issues.

amc-ie commented 9 months ago

@EmrePiconbello can you provide more information around these other issues? Thanks in advance!

EmrePiconbello commented 8 months ago

@amc-ie I will try to summarize. The server running Postgres, the archive process, and the Mina node was using 128 GB of RAM and all of the CPU, or something like that. We were using Docker. At the time, the error for which I opened this issue was what we saw on the server.

We didn't investigate further, assuming it was a Docker/Postgres version-related issue that the archive node triggered.

Following that, we just plugged the archive process into our remote test Postgres server, which has been operational since 2018. By doing that we wanted to eliminate Postgres and Docker as variables.

Connecting the archive process to the remote Postgres brought up this issue, which was causing it to miss blocks, etc.: https://github.com/MinaProtocol/mina/issues/14415

After some research we also realized that this issue still persists: https://github.com/MinaProtocol/mina/issues/14421. We had spotted it a few times years ago, but we never knew it was related to the archive node, since we shut down our archive within a very short time frame after mainnet.

https://github.com/MinaProtocol/mina/issues/14755 is some kind of workaround for the above issue.

At the moment I would just consider this one a standing problem unless https://github.com/MinaProtocol/mina/issues/14415 is being worked on.

For archive node releases that are ready for an actual production environment, having an archive node that is highly available, auto-scales, and has multiple shards and clusters, etc., is very important, so archiving to a local Postgres and to a remote Postgres should behave similarly.