IntersectMBO / cardano-db-sync

A component that follows the Cardano chain and stores blocks and transactions in PostgreSQL
Apache License 2.0

DB-SYNC doesn't move #1879

Open NanuIjaz opened 1 week ago

NanuIjaz commented 1 week ago

db-sync 13.5.0.2 with PostgreSQL 14 is stuck here and not moving. Eventually it fails and goes back to the same stage. A higher work_mem is also set in Postgres.

[db-sync-node:Warning:81] [2024-10-18 11:25:58.77 UTC] Creating Indexes. This may require an extended period of time to perform. Setting a higher maintenance_work_mem from Postgres usually speeds up this process. These indexes are not used by db-sync but are meant for clients. If you want to skip some of these indexes, you can stop db-sync, delete or modify any migration-4-* files in the schema directory and restart it

The error:

[db-sync-node:Error:81] [2024-10-18 12:58:12.34 UTC] runDBThread: SqlError {sqlState = "", sqlExecStatus = FatalError, sqlErrorMsg = "", sqlErrorDetail = "", sqlErrorHint = ""}

Please help with this.

sgillespie commented 1 week ago

After you get the error, are the indices still being created? Try running this query:

select * from pg_stat_progress_create_index
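
If it helps, a slightly expanded variant of that query (just a sketch; it assumes PostgreSQL 12 or later, so your PG 14 is fine) resolves the table and index names and shows roughly how far along each build is:

select p.pid,
       c.relname as table_name,
       i.relname as index_name,
       p.phase,
       p.blocks_done,
       p.blocks_total
from pg_stat_progress_create_index p
left join pg_class c on c.oid = p.relid
left join pg_class i on i.oid = p.index_relid;

An empty result means no index build is in progress on that server at the moment.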

Can you also check whether postgresql and cardano-node are still running at that point? And if you could give more information about your environment, that would be helpful.

NanuIjaz commented 1 week ago

Sorry, I should have given more details earlier.

I am pretty confused by the nature of the issues we have.

We are running both db-sync and Postgres in Docker.

I ran the select query you gave; it didn't return anything.

I noticed some strange behaviour. Sometimes it gives this error after waiting at this point:

[db-sync-node:Info:6] [2024-10-21 11:05:09.83 UTC] Found maintenance_work_mem=2GB, max_parallel_maintenance_workers=4
ExitFailure 2

Errors in file: /tmp/migrate-2024-10-21T11:05:09.835467022Z.log

Sometimes it gives the error that I mentioned earlier.

After throwing this error, the container restarts and starts syncing again. I can see it's waiting here now:

[db-sync-node:Info:81] [2024-10-21 12:41:50.32 UTC] Received block which is not in the db with HeaderFields {headerFieldSlot = SlotNo 137938707, headerFieldBlockNo = BlockNo 10990914, headerFieldHash = a544cd2f7bf24902ac5d9b0f674f67b02f46254b82fe8a6fafa58758f7956fba}. Time to restore consistency.
[db-sync-node:Info:81] [2024-10-21 12:41:50.32 UTC] Starting at epoch 516

I think it will error out after this; I am watching while it waits.

sgillespie commented 1 week ago

This message:

Errors in file: /tmp/migrate-2024-10-21T11:05:09.835467022Z.log

Indicates there is a problem running a migration, which will cause db-sync to exit. Can you post the contents of that file?

NanuIjaz commented 1 week ago

I kept losing that file because the container restarts. I am tailing the file right now; it doesn't show any messages yet.

NanuIjaz commented 1 week ago

Just now it crashed like this:

[db-sync-node:Info:81] [2024-10-21 12:41:50.32 UTC] Starting at epoch 516

[db-sync-node:Error:81] [2024-10-21 14:41:53.13 UTC] runDBThread: libpq: failed (no connection to the server )
[db-sync-node:Error:111] [2024-10-21 14:41:53.13 UTC] recvMsgRollForward: AsyncCancelled
[db-sync-node:Error:106] [2024-10-21 14:41:53.13 UTC] ChainSyncWithBlocksPtcl: AsyncCancelled
[db-sync-node.Subscription:Error:102] [2024-10-21 14:41:53.13 UTC] Identity Application Exception: LocalAddress "/home/cardano/ipc/node.socket" SubscriberError {seType = SubscriberWorkerCancelled, seMessage = "SubscriptionWorker exiting", seStack = []}
cardano-db-sync: libpq: failed (no connection to the server )

NanuIjaz commented 1 week ago

This is from the logs:

Running : migration-1-0000-20190730.sql init

(1 row)

Running : migration-1-0001-20190730.sql migrate

(1 row)

Running : migration-1-0002-20190912.sql
psql:/home/cardano/cardano-db-sync/schema/migration-1-0002-20190912.sql:32: NOTICE: Dropping view : "utxo_byron_view"
psql:/home/cardano/cardano-db-sync/schema/migration-1-0002-20190912.sql:32: NOTICE: Dropping view : "utxo_view"
drop_cexplorer_views

(1 row)

Running : migration-1-0003-20200211.sql migrate

(1 row)

Running : migration-1-0004-20201026.sql migrate

(1 row)

Running : migration-1-0005-20210311.sql migrate

(1 row)

Running : migration-1-0006-20210531.sql migrate

(1 row)

Running : migration-1-0007-20210611.sql migrate

(1 row)

Running : migration-1-0008-20210727.sql migrate

(1 row)

Running : migration-1-0009-20210727.sql migrate

(1 row)

Running : migration-1-0010-20230612.sql migrate

(1 row)

Running : migration-1-0011-20230814.sql migrate

(1 row)

Running : migration-1-0012-20240211.sql migrate

(1 row)

Running : migration-1-0013-20240318.sql migrate

(1 row)

Running : migration-1-0014-20240411.sql migrate

(1 row)

Running : migration-1-0015-20240724.sql migrate

(1 row)

Running : migration-2-0001-20211003.sql migrate

(1 row)

Running : migration-2-0002-20211007.sql migrate

(1 row)

Running : migration-2-0003-20211013.sql migrate

(1 row)

Running : migration-2-0004-20211014.sql migrate

(1 row)

Running : migration-2-0005-20211018.sql migrate

(1 row)

Running : migration-2-0006-20220105.sql migrate

(1 row)

Running : migration-2-0007-20220118.sql migrate

(1 row)

Running : migration-2-0008-20220126.sql migrate

(1 row)

Running : migration-2-0009-20220207.sql migrate

(1 row)

Running : migration-2-0010-20220225.sql migrate

(1 row)

Running : migration-2-0011-20220318.sql migrate

(1 row)

Running : migration-2-0012-20220502.sql migrate

(1 row)

Running : migration-2-0013-20220505.sql migrate

(1 row)

Running : migration-2-0014-20220505.sql migrate

(1 row)

Running : migration-2-0015-20220505.sql migrate

(1 row)

Running : migration-2-0016-20220524.sql migrate

(1 row)

Running : migration-2-0017-20220526.sql migrate

(1 row)

Running : migration-2-0018-20220604.sql migrate

(1 row)

Running : migration-2-0019-20220615.sql migrate

(1 row)

Running : migration-2-0020-20220919.sql migrate

(1 row)

Running : migration-2-0021-20221019.sql migrate

(1 row)

Running : migration-2-0022-20221020.sql migrate

(1 row)

Running : migration-2-0023-20221019.sql migrate

(1 row)

Running : migration-2-0024-20221020.sql migrate

(1 row)

Running : migration-2-0025-20221020.sql migrate

(1 row)

Running : migration-2-0026-20231017.sql migrate

(1 row)

Running : migration-2-0027-20230713.sql migrate

(1 row)

Running : migration-2-0028-20240117.sql migrate

(1 row)

Running : migration-2-0029-20240117.sql migrate

(1 row)

Running : migration-2-0030-20240108.sql migrate

(1 row)

Running : migration-2-0031-20240117.sql migrate

(1 row)

Running : migration-2-0032-20230815.sql migrate

(1 row)

Running : migration-2-0033-20231009.sql migrate

(1 row)

Running : migration-2-0034-20240301.sql migrate

(1 row)

Running : migration-2-0035-20240308.sql migrate

(1 row)

Running : migration-2-0036-20240318.sql migrate

(1 row)

Running : migration-2-0037-20240403.sql migrate

(1 row)

Running : migration-2-0038-20240603.sql migrate

(1 row)

Running : migration-2-0039-20240703.sql migrate

(1 row)

Running : migration-2-0040-20240626.sql migrate

(1 row)

Running : migration-2-0041-20240711.sql migrate

(1 row)

Running : migration-2-0042-20240808.sql migrate

(1 row)

Running : migration-2-0043-20240828.sql migrate

(1 row)

Running : migration-3-0001-20190816.sql
Running : migration-3-0002-20200521.sql
psql:/home/cardano/cardano-db-sync/schema/migration-3-0002-20200521.sql:4: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
psql:/home/cardano/cardano-db-sync/schema/migration-3-0002-20200521.sql:4: error: connection to server was lost
ExitFailure 2

sgillespie commented 1 week ago

Is it possible you're running out of memory? It seems clear from the logs that you're losing the connection to the Postgres server.

NanuIjaz commented 1 week ago

listen_addresses = '*'
port = '5432'
max_connections = '600'
shared_buffers = '32GB'
effective_cache_size = '96GB'
maintenance_work_mem = '2GB'
checkpoint_completion_target = '0.9'
wal_buffers = '16MB'
default_statistics_target = '100'
random_page_cost = '1.0'
effective_io_concurrency = '200'
work_mem = '8GB'
min_wal_size = '1GB'
max_wal_size = '4GB'
max_worker_processes = '128'
max_parallel_workers_per_gather = '16'
max_parallel_workers = '64'
max_parallel_maintenance_workers = '4'
log_min_duration_statement = '2000'

This is our postgres.conf file. I do see high memory consumption, but it's not at 100%. Do you suggest any changes to the above?

sgillespie commented 1 week ago

You might want to check out this tool: https://pgtune.leopard.in.ua/. This is what I used to generate my configuration. For my config, I chose "online transaction processing system".
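
Once you've applied a new configuration, one quick sanity check (just a sketch; run it from any psql session) is to confirm the values the server is actually using:

select name, setting, unit, source
from pg_settings
where name in ('shared_buffers', 'work_mem', 'maintenance_work_mem',
               'max_parallel_maintenance_workers', 'max_connections');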

NanuIjaz commented 1 week ago

Here is the error; I was able to drill down to this:

2024-10-23 14:57:27.050 GMT [176] LOG: could not receive data from client: Connection reset by peer
2024-10-23 14:57:27.050 GMT [176] LOG: unexpected EOF on client connection with an open transaction

rdlrt commented 1 week ago

That error simply says a client connection was terminated.

You would need to look at the reason your Postgres DB crashed (if needed, look at it outside of Docker first). It could be any of a myriad of reasons [e.g. running out of infrastructure memory, for which you can check OOM messages in the system logs; ulimits; corrupted DB WAL markers if you haven't cleared the existing DB beforehand; etc.].
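
For example (assuming the database is reachable with psql), checking when the postmaster last started tells you whether the Postgres server process itself keeps restarting, rather than merely dropping the db-sync connection:

-- returns the time the current postmaster process started;
-- if this keeps changing, the server is crashing/restarting
select pg_postmaster_start_time();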

IMO, GitHub is not the right medium for troubleshooting system/infra issues. Discord, the forum, or Stack Exchange would be better places to search for an existing thread, or to start a new one with a better synopsis than what's presented here.