NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a developer, I want to restart the Postgres server on nwcal-wresdb-prod02 after doing a bit of investigation work into why it stopped #165

Open epag opened 3 weeks ago

epag commented 3 weeks ago

Author Name: Hank (Hank) Original Redmine Issue: 127570, https://vlab.noaa.gov/redmine/issues/127570 Original Date: 2024-03-11 Original Assignee: Hank


I confirmed that nwcal-wresdb-prod02 does not have Postgres running (though the machine is accessible and I can sudo to the postgres user). Here is a list of the processes owned by postgres on nwcal-wresdb-prod01:

-bash-4.2$ ps -ef | grep postgres
postgres   3319      1  0 Jan31 ?        00:10:35 /usr/pgsql-13/bin/postmaster -D /var/lib/pgsql/13/data/
postgres   3967   3319  0 Jan31 ?        00:00:24 postgres: logger
postgres   4014   3319  0 Jan31 ?        00:15:25 postgres: checkpointer
postgres   4015   3319  0 Jan31 ?        00:05:39 postgres: background writer
postgres   4016   3319  0 Jan31 ?        00:34:49 postgres: walwriter
postgres   4017   3319  0 Jan31 ?        00:10:33 postgres: autovacuum launcher
postgres   4018   3319  0 Jan31 ?        00:37:54 postgres: stats collector
postgres   4019   3319  0 Jan31 ?        00:00:01 postgres: logical replication launcher
root      65620  65592  0 23:10 pts/0    00:00:00 sudo su - postgres
root      65629  65620  0 23:10 pts/0    00:00:00 su - postgres
postgres  65630  65629  0 23:10 pts/0    00:00:00 -bash
postgres  65672  65630  0 23:10 pts/0    00:00:00 ps -ef
postgres  65673  65630  0 23:10 pts/0    00:00:00 grep --color=auto postgres

Here is -prod02:

-bash-4.2$ ps -ef | grep postgres
root     110971 110946  0 23:09 pts/2    00:00:00 sudo su - postgres
root     110980 110971  0 23:09 pts/2    00:00:00 su - postgres
postgres 110981 110980  0 23:09 pts/2    00:00:00 -bash
postgres 111029 110981  0 23:10 pts/2    00:00:00 ps -ef
postgres 111030 110981  0 23:10 pts/2    00:00:00 grep --color=auto postgres
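
As an aside, pg_ctl can report the postmaster status directly, which is a quicker check than scanning ps output. A minimal sketch, using the binary and data-directory paths shown above and assuming it is run as the postgres user:

/usr/pgsql-13/bin/pg_ctl -D /var/lib/pgsql/13/data status
# On -prod01 this should report a running postmaster (PID 3319); on -prod02 it should report "no server running".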

I also confirmed that clean-and-switch attempts have been failing since Sun Mar 10 12:00:01 UTC 2024, so the server shut down some time before then. Disk space does not appear to be an issue on -prod02.
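
A minimal sketch of that disk-space check, assuming the data directory sits under /var/lib/pgsql as shown in the ps output above (the exact mount layout on -prod02 is an assumption):

df -h /var/lib/pgsql/13/data    # free space on the filesystem holding the data directory
df -i /var/lib/pgsql/13/data    # inode usage, which can also block writes when exhausted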

Anyway, more tomorrow morning,

Hank


Redmine related issue(s): 128350


epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-11T23:24:17Z


Here is the bottom of the log file (@/var/lib/pgsql/13/data/pg_log/postgresql-0310.log@):

2024-03-10 01:08:47.230 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.233 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.236 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 LOG:  disconnection: session time: 2:06:07.805 user=wres_user8 database=wres8 host=10.3.10.104 port=34678
2024-03-10 01:08:47.236 UTC 10.3.10.104(34108) wres8 [98212] wres_user8 65ece0d5.17fa4 LOG:  disconnection: session time: 2:47:38.165 user=wres_user8 database=wres8 host=10.3.10.104 port=34108
2024-03-10 01:08:48.628 UTC 10.3.10.104(36420) wres8 [109284] wres_user8 65ed03d5.1aae4 LOG:  disconnection: session time: 0:18:19.345 user=wres_user8 database=wres8 host=10.3.10.104 port=36420
2024-03-10 01:08:48.628 UTC 10.3.10.104(36430) wres8 [109298] wres_user8 65ed03e4.1aaf2 LOG:  disconnection: session time: 0:18:03.951 user=wres_user8 database=wres8 host=10.3.10.104 port=36430
2024-03-10 01:08:48.628 UTC 10.3.10.104(36432) wres8 [109299] wres_user8 65ed03e4.1aaf3 LOG:  disconnection: session time: 0:18:03.826 user=wres_user8 database=wres8 host=10.3.10.104 port=36432
2024-03-10 01:08:48.628 UTC 10.3.10.104(36438) wres8 [109306] wres_user8 65ed03eb.1aafa LOG:  disconnection: session time: 0:17:56.994 user=wres_user8 database=wres8 host=10.3.10.104 port=36438
2024-03-10 01:08:48.628 UTC 10.3.10.104(36446) wres8 [109331] wres_user8 65ed03fd.1ab13 LOG:  disconnection: session time: 0:17:39.358 user=wres_user8 database=wres8 host=10.3.10.104 port=36446
2024-03-10 01:08:48.628 UTC 10.3.10.104(36452) wres8 [109340] wres_user8 65ed0406.1ab1c LOG:  disconnection: session time: 0:17:29.875 user=wres_user8 database=wres8 host=10.3.10.104 port=36452
2024-03-10 01:08:48.629 UTC 10.3.10.104(36456) wres8 [109353] wres_user8 65ed0413.1ab29 LOG:  disconnection: session time: 0:17:16.638 user=wres_user8 database=wres8 host=10.3.10.104 port=36456
2024-03-10 01:08:48.629 UTC 10.3.10.104(36462) wres8 [109358] wres_user8 65ed0418.1ab2e LOG:  disconnection: session time: 0:17:11.849 user=wres_user8 database=wres8 host=10.3.10.104 port=36462
2024-03-10 01:08:48.629 UTC 10.3.10.104(36464) wres8 [109359] wres_user8 65ed0418.1ab2f LOG:  disconnection: session time: 0:17:11.784 user=wres_user8 database=wres8 host=10.3.10.104 port=36464
2024-03-10 01:08:48.630 UTC 10.3.10.104(36474) wres8 [109401] wres_user8 65ed0449.1ab59 LOG:  disconnection: session time: 0:16:22.756 user=wres_user8 database=wres8 host=10.3.10.104 port=36474
2024-03-10 01:08:48.630 UTC 10.3.10.104(36484) wres8 [109424] wres_user8 65ed0465.1ab70 LOG:  disconnection: session time: 0:15:55.497 user=wres_user8 database=wres8 host=10.3.10.104 port=36484
2024-03-10 01:08:48.630 UTC 10.3.10.104(36466) wres8 [109378] wres_user8 65ed042e.1ab42 LOG:  disconnection: session time: 0:16:50.236 user=wres_user8 database=wres8 host=10.3.10.104 port=36466
2024-03-10 01:08:48.636 UTC 10.3.10.104(36418) wres8 [109280] wres_user8 65ed03d1.1aae0 LOG:  disconnection: session time: 0:18:22.811 user=wres_user8 database=wres8 host=10.3.10.104 port=36418
2024-03-10 01:08:48.636 UTC 10.3.10.104(36428) wres8 [109292] wres_user8 65ed03de.1aaec LOG:  disconnection: session time: 0:18:10.165 user=wres_user8 database=wres8 host=10.3.10.104 port=36428
2024-03-10 01:08:48.637 UTC 10.3.10.104(36440) wres8 [109311] wres_user8 65ed03f1.1aaff LOG:  disconnection: session time: 0:17:51.027 user=wres_user8 database=wres8 host=10.3.10.104 port=36440
2024-03-10 01:08:48.638 UTC 10.3.10.104(36450) wres8 [109338] wres_user8 65ed0405.1ab1a LOG:  disconnection: session time: 0:17:30.931 user=wres_user8 database=wres8 host=10.3.10.104 port=36450
2024-03-10 01:08:48.638 UTC 10.3.10.104(36454) wres8 [109350] wres_user8 65ed0411.1ab26 LOG:  disconnection: session time: 0:17:18.672 user=wres_user8 database=wres8 host=10.3.10.104 port=36454
2024-03-10 01:08:48.778 UTC 10.3.10.104(36410) wres8 [109258] wres_user8 65ed03b7.1aaca LOG:  disconnection: session time: 0:18:49.598 user=wres_user8 database=wres8 host=10.3.10.104 port=36410
2024-03-10 01:08:54.274 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:09:04.434 UTC   [110630]  65ed0826.1b026 LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:09:14.264 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:09:24.406 UTC   [110647]  65ed083a.1b037 LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:09:34.268 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:09:44.413 UTC   [110664]  65ed084e.1b048 LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:09:54.269 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:10:04.413 UTC   [110791]  65ed0862.1b0c7 LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:10:14.277 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:10:24.423 UTC   [110808]  65ed0876.1b0d8 LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:10:34.284 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:10:44.463 UTC   [110825]  65ed088a.1b0e9 LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:10:54.369 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:11:04.625 UTC   [110842]  65ed089e.1b0fa LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:11:14.366 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:11:24.563 UTC   [110859]  65ed08b2.1b10b LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:11:34.395 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:11:44.637 UTC   [110876]  65ed08c6.1b11c LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:11:54.399 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:12:04.604 UTC   [110894]  65ed08da.1b12e LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:12:14.405 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:32:11.356 UTC   [87467]  65ecbf10.155ab PANIC:  could not fsync file "base/16385/446051": Input/output error
2024-03-10 01:32:11.542 UTC   [91649]  65d670a4.16601 LOG:  checkpointer process (PID 87467) was terminated by signal 6: Aborted
2024-03-10 01:32:11.542 UTC   [91649]  65d670a4.16601 LOG:  terminating any other active server processes
2024-03-10 01:32:11.542 UTC   [87470]  65ecbf10.155ae WARNING:  terminating connection because of crash of another server process
2024-03-10 01:32:11.542 UTC   [87470]  65ecbf10.155ae DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2024-03-10 01:32:11.542 UTC   [87470]  65ecbf10.155ae HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2024-03-10 01:32:11.765 UTC   [91649]  65d670a4.16601 LOG:  all server processes terminated; reinitializing
2024-03-10 01:32:13.096 UTC   [112428]  65ed0d9d.1b72c LOG:  database system was interrupted; last known up at 2024-03-10 01:01:13 UTC
2024-03-10 01:32:49.486 UTC   [112428]  65ed0d9d.1b72c LOG:  could not fsync file "./pg_log/postgresql-0310.log": Input/output error
2024-03-10 01:32:49.525 UTC   [112428]  65ed0d9d.1b72c LOG:  database system was not properly shut down; automatic recovery in progress
2024-03-10 01:32:49.991 UTC   [112428]  65ed0d9d.1b72c LOG:  redo starts at 1637/5E61C960
2024-03-10 01:34:12.097 UTC   [112428]  65ed0d9d.1b72c LOG:  invalid record length at 1638/1E795A08: wanted 24, got 0
2024-03-10 01:34:12.098 UTC   [112428]  65ed0d9d.1b72c LOG:  redo done at 1638/1E7959E0
2024-03-10 01:34:12.302 UTC   [112428]  65ed0d9d.1b72c LOG:  checkpoint starting: end-of-recovery immediate
2024-03-10 01:34:59.861 UTC   [112428]  65ed0d9d.1b72c PANIC:  could not access status of transaction 0
2024-03-10 01:34:59.861 UTC   [112428]  65ed0d9d.1b72c DETAIL:  Could not fsync file "pg_multixact/offsets/0081": Input/output error.
2024-03-10 01:35:00.023 UTC   [91649]  65d670a4.16601 LOG:  startup process (PID 112428) was terminated by signal 6: Aborted
2024-03-10 01:35:00.023 UTC   [91649]  65d670a4.16601 LOG:  aborting startup due to startup process failure
2024-03-10 01:35:00.624 UTC   [91649]  65d670a4.16601 LOG:  database system is shut down
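
The lines above are just the tail of the day's log; a minimal sketch of the same extraction, using the log path noted earlier:

tail -n 100 /var/lib/pgsql/13/data/pg_log/postgresql-0310.log
# Or jump straight to the fatal events:
grep -nE 'PANIC|FATAL|Input/output error' /var/lib/pgsql/13/data/pg_log/postgresql-0310.log
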
epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-11T23:25:50Z


Maybe there was a file system or mount issue on 3/10 at 1:32Z? I'll probably need ITSG to take a look at that.

Hank

EDIT: Fixed typo

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T11:26:43Z


Using @nwcal-wresdb-prod01@ as a basis for comparison, all of the messages I shared from the log above are rare or non-existent there. For example, I found none of these messages on -prod01:

using stale statistics instead of current ones because stats collector is not responding

These messages are very rare:

you don't own a lock of type ExclusiveLock

So, I think the problem started just after 1:07Z on 3/10. It starts with the stale-statistics messages, then the lock warnings, and then the disconnections:

2024-03-10 01:07:34.382 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:07:44.980 UTC   [110558]  65ed07d6.1afde LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:07:54.262 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:07:57.325 UTC   [87467]  65ecbf10.155ab LOG:  checkpoint complete: wrote 494112 buffers (51.5%); 0 WAL file(s) added, 123 removed, 319 recycled; write=1649.861 s, sync=371.585 s, total=2439.560 s; sync files=183, longest=35.066 s, average=2.031 s; distance=7232127 kB, estimate=7690449 kB
2024-03-10 01:07:57.349 UTC   [87467]  65ecbf10.155ab LOG:  checkpoint starting: time
2024-03-10 01:08:04.439 UTC   [110578]  65ed07ea.1aff2 LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:08:14.254 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:08:24.418 UTC   [110595]  65ed07fe.1b003 LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:08:34.257 UTC   [87470]  65ecbf10.155ae LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:08:44.425 UTC   [110612]  65ed0812.1b014 LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-03-10 01:08:47.229 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.229 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.229 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.229 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.230 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.230 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.230 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.230 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.230 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.233 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 WARNING:  you don't own a lock of type ExclusiveLock
2024-03-10 01:08:47.236 UTC 10.3.10.104(34678) wres8 [101264] wres_user8 65ecea8f.18b90 LOG:  disconnection: session time: 2:06:07.805 user=wres_user8 database=wres8 host=10.3.10.104 port=34678
2024-03-10 01:08:47.236 UTC 10.3.10.104(34108) wres8 [98212] wres_user8 65ece0d5.17fa4 LOG:  disconnection: session time: 2:47:38.165 user=wres_user8 database=wres8 host=10.3.10.104 port=34108
2024-03-10 01:08:48.628 UTC 10.3.10.104(36420) wres8 [109284] wres_user8 65ed03d5.1aae4 LOG:  disconnection: session time: 0:18:19.345 user=wres_user8 database=wres8 host=10.3.10.104 port=36420
2024-03-10 01:08:48.628 UTC 10.3.10.104(36430) wres8 [109298] wres_user8 65ed03e4.1aaf2 LOG:  disconnection: session time: 0:18:03.951 user=wres_user8 database=wres8 host=10.3.10.104 port=36430
2024-03-10 01:08:48.628 UTC 10.3.10.104(36432) wres8 [109299] wres_user8 65ed03e4.1aaf3 LOG:  disconnection: session time: 0:18:03.826 user=wres_user8 database=wres8 host=10.3.10.104 port=36432

The rest of the failure is then noted in the previous comment.

When I look up "Postgres server could not fsync file Input/output error," I see mention of disk issues. I'm going to see if I have access to any commands that can tell me if/when a disk mount went down and was brought back up... or something like that.
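
If I have the permissions, something along these lines should surface kernel-level I/O or mount errors around 1:30Z on 3/10 (a sketch only; whether dmesg still holds that history or a persistent journal is available on these hosts is an assumption):

dmesg -T | grep -iE 'i/o error|read-only|remount|xfs|ext4|nfs'    # kernel messages mentioning I/O or filesystem trouble
journalctl -k --since "2024-03-10 01:00:00" --until "2024-03-10 02:00:00"    # narrow the kernel log to the window of interest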

Hank

EDIT: Removed stray words

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-03-12T11:37:14Z


The warnings about exclusive locking are probably attempts to clean, which require an exclusive lock (for destruction).
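
For what it's worth, the specific warning in the log is what Postgres emits when a session calls pg_advisory_unlock for an advisory lock it does not hold, and the DatabaseLockManagerPostgres messages below suggest WRES manages its locks as advisory locks. A minimal illustration (the key 118648 is just one of the lock names seen in the WRES logs; the actual clean logic is not shown here):

psql -d wres8 -c "SELECT pg_advisory_unlock(118648);"
# WARNING:  you don't own a lock of type ExclusiveLock
# The function returns false when the lock is not held by this session.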

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T11:43:27Z


Nothing new, but sharing:

-bash-4.2$ systemctl status postgresql-13
● postgresql-13.service - PostgreSQL 13 database server
   Loaded: loaded (/usr/lib/systemd/system/postgresql-13.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sun 2024-03-10 01:35:00 UTC; 2 days ago
     Docs: https://www.postgresql.org/docs/13/static/
  Process: 91649 ExecStart=/usr/pgsql-13/bin/postmaster -D ${PGDATA} (code=exited, status=1/FAILURE)
 Main PID: 91649 (code=exited, status=1/FAILURE)

The CPU utilization plot in Check_MK is consistent with a long-running evaluation making use of the database, which might explain the failed locks that prevented the clean (per James):

!cpu_utilization_prod02.png!

Since the execution log is not available, I'm going to see if I can find the standard logs.

Need to take my kid to school. When I return, I'll scan those logs, and perhaps try to restart the Postgres server manually (after finding the command to use),

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T12:12:12Z


First, from the clean-and-switch perspective, here is the sequence of events:

< 3/10 0Z: wresdb-prod02 is active.
3/10 0Z: Successfully cleaned-and-switched to wresdb-prod01.
3/10 12Z: Attempts to clean-and-switch to wresdb-prod02, but the clean fails.
Every half-hour after: Continues to attempt a clean-and-switch, but the clean fails.

The Postgres server failed out at about 1:30Z. If the timezone of the Check_MK image is UTC, then there was an evaluation that extended beyond the clean-and-switch, ending around 1:30Z.

I found one evaluation that extended beyond the clean-and-switch at 0Z: job id 7293259659872711481 ended at 1:08Z. It was an EnsPost evaluation that ran for nearly four hours. Here is some suspicious logging from that evaluation:

2024-03-09T21:23:24.778+0000 INFO EvaluationSubscriber Started subscriber 9XrGL6QUaUPsJ3awRAbhRZtpuYs
2024-03-09T21:23:24.780+0000 INFO SourceLoader Loading the declared datasets. Depending on many factors (including dataset size, dataset design, data service implementation, service availability, network bandwidth, network latency, storage bandwidth, storage latency, concurrent evaluations on shared resources, concurrent computation on shared resources) this can take a while...
2024-03-09T21:28:57.230+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@6e3fe2d6
2024-03-09T21:29:32.645+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@3c2fda36
2024-03-09T21:29:32.712+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 112500 on connection 1 in 5ms.
2024-03-09T21:33:36.794+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@3059c563
2024-03-09T21:34:13.191+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@5bf0bf85
2024-03-09T21:34:17.650+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@72e6b538
2024-03-09T21:42:38.357+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@4011f34f
2024-03-09T21:43:13.084+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@5d985a76
2024-03-09T21:43:17.005+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@fc8adbe
2024-03-09T21:45:37.422+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@666601a3
2024-03-09T21:49:12.492+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@3cbb65f
2024-03-09T21:49:12.522+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 113193 on connection 1 in 5ms.
2024-03-09T21:49:13.464+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 113193 on connection 1 in 5ms.
2024-03-09T21:49:46.394+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@7192d86e
2024-03-09T22:00:59.314+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@7c5678e5
2024-03-09T22:01:01.489+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@66ebce68
2024-03-09T22:01:01.583+0000 WARN DatabaseLockManagerPostgres Lost two lock-holding database connections at once.
2024-03-09T22:01:34.617+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@4687de76
2024-03-09T22:02:34.566+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@118167a4
2024-03-09T22:03:24.534+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@7a479a44
2024-03-09T22:03:59.400+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@14bf4aba
2024-03-09T22:04:33.330+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@637f7992
2024-03-09T22:05:11.496+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@5fb6d88d
2024-03-09T22:07:06.478+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@5506f156
2024-03-09T22:07:48.542+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@34f1ff8e
2024-03-09T22:08:21.321+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@6f8f12e4
2024-03-09T22:10:05.676+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@4b38cbde
2024-03-09T22:10:39.519+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@3f8cc25f
2024-03-09T22:11:14.575+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@41440e00
2024-03-09T22:11:49.870+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@7a814bdb
2024-03-09T22:12:24.938+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@67c17525
2024-03-09T22:13:27.953+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@36f79687
2024-03-09T22:16:21.670+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@3edb4cf0
2024-03-09T22:16:56.532+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@709fd8aa
2024-03-09T22:17:29.523+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@652c9b1a
2024-03-09T22:19:22.571+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@13394e6e
2024-03-09T22:21:04.366+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@64869750
2024-03-09T22:33:06.455+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@515bcd
2024-03-09T23:01:59.896+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@575157b0
2024-03-09T23:02:35.812+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@35c13b8f
2024-03-09T23:02:35.840+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 118648 on connection 1 in 5ms.
2024-03-09T23:02:35.847+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 118648 on connection 1 in 5ms.
2024-03-09T23:02:35.854+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 118648 on connection 1 in 5ms.
2024-03-09T23:02:35.861+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 118648 on connection 1 in 5ms.
2024-03-09T23:02:35.868+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 118648 on connection 1 in 5ms.
2024-03-09T23:02:35.879+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
wres.io.database.locking.DatabaseLockFailed: Another WRES instance is performing a conflicting function. Failed to lock|unlock with prefix=2, lockName=118648, operation=LOCK_EXCLUSIVE
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:900)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-03-10T00:55:02.456+0000 INFO SourceLoader Finished loading the declared datasets in PT3H31M37.675751433S.
2024-03-10T00:55:04.173+0000 INFO DatabaseProject The identity of the database project is '8'.
2024-03-10T00:55:10.671+0000 INFO DatabaseProject Validating the project and loading preliminary metadata...

Were there any other evaluations? Looking,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T12:24:55Z


6023895363893115176 - Posted just after the clean-and-switch, so it was handled by wresdb-prod01. It was one of NWRFC's evaluations and encountered the disconnection issue Evan is looking into.

No other evaluation was using @wresdb-prod02@ at that time.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T12:36:01Z


Focusing on job 7293259659872711481, it was a pretty typical EnsPost evaluation, with about 1.9 GB of HEFS output, total, to ingest. We've handled more than that before, and in far less than the 3h 31m reported in the job:

2024-03-10T00:55:02.456+0000 INFO SourceLoader Finished loading the declared datasets in PT3H31M37.675751433S.

Something was already going wrong. Check_MK shows no evidence of a disk filling up, and that job appears to be the only one active during its run.

Unless someone can point to something else to look into, I'm going to try to start the database. Looking for the command to use,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T12:41:53Z


See #111494-14. I think the command is this:

/usr/pgsql-13/bin/pg_ctl -D /var/lib/pgsql/13/data restart

I might need to use @start@ instead of @restart@.
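
For reference, pg_ctl also accepts a wait timeout, which matters if crash recovery takes a while; by default it only waits 60 seconds before giving up on the start. A sketch, reusing the paths above:

/usr/pgsql-13/bin/pg_ctl -D /var/lib/pgsql/13/data -w -t 600 start    # wait up to 10 minutes for the server to accept connections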

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-03-12T12:44:00Z


I think @restart@ just wraps a @stop@ and @start@, from memory.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T12:46:53Z


From #110846-19, if the server isn't already running when doing the @restart@, it will try to start it.

I don't know the cause of the issues, which may require some ITSG intervention, but I want to see if I can start the server manually. So, I'm going to give that a try now.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T12:47:29Z


James wrote:

I think @restart@ just wraps a @stop@ and @start@, from memory.

Right. The stop will fail, but the start will proceed. Or I can just run @start@. Anyway, giving it a shot,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T12:49:05Z


I don't recall it taking this long to start. Progress so far:

-bash-4.2$ /usr/pgsql-13/bin/pg_ctl -D /var/lib/pgsql/13/data restart
pg_ctl: PID file "/var/lib/pgsql/13/data/postmaster.pid" does not exist
Is server running?
trying to start server anyway
waiting for server to start.....2024-03-12 12:47:52.286 UTC   [70094]  65f04ef7.111ce LOG:  redirecting log output to logging collector process
2024-03-12 12:47:52.286 UTC   [70094]  65f04ef7.111ce HINT:  Future log output will appear in directory "pg_log".
...............................................
epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T13:01:23Z


This is confusing. I restarted the database and it took a very long time:

-bash-4.2$ /usr/pgsql-13/bin/pg_ctl -D /var/lib/pgsql/13/data restart
pg_ctl: PID file "/var/lib/pgsql/13/data/postmaster.pid" does not exist
Is server running?
trying to start server anyway
waiting for server to start.....2024-03-12 12:47:52.286 UTC   [70094]  65f04ef7.111ce LOG:  redirecting log output to logging collector process
2024-03-12 12:47:52.286 UTC   [70094]  65f04ef7.111ce HINT:  Future log output will appear in directory "pg_log".
.......................................................... stopped waiting
pg_ctl: server did not start in time

I went into the March 12 log, which I could have sworn did not exist before I ran this command, and yet it includes log messages from hours ago:

2023-03-12 07:26:22.840 UTC 10.3.22.1(56918) [unknown] [27093] [unknown] 640d7e9e.69d5 LOG:  connection received: host=10.3.22.1 port=56918
2023-03-12 07:26:22.841 UTC 10.3.22.1(56918) [unknown] [27093] [unknown] 640d7e9e.69d5 LOG:  invalid length of startup packet
2023-03-12 07:26:33.078 UTC 10.3.22.1(59418) [unknown] [27106] [unknown] 640d7ea9.69e2 LOG:  connection received: host=10.3.22.1 port=59418
2023-03-12 07:26:33.078 UTC 10.3.22.1(59418) [unknown] [27106] [unknown] 640d7ea9.69e2 LOG:  invalid length of startup packet
2023-03-12 07:27:01.224 UTC 10.3.22.1(35734) [unknown] [27135] [unknown] 640d7ec5.69ff LOG:  connection received: host=10.3.22.1 port=35734
2023-03-12 07:27:02.913 UTC 10.3.22.1(35734) [unknown] [27135] [unknown] 640d7ec5.69ff LOG:  invalid length of startup packet
2023-03-12 07:27:02.928 UTC 10.3.22.1(35776) [unknown] [27141] [unknown] 640d7ec6.6a05 LOG:  connection received: host=10.3.22.1 port=35776
2023-03-12 07:27:02.929 UTC 10.3.22.1(35776) [unknown] [27141] [unknown] 640d7ec6.6a05 LOG:  invalid length of startup packet
2023-03-12 07:27:02.929 UTC 10.3.22.1(35778) [unknown] [27142] [unknown] 640d7ec6.6a06 LOG:  connection received: host=10.3.22.1 port=35778
2023-03-12 07:27:02.929 UTC 10.3.22.1(35778) [unknown] [27142] [unknown] 640d7ec6.6a06 FATAL:  unsupported frontend protocol 32769.1: server supports 2.0 to 3.0
...

There is a series of queries and other messages logged that appear to correspond to a vacuum and an ongoing evaluation. For example:

2023-03-12 12:23:11.318 UTC   [2300]  640dc42f.8fc LOG:  automatic analyze of table "wres8.wres.timeseriesvalue_lead_negative_8" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-03-12 12:23:11.329 UTC   [2300]  640dc42f.8fc LOG:  automatic analyze of table "wres8.wres.timeseriesvalue_lead_negative_9" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-03-12 12:23:11.340 UTC   [2300]  640dc42f.8fc LOG:  automatic analyze of table "wres8.wres.timeseriesvalue_lead_negative_6" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-03-12 12:23:11.351 UTC   [2300]  640dc42f.8fc LOG:  automatic analyze of table "wres8.wres.timeseriesvalue_lead_negative_7" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-03-12 12:23:11.362 UTC   [2300]  640dc42f.8fc LOG:  automatic analyze of table "wres8.wres.timeseriesvalue_lead_negative_10" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-03-12 12:23:12.914 UTC 10.3.10.104(60104) wres8 [2281] wres_user8 640dc42b.8e9 ERROR:  could not serialize access due to read/write dependencies among transactions
2023-03-12 12:23:12.914 UTC 10.3.10.104(60104) wres8 [2281] wres_user8 640dc42b.8e9 DETAIL:  Reason code: Canceled on identification as a pivot, during conflict out checking.
2023-03-12 12:23:12.914 UTC 10.3.10.104(60104) wres8 [2281] wres_user8 640dc42b.8e9 HINT:  The transaction might succeed if retried.
2023-03-12 12:23:12.914 UTC 10.3.10.104(60104) wres8 [2281] wres_user8 640dc42b.8e9 CONTEXT:  SQL statement "SELECT 1 FROM ONLY "wres"."feature" x WHERE "feature_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
2023-03-12 12:23:12.914 UTC 10.3.10.104(60104) wres8 [2281] wres_user8 640dc42b.8e9 STATEMENT:  COMMIT
2023-03-12 12:23:18.041 UTC 10.3.10.104(60104) wres8 [2281] wres_user8 640dc42b.8e9 ERROR:  could not serialize access due to read/write dependencies among transactions
2023-03-12 12:23:18.041 UTC 10.3.10.104(60104) wres8 [2281] wres_user8 640dc42b.8e9 DETAIL:  Reason code: Canceled on identification as a pivot, during write.
2023-03-12 12:23:18.041 UTC 10.3.10.104(60104) wres8 [2281] wres_user8 640dc42b.8e9 HINT:  The transaction might succeed if retried.
2023-03-12 12:23:18.041 UTC 10.3.10.104(60104) wres8 [2281] wres_user8 640dc42b.8e9 STATEMENT:  INSERT INTO wres.Source ( path, lead, hash, is_point_data, feature_id, timescale_id, measurementunit_id, variable_name )
            SELECT $1, $2, $3, $4, $5, $6, $7, $8
            WHERE NOT EXISTS
            (
                SELECT 1
                FROM wres.Source
                WHERE hash = $9
            )
        RETURNING *

Way down at the bottom I see this:

2024-03-12 12:49:19.388 UTC   [70098]  65f04ef8.111d2 LOG:  database system was not properly shut down; automatic recovery in progress
2024-03-12 12:49:19.399 UTC   [70098]  65f04ef8.111d2 LOG:  redo starts at 1637/5E61C960
2024-03-12 12:50:09.838 UTC   [70098]  65f04ef8.111d2 LOG:  invalid record length at 1638/1E795A08: wanted 24, got 0
2024-03-12 12:50:09.838 UTC   [70098]  65f04ef8.111d2 LOG:  redo done at 1638/1E7959E0
2024-03-12 12:50:09.841 UTC   [70098]  65f04ef8.111d2 LOG:  checkpoint starting: end-of-recovery immediate
2024-03-12 12:50:23.473 UTC   [70098]  65f04ef8.111d2 LOG:  checkpoint complete: wrote 559422 buffers (58.3%); 0 WAL file(s) added, 109 removed, 83 recycled; write=4.208 s, sync=8.862 s, total=13.635 s; sync files=196, longest=0.589 s, average=0.046 s; distance=3147236 kB, estimate=3147236 kB
2024-03-12 12:50:23.770 UTC   [70094]  65f04ef7.111ce LOG:  database system is ready to accept connections
2024-03-12 12:55:35.100 UTC 10.3.13.29(51376) [unknown] [70712] [unknown] 65f050c7.11438 LOG:  connection received: host=10.3.13.29 port=51376
2024-03-12 12:55:35.114 UTC 10.3.13.29(51376) wres8 [70712] wres_user8 65f050c7.11438 LOG:  connection authorized: user=wres_user8 database=wres8 application_name=psql SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)

The "connection received" is me connecting via @psql@ despite the fact that the restart timed out. I was able to connect. In fact, I can even query the database and see the very long EnsPost evaluation at the very end:

 HEFS_EnsPost Streamflow                 | 3974b05d5dae8b136308e5686a1aab80 | syWFg1LS0VjkKJrkxfnqGfF4tR8 | execute lab | f      | 2024-03-09 19:57:58.350658 | 2024-03-09 20:16:08.578488 |    18.170463836193086
 HEFS_EnsPost Streamflow                 | 929830c8f28b3a365bec1cb663774ce9 | H5NXg7bK_gpeDf6lsz5OwEADY54 | execute lab | f      | 2024-03-09 20:18:03.622562 | 2024-03-09 20:39:53.405928 |    21.829722766081492
 HEFS_EnsPost Streamflow                 | f204b61fae534199a78282681a0e6b66 | G8JUtALYtQI9FjiJ6KKsj2TlJlo | execute lab | f      | 2024-03-09 20:41:13.797263 | 2024-03-09 21:21:49.658803 |    40.597692334651946
 HEFS_EnsPost Streamflow                 | f70271250e998ab47f1695a887ddbce5 | TMv1mMPNrx1rYWtN6R7ZuoaA36M | execute lab | f      | 2024-03-09 21:23:24.156849 | 2024-03-10 01:08:43.614805 |     225.3242992679278

To be honest, I'm not sure what's going on. But I'm guessing the clean-and-switch is probably attempting a connection as we speak.
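
One way to confirm that from the database side is to list the current sessions (a sketch; it assumes the clean-and-switch connects as wres_user8 to the wres8 database, like the sessions above):

psql -d wres8 -c "SELECT pid, usename, client_addr, state, backend_start FROM pg_stat_activity WHERE datname = 'wres8';"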

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T13:02:34Z


Yes, the clean just ran:

                                         | d41d8cd98f00b204e9800998ecf8427e |                             | cleandataba | f      | 2024-03-12 13:00:18.851283 | 2024-03-12 13:00:26.082375 |   0.12051819960276286

Um... Okay.

Why did I see log messages from before the start when the database came up?
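
One possible explanation, not confirmed on this host: the old entries shown above actually carry 2023 dates, so if @log_filename@ uses a month-day pattern without the year and @log_truncate_on_rotation@ is off, last year's output for the same calendar day stays in (or gets appended to) the same file. A sketch of how to check those settings:

psql -At -c "SHOW log_filename;"
psql -At -c "SHOW log_truncate_on_rotation;"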

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T13:28:13Z


I'm not sure why old log messages from before I started the server are showing up in the log. But here is the complete logging related to the restart, for posterity:

2024-03-12 12:47:52.286 UTC   [70094]  65f04ef7.111ce LOG:  starting PostgreSQL 13.14 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2024-03-12 12:47:52.289 UTC   [70094]  65f04ef7.111ce LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-03-12 12:47:52.289 UTC   [70094]  65f04ef7.111ce LOG:  listening on IPv6 address "::", port 5432
2024-03-12 12:47:52.290 UTC   [70094]  65f04ef7.111ce LOG:  listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
2024-03-12 12:47:52.340 UTC   [70094]  65f04ef7.111ce LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-03-12 12:47:52.375 UTC   [70098]  65f04ef8.111d2 LOG:  database system shutdown was interrupted; last known up at 2024-03-10 01:34:12 UTC
2024-03-12 12:48:23.177 UTC 10.3.13.29(51342) [unknown] [70126] [unknown] 65f04f17.111ee LOG:  connection received: host=10.3.13.29 port=51342
2024-03-12 12:48:23.184 UTC 10.3.13.29(51342) wres8 [70126] wres_user8 65f04f17.111ee FATAL:  the database system is starting up
2024-03-12 12:48:23.187 UTC 10.3.13.29(51344) [unknown] [70127] [unknown] 65f04f17.111ef LOG:  connection received: host=10.3.13.29 port=51344
2024-03-12 12:48:23.187 UTC 10.3.13.29(51344) wres8 [70127] wres_user8 65f04f17.111ef FATAL:  the database system is starting up
2024-03-12 12:49:19.388 UTC   [70098]  65f04ef8.111d2 LOG:  database system was not properly shut down; automatic recovery in progress
2024-03-12 12:49:19.399 UTC   [70098]  65f04ef8.111d2 LOG:  redo starts at 1637/5E61C960
2024-03-12 12:50:09.838 UTC   [70098]  65f04ef8.111d2 LOG:  invalid record length at 1638/1E795A08: wanted 24, got 0
2024-03-12 12:50:09.838 UTC   [70098]  65f04ef8.111d2 LOG:  redo done at 1638/1E7959E0
2024-03-12 12:50:09.841 UTC   [70098]  65f04ef8.111d2 LOG:  checkpoint starting: end-of-recovery immediate
2024-03-12 12:50:23.473 UTC   [70098]  65f04ef8.111d2 LOG:  checkpoint complete: wrote 559422 buffers (58.3%); 0 WAL file(s) added, 109 removed, 83 recycled; write=4.208 s, sync=8.862 s, total=13.635 s; sync files=196, longest=0.589 s, average=0.046 s; distance=3147236 kB, estimate=3147236 kB
2024-03-12 12:50:23.770 UTC   [70094]  65f04ef7.111ce LOG:  database system is ready to accept connections

It took a little over 2 minutes. I guess that's why the @restart@ command timed out.
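
Startup ran from 12:47:52 until the server was ready at 12:50:23, well past pg_ctl's default 60-second wait. A sketch of confirming readiness after the fact (assumes local access as the postgres user):

pg_isready -p 5432
psql -d wres8 -At -c "SELECT pg_postmaster_start_time(), pg_is_in_recovery();"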

I'm going to create a JIRA ticket asking ITSG to look for unusual activity on wresdb-prod02 around 1:30Z on 3/10.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T14:31:05Z


Created ITSG-2062, asking ITSG to look at the systems from late March 9 through early March 10:

https://jira.nws.noaa.gov/servicedesk/customer/portal/1/ITSG-2062

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-03-12T14:52:58Z


Server is back up. I'm reducing this to urgent.

Hank