NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a developer, I don't want the WRES waiting forever when there are issues connecting to Postgres #145

Open epag opened 2 months ago

epag commented 2 months ago

Author Name: Hank (Hank) Original Redmine Issue: 122557, https://vlab.noaa.gov/redmine/issues/122557 Original Date: 2023-11-13


This ticket started as an investigation; see the original description below. The problem discovered was an infinite loop waiting for Postgres connections, described in the comments. This ticket can be resolved once that loop is addressed.

Hank

===================================

There are no records of a clean database attempt since Nov 11 at 0Z. Here is the wresdb-prod01 execution log starting with its last clean (Nov 10, 12Z):

                                           | d41d8cd98f00b204e9800998ecf8427e | cleandataba | f      | 2023-11-10 12:00:05.840781 | 2023-11-10 12:00:32.078148 | 0.4372894485791524
 RFC Training Example NWM Single-Valued    | b0ec6c973764ed5519bccf238d6a89bf | execute lab | t      | 2023-11-10 12:20:11.782604 | 2023-11-10 12:28:07.106557 |  7.922065881888072
 EnsPost Streamflow                        | fc1a48a9a5a03bb5592cd4ddcbfced52 | execute <?x | f      | 2023-11-10 16:24:29.165899 | 2023-11-10 22:16:51.872885 | 352.37844976584114

And here is the wresdb-prod02 log since its last clean:

                                        | d41d8cd98f00b204e9800998ecf8427e | cleandataba | f      | 2023-11-11 00:00:04.953735 | 2023-11-11 00:01:13.633506 | 1.1446628491083781
 NWM_MR_Flow_30day_CSV                  | 1e0f7a6cb59f60e47957b8941703c421 | execute <?x | t      | 2023-11-11 00:03:30.429633 | 2023-11-11 00:12:50.932935 |   9.34172170162201
 NWM_MR_Flow_90day_CSV                  | 5413a50f3dcd7253be52789d952639f7 | execute <?x | f      | 2023-11-11 00:13:06.653355 | 2023-11-11 00:46:25.071855 | 33.306975003083544
 RFC Training Example NWM Single-Valued | eeb4c4f1f27a15f0467329759a2adcb8 | execute lab | t      | 2023-11-11 00:20:17.674774 | 2023-11-11 00:36:39.603246 |  16.36547453403473
 RFC_AHPS_Flow_30day_CSV                | 09f59ad388346eb3d13f6b678fbe2ae3 | execute <?x | f      | 2023-11-11 00:46:53.385458 | 2023-11-11 01:12:06.389794 | 25.216738935311636
 RFC_AHPS_Flow_90day_CSV                | eeab03069feaa871d4768eba2215167c | execute <?x | f      | 2023-11-11 01:12:32.864332 | 2023-11-11 01:51:36.592425 |  39.06213488578796
 RFC Training Example NWM Single-Valued | d3edb96b41d7605f95a078431e973493 | execute lab | t      | 2023-11-11 12:20:19.765966 | 2023-11-11 12:27:18.032585 | 6.9711103161176045
 NWM_MR_Flow_30day_CSV                  | c1efd9a4cc7c3c524f93e08652c5d167 | execute <?x | t      | 2023-11-12 00:05:07.03539  | 2023-11-12 00:14:30.743092 |   9.39512836933136
 NWM_MR_Flow_90day_CSV                  | 8e5ca0680893e81c624990ec77d5c833 | execute <?x | f      | 2023-11-12 00:14:48.490824 | 2023-11-12 00:47:22.863685 |   32.5728810151418
 RFC Training Example NWM Single-Valued | e6afa4af3b04ed167ed36c4d6d9c4439 | execute lab | t      | 2023-11-12 00:20:22.841061 | 2023-11-12 00:31:39.752592 | 11.281858849525452
 RFC_AHPS_Flow_30day_CSV                | cd732d2f17bdbd99f65832dc6237df5a | execute <?x | f      | 2023-11-12 00:48:04.333849 | 2023-11-12 00:57:48.378296 |  9.734074115753174
 RFC_AHPS_Flow_90day_CSV                | 2ac198b1cb3c09ce67abf5ef10041812 | execute <?x | f      | 2023-11-12 00:58:00.671811 | 2023-11-12 01:11:12.374429 | 13.195043631394704
 RFC Training Example NWM Single-Valued | 9eb7a5eeda3f4636cb7438bfba24de71 | execute lab | t      | 2023-11-12 12:20:21.966293 | 2023-11-12 12:27:40.269204 |  7.305048513412475
 NWM_MR_Flow_30day_CSV                  | 9e2c18c026355cc82dbdbeef7e04516c | execute <?x | t      | 2023-11-13 00:02:50.251799 | 2023-11-13 00:13:15.17549  | 10.415394846598307
 NWM_MR_Flow_90day_CSV                  | d7dede50ea9ce41651deaf25077c6f16 | execute <?x | f      | 2023-11-13 00:13:33.19385  | 2023-11-13 00:46:48.874411 | 33.261342684427895
 RFC Training Example NWM Single-Valued | 577fe1ea98b17e9b6164ce8548c4bca2 | execute lab | t      | 2023-11-13 00:20:23.657638 | 2023-11-13 00:32:06.630518 |  11.71621466477712
 RFC_AHPS_Flow_30day_CSV                | 2e72c90d58cca02078dea4681f0f8658 | execute <?x | f      | 2023-11-13 00:47:15.643177 | 2023-11-13 00:54:32.178372 |  7.275586581230163
 RFC_AHPS_Flow_90day_CSV                | 1b6614be555a99f2cbafcd76460520a9 | execute <?x | f      | 2023-11-13 00:54:58.378493 | 2023-11-13 01:04:18.911876 |  9.342223048210144
 RFC Training Example NWM Single-Valued | 19c0d1329edcc084573904191e29a7d7 | execute lab | t      | 2023-11-13 12:20:28.417275 | 2023-11-13 12:27:25.649502 | 6.9538704514503475

Based on the execution log, only scheduled evaluations have been executing.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:04:06Z


James and/or Evan: I might need your help with this one, at least in understanding what is happening and why this is happening now (and never before, iirc).

First, checking ongoing evaluations, I see two:

4565890164075768279 - An EnsPost evaluation started at 2023-11-13T13:06:39.312+0000.
6541405875594635684 - An EnsPost evaluation started at 2023-11-10T22:34:44.374+0000.

That second evaluation is the reason why cleans would fail, though I'm surprised they aren't logged. It's stuck in an exception loop that's been going on for days.

Here are the first few:

2023-11-11T06:28:49.209+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@4dcf24d1
2023-11-11T06:28:59.220+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-11T06:29:00.221+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@4dcf24d1
2023-11-11T06:29:10.232+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-11T06:29:11.233+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@4dcf24d1
2023-11-11T06:29:21.244+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted

Here are the most recent ones:

2023-11-12T04:56:46.081+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@62cc8e3a
2023-11-12T04:56:56.092+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-12T04:56:57.092+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@62cc8e3a
2023-11-12T04:57:07.101+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-12T04:57:08.102+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@62cc8e3a
2023-11-12T23:12:25.589+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@56632c9b
2023-11-12T23:12:35.602+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-12T23:12:36.602+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@56632c9b
2023-11-12T23:12:46.612+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-12T23:12:47.613+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@56632c9b
2023-11-12T23:12:57.624+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-12T23:12:58.624+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@56632c9b

The job is still marked as in progress. Taking a quick look to make sure that explains why the cleans are failing,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:06:18Z


A smoke test passed, FYI.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:09:40Z


Clean attempts have been repeatedly failing. I confirmed that in the log. Were failed clean attempts not logged as of the current Release 6.16?

I also noticed that clean attempts against the active database, which should occur when a clean-and-switch to the other database fails, have not been executing. Why? Because the clean-and-switch scripts for production are out of date. Ugh. They've been out of date since Jan 12. Double ugh.

Let me get the scripts back up to date so that we can see attempts to clean the active database when the other database cannot be cleaned.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:12:57Z


Looks like I made a change in staging, pushed the change to the repo, and then attempted to update the production clean-and-switch scripts to match what was pushed, but copied over an old copy of the scripts instead. Anyway, that is now fixed in production.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-11-13T14:16:46Z


If the job is still marked as in progress, then it would make sense to me why we aren't trying again. I don't think anything I had in 6.16 would affect things like this. Is prod01 functioning? It seems like we are losing the connection repeatedly.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T14:17:09Z


Superficially, I would say because there is a bug in the @DatabaseLockManagerPostgres@ or its caller. I note that @DatabaseLockManagerPostgres::testAndRefresh@ has a retry sequence, but that sequence is running in a continuous loop via a scheduled task until @DatabaseLockManagerPostgres::shutdown@ is called. So my best guess is that the @finally@ block in the @Evaluator@ didn't reach that shutdown task or that the @finally@ block wasn't itself reached. What was the logging prior to the loop? Anyway, that's a separate ticket.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:20:13Z


Now focusing on the issues with the super long evaluation...

The stdout logging is attached uncompressed. There appears to be some sort of retry-with-backoff happening if you look at the log messages starting from the bottom up.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:21:24Z


James wrote:

Superficially, I would say because there is a bug in the @DatabaseLockManagerPostgres@ or its caller. I note that @DatabaseLockManagerPostgres::testAndRefresh@ has a retry sequence, but that sequence is running in a continuous loop via a scheduled task until @DatabaseLockManagerPostgres::shutdown@ is called. So my best guess is that the @finally@ block in the @Evaluator@ didn't reach that shutdown task or that the @finally@ block wasn't itself reached. What was the logging prior to the loop? Anyway, that's a separate ticket.

Full stdout attached. If you think it's another ticket, can you create it?

I think the clean-and-switch issue is fixed, though I still don't understand why failed cleans are not being logged.

Let me look at the -prod01 issue Evan noted,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:24:42Z


Evan wrote:

If the job is still marked as in progress, then it would make sense to me why we aren't trying again. I don't think anything I had in 6.16 would affect things like this. Is prod01 functioning? It seems like we are losing the connection repeatedly.

The -prod01 server is available and I'm able to connect to it via psql. The -prod02 server is the active one, however, so I'm not sure whether a fresh WRES connection would succeed or fail.

If you all think it could be of value, I can spin up a standalone on nwcal-wres-ti01 and attempt to perform an evaluation against the -prod01 Postgres server.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:26:01Z


James:

I actually think this ticket should be repurposed to examine the issues with that EnsPost evaluation's database connection. The clean-and-switch not cleaning and switching is due to that problem, though it did point me to an old script still being used in production. Thoughts?

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-11-13T14:27:53Z


If it is up, then I don't really think there would be much benefit in doing that, in my opinion.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T14:31:22Z


Yes, that's fine, the history is here and I hadn't realized you'd just created this ticket.

Anyway, the superficial issue is that the @DatabaseLockManagerPostgres::testAndRefresh@ is effectively running in an infinite loop until the @shutdown@ is called, due to this instantiation:

        this.connectionMonitorService = Executors.newScheduledThreadPool( 1, monitorServiceNaming );
        Runnable recurringTask = new RefreshConnectionsTask( this );

        // Use fixed delay instead of rate now that task can sleep per source.
        this.connectionMonitorService.scheduleWithFixedDelay( recurringTask,
                                                              REFRESH_FREQUENCY.getSeconds(),
                                                              REFRESH_FREQUENCY.getSeconds(),
                                                              TimeUnit.SECONDS );

That is probably not a great idea, given that it means an evaluation can persist indefinitely until a connection is reestablished. As to why a connection could not be reestablished, that is another issue.
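
For illustration only, here is a minimal sketch of one way to bound that behaviour: the recurring refresh task counts consecutive failures and shuts the scheduler down once a limit is reached, so a dead database surfaces as a prompt failure rather than an endless WARN loop. The names used here (@BoundedRefreshTask@, @refreshOnce@, @MAX_CONSECUTIVE_FAILURES@) are hypothetical and not part of the WRES code base.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    /** Hypothetical sketch: a connection-refresh task that gives up after N consecutive failures. */
    public class BoundedRefreshTask implements Runnable
    {
        private static final int MAX_CONSECUTIVE_FAILURES = 30;  // assumed limit, not a WRES setting
        private static final long REFRESH_SECONDS = 10;

        private final AtomicInteger consecutiveFailures = new AtomicInteger();
        private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool( 1 );

        public void start()
        {
            // Same scheduling pattern as above, but the task can now stop itself.
            this.scheduler.scheduleWithFixedDelay( this, REFRESH_SECONDS, REFRESH_SECONDS, TimeUnit.SECONDS );
        }

        @Override
        public void run()
        {
            try
            {
                this.refreshOnce();                 // placeholder for the real test-and-refresh logic
                this.consecutiveFailures.set( 0 );  // any success resets the counter
            }
            catch ( RuntimeException e )
            {
                int failures = this.consecutiveFailures.incrementAndGet();
                if ( failures >= MAX_CONSECUTIVE_FAILURES )
                {
                    // Stop rescheduling so the evaluation fails promptly instead of hanging forever.
                    this.scheduler.shutdown();
                }
            }
        }

        private void refreshOnce()
        {
            // Placeholder: test or reacquire the lock connection and throw on failure.
        }
    }

Whether the right response is to stop quietly, fail the evaluation, or release the locks is a design question for the new ticket; the point of the sketch is only that the retry loop terminates.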

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:33:18Z


How can we debug the why? Would Postgres logs help?

I'm not sure if Check_MK would show this, but let me see if there is anything funky that happened to that server over the weekend,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:37:43Z


It's interesting that it appears to lose the connection, then reconnect, and then lose the connection again:

2023-11-11T06:53:46.916+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@1f7dab2e
2023-11-11T09:10:28.983+0000 INFO PoolReporter [126/364] Completed statistics for a pool in feature group 'DWNN6TIF-DWNN6TIF-DWNN6TIF'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: -1000000000-01-01T00:00:00Z, Latest valid time: +1000000000-12-31T23:59:59.999999999Z, Earliest lead duration: PT3090H, Latest lead duration: PT3114H ).
2023-11-11T09:10:28.987+0000 INFO PoolReporter [127/364] Completed statistics for a pool in feature group 'DWNN6TIF-DWNN6TIF-DWNN6TIF'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: -1000000000-01-01T00:00:00Z, Latest valid time: +1000000000-12-31T23:59:59.999999999Z, Earliest lead duration: PT3018H, Latest lead duration: PT3042H ).
2023-11-11T09:10:29.004+0000 INFO PoolReporter [128/364] Completed statistics for a pool in feature group 'DWNN6TIF-DWNN6TIF-DWNN6TIF'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: -1000000000-01-01T00:00:00Z, Latest valid time: +1000000000-12-31T23:59:59.999999999Z, Earliest lead duration: PT3066H, Latest lead duration: PT3090H ).
2023-11-11T09:10:29.012+0000 INFO PoolReporter [129/364] Completed statistics for a pool in feature group 'DWNN6TIF-DWNN6TIF-DWNN6TIF'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: -1000000000-01-01T00:00:00Z, Latest valid time: +1000000000-12-31T23:59:59.999999999Z, Earliest lead duration: PT3042H, Latest lead duration: PT3066H ).
2023-11-11T09:10:29.039+0000 INFO PoolReporter [130/364] Completed statistics for a pool in feature group 'DWNN6TIF-DWNN6TIF-DWNN6TIF'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: -1000000000-01-01T00:00:00Z, Latest valid time: +1000000000-12-31T23:59:59.999999999Z, Earliest lead duration: PT3138H, Latest lead duration: PT3162H ).
2023-11-11T09:10:29.047+0000 INFO PoolReporter [131/364] Completed statistics for a pool in feature group 'DWNN6TIF-DWNN6TIF-DWNN6TIF'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: -1000000000-01-01T00:00:00Z, Latest valid time: +1000000000-12-31T23:59:59.999999999Z, Earliest lead duration: PT3114H, Latest lead duration: PT3138H ).
2023-11-11T09:15:47.495+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@b8d2252
2023-11-11T09:15:57.509+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:40:30Z


The Check_MK CPU utilization shows no major hiccup when the first round of disconnects happens around Nov 11, 6:30Z, but shows no activity when that second disconnection happens around Nov 11, 9:15Z:

!122557_CPU_Utilization_prod01_20231111.png!

The WRES fails to reconnect again after that,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:42:41Z


So, there are two problems here:

  1. The WRES is losing connection.
  2. The WRES allows the job to go on forever as it tries to reconnect.

It appears that James may have found something to look at for 2; I guess I'll continue to focus on 1.

With the service still functional, I see no reason to stop the job immediately. I'll let it keep running while we investigate this. Note that I'm in meetings 10:30 - Noon.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T14:44:00Z


Yes, the Postgres log. Beyond this, debug logging on the app, but it's too late for that, obvs.

I note that these are not pooled connections; they are raw database connections acquired from @Database::getRawConnection@, but no exception was thrown on attempting that connection.

Could be related to the refactoring of @SystemSettings@ and @DatabaseSettings@, I guess, but it's hard to say without further logging. Could just be a coincidence that we haven't seen the server behavior that led to this until now, but it's an app bug of some kind because a connection could not be reestablished by this evaluation, even though others are succeeding.
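
As a point of reference, here is a minimal sketch of what acquiring a raw (non-pooled) connection through the PostgreSQL JDBC driver generally looks like; the URL, credentials, and timeout values are placeholders rather than the actual WRES configuration, and the method name @getRawConnection@ is reused only for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Properties;

    /** Hypothetical sketch: a raw JDBC connection obtained directly from the driver, no pool involved. */
    public final class RawConnectionExample
    {
        public static Connection getRawConnection() throws SQLException
        {
            Properties props = new Properties();
            props.setProperty( "user", "wres_user" );       // placeholder credentials
            props.setProperty( "password", "changeme" );
            // PostgreSQL JDBC driver options, both in seconds: how long to wait for the TCP connect
            // and for socket reads. A connect that exceeds the first limit surfaces as the
            // "SocketTimeoutException: Connect timed out" seen in the stack traces above.
            props.setProperty( "connectTimeout", "10" );
            props.setProperty( "socketTimeout", "30" );

            // Every call opens a brand new physical connection; nothing is borrowed from a pool.
            return DriverManager.getConnection( "jdbc:postgresql://dbhost:5432/wres", props );
        }
    }

The timeout values here are assumptions for illustration; the actual connection properties come from the WRES system settings.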

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-11-13T14:46:02Z


James wrote:

Could be related to the refactoring of @SystemSettings@ and @DatabaseSettings@, I guess, but it's hard to say without further logging. Could just be a coincidence that we haven't seen the server behavior that led to this until now, but it's an app bug of some kind because a connection could not be reestablished by this evaluation, even though others are succeeding.

These changes are not in production, so I don't think they are related.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T14:50:46Z


Probably a coincidence, then. Either way, an infinite outer retry loop (with a 5-retry inner loop) probably isn't a great idea. It needs a new ticket, but I also wouldn't say it's a super high priority until there is further evidence that this isn't some weird set of circumstances.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T14:51:37Z


I guess a cancel op would come in handy about now :)

No need to keep this one alive, I think; we are not going to learn more about it on the app side. The Postgres logs are worth a look.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:52:33Z


The Postgres logs show disconnections and very long queries. I'm not sure if this could be due to queries running for too long? Anyway, rather than share snippets, I've shared the entire log file for Nov 11. You can look around 6:30Z and 8-9Z.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T14:53:14Z


Once cancelled, I suggest we resubmit and inform the HEFS team, supplying them with a link to the new job.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T14:54:42Z


Hank wrote:

The Postgres logs show disconnections and very long queries. I'm not sure if this could be due to queries running for too long? Anyway, rather than share snippets, I've shared the entire log file for Nov 11. You can look around 6:30Z and 8-9Z.

Hank

Nah, the query timeout is on the app side (although it's sent to the server for action), but it's PT5H, which is very long, and that doesn't lead to connection failures.
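
For context, here is a minimal sketch of how an app-side JDBC query timeout is typically applied; the query and connection are placeholders, and PT5H is simply the duration quoted above.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.time.Duration;

    /** Hypothetical sketch: applying an app-side query timeout to a JDBC statement. */
    public final class QueryTimeoutExample
    {
        public static void runWithTimeout( Connection connection ) throws SQLException
        {
            Duration timeout = Duration.parse( "PT5H" );  // 18000 seconds

            try ( PreparedStatement statement = connection.prepareStatement( "SELECT 1" ) )  // placeholder query
            {
                // The timeout is set by the client; when it expires, the driver asks the server to
                // cancel the statement, which fails the query but does not drop the connection,
                // so a long timeout like this cannot explain the connection failures above.
                statement.setQueryTimeout( (int) timeout.getSeconds() );

                try ( ResultSet results = statement.executeQuery() )
                {
                    while ( results.next() )
                    {
                        // Process rows here.
                    }
                }
            }
        }
    }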

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-11-13T14:55:53Z


James wrote:

I guess a cancel op would come in handy about now :)

No need to keep this one alive, I think; we are not going to learn more about it on the app side. The Postgres logs are worth a look.

Yeah, I saw that you closed that other ticket, so I should be able to get the job-cancel work out today. It doesn't help with the current job, but it will be nice for the future.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:57:40Z


Example query (this is the first long one that is logged):

2023-11-11 04:07:41.172 UTC 10.3.10.104(35106) wres8 [27334] wres_user8 654ecbec.6ac6 LOG:  duration: 12755831.719 ms  execute S_5/C_6: SELECT
            metadata.series_id AS series_id,
            metadata.reference_time + INTERVAL '1' MINUTE * TSV.lead AS valid_time,
            metadata.reference_time,
            ARRAY_AGG(TSV.series_value ORDER BY TS.ensemble_id) AS ensemble_members,
            ARRAY_AGG(TS.ensemble_id ORDER BY TS.ensemble_id) AS ensemble_ids,
            metadata.measurementunit_id,
            metadata.scale_period,
            metadata.scale_function,
            metadata.feature_id,
            metadata.occurrences
        FROM
        (
            SELECT
                S.source_id AS series_id,
                MAX( reference_time ) AS reference_time,
                S.feature_id,
                S.measurementunit_id,
                TimeScale.duration_ms AS scale_period,
                TimeScale.function_name AS scale_function,
                COUNT(*) AS occurrences
            FROM wres.Source S
            INNER JOIN wres.ProjectSource PS
                ON PS.source_id = S.source_id
            INNER JOIN wres.TimeSeriesReferenceTime TSRT
                ON TSRT.source_id = S.source_id
            LEFT JOIN wres.TimeScale TimeScale
                ON TimeScale.timescale_id = S.timescale_id
        WHERE PS.project_id = $1
            AND S.variable_name = $2
            AND S.feature_id = $3
            AND PS.member = $4
            GROUP BY S.source_id,
                S.measurementunit_id,
                TimeScale.duration_ms,
                TimeScale.function_name
        ) AS metadata
        INNER JOIN wres.TimeSeries TS
            ON TS.source_id = metadata.series_id
        INNER JOIN wres.TimeSeriesValue TSV
            ON TSV.timeseries_id = TS.timeseries_id
        WHERE TSV.lead > $5
            AND TSV.lead <= $6
            AND ( EXTRACT( MONTH FROM metadata.reference_time ) > $7 OR ( EXTRACT( MONTH FROM metadata.reference_time ) = $8 AND EXTRACT( DAY FROM metadata.reference_time ) >= $9 ) )
            AND ( EXTRACT( MONTH FROM metadata.reference_time ) < $10 OR ( EXTRACT( MONTH FROM metadata.reference_time ) = $11 AND EXTRACT( DAY FROM metadata.reference_time ) <= $12 ) )
        GROUP BY metadata.series_id,metadata.reference_time, metadata.feature_id, TSV.lead, metadata.scale_period, metadata.scale_function, metadata.measurementunit_id,metadata.occurrences

Another one:

2023-11-11 04:07:41.173 UTC 10.3.10.104(35092) wres8 [27190] wres_user8 654ecbd9.6a36 LOG:  duration: 12746447.858 ms  execute S_5/C_7: SELECT
            metadata.series_id AS series_id,
            metadata.reference_time + INTERVAL '1' MINUTE * TSV.lead AS valid_time,
            metadata.reference_time,
            ARRAY_AGG(TSV.series_value ORDER BY TS.ensemble_id) AS ensemble_members,
            ARRAY_AGG(TS.ensemble_id ORDER BY TS.ensemble_id) AS ensemble_ids,
            metadata.measurementunit_id,
            metadata.scale_period,
            metadata.scale_function,
            metadata.feature_id,
            metadata.occurrences
        FROM
        (
            SELECT
                S.source_id AS series_id,
                MAX( reference_time ) AS reference_time,
                S.feature_id,
                S.measurementunit_id,
                TimeScale.duration_ms AS scale_period,
                TimeScale.function_name AS scale_function,
                COUNT(*) AS occurrences
            FROM wres.Source S
            INNER JOIN wres.ProjectSource PS
                ON PS.source_id = S.source_id
            INNER JOIN wres.TimeSeriesReferenceTime TSRT
                ON TSRT.source_id = S.source_id
            LEFT JOIN wres.TimeScale TimeScale
                ON TimeScale.timescale_id = S.timescale_id
        WHERE PS.project_id = $1
            AND S.variable_name = $2
            AND S.feature_id = $3
            AND PS.member = $4
            GROUP BY S.source_id,
                S.measurementunit_id,
                TimeScale.duration_ms,
                TimeScale.function_name
        ) AS metadata
        INNER JOIN wres.TimeSeries TS
            ON TS.source_id = metadata.series_id
        INNER JOIN wres.TimeSeriesValue TSV
            ON TSV.timeseries_id = TS.timeseries_id
        WHERE TSV.lead > $5
            AND TSV.lead <= $6
            AND ( EXTRACT( MONTH FROM metadata.reference_time ) > $7 OR ( EXTRACT( MONTH FROM metadata.reference_time ) = $8 AND EXTRACT( DAY FROM metadata.reference_time ) >= $9 ) )
            AND ( EXTRACT( MONTH FROM metadata.reference_time ) < $10 OR ( EXTRACT( MONTH FROM metadata.reference_time ) = $11 AND EXTRACT( DAY FROM metadata.reference_time ) <= $12 ) )
        GROUP BY metadata.series_id,metadata.reference_time, metadata.feature_id, TSV.lead, metadata.scale_period, metadata.scale_function, metadata.measurementunit_id,metadata.occurrences

Here is a nasty one later in the log:

2023-11-11 08:33:09.767 UTC 10.3.10.104(35122) wres8 [27362] wres_user8 654ecc06.6ae2 LOG:  duration: 21882812.714 ms  plan:
        Query Text: SELECT
            metadata.series_id AS series_id,
            metadata.reference_time + INTERVAL '1' MINUTE * TSV.lead AS valid_time,
            metadata.reference_time,
            ARRAY_AGG(TSV.series_value ORDER BY TS.ensemble_id) AS ensemble_members,
            ARRAY_AGG(TS.ensemble_id ORDER BY TS.ensemble_id) AS ensemble_ids,
            metadata.measurementunit_id,
            metadata.scale_period,
            metadata.scale_function,
            metadata.feature_id,
            metadata.occurrences
        FROM
        (
            SELECT
                S.source_id AS series_id,
                MAX( reference_time ) AS reference_time,
                S.feature_id,
                S.measurementunit_id,
                TimeScale.duration_ms AS scale_period,
                TimeScale.function_name AS scale_function,
                COUNT(*) AS occurrences
            FROM wres.Source S
            INNER JOIN wres.ProjectSource PS
                ON PS.source_id = S.source_id
            INNER JOIN wres.TimeSeriesReferenceTime TSRT
                ON TSRT.source_id = S.source_id
            LEFT JOIN wres.TimeScale TimeScale
                ON TimeScale.timescale_id = S.timescale_id
        WHERE PS.project_id = $1
            AND S.variable_name = $2
            AND S.feature_id = $3
            AND PS.member = $4
            GROUP BY S.source_id,
                S.measurementunit_id,
                TimeScale.duration_ms,
                TimeScale.function_name
        ) AS metadata
        INNER JOIN wres.TimeSeries TS
            ON TS.source_id = metadata.series_id
        INNER JOIN wres.TimeSeriesValue TSV
            ON TSV.timeseries_id = TS.timeseries_id
        WHERE TSV.lead > $5
            AND TSV.lead <= $6
            AND ( EXTRACT( MONTH FROM metadata.reference_time ) > $7 OR ( EXTRACT( MONTH FROM metadata.reference_time ) = $8 AND EXTRACT( DAY FROM metadata.reference_time ) >= $9 ) )
            AND ( EXTRACT( MONTH FROM metadata.reference_time ) < $10 OR ( EXTRACT( MONTH FROM metadata.reference_time ) = $11 AND EXTRACT( DAY FROM metadata.reference_time ) <= $12 ) )
        GROUP BY metadata.series_id,metadata.reference_time, metadata.feature_id, TSV.lead, metadata.scale_period, metadata.scale_function, metadata.measurementunit_id,metadata.occurrences

        GroupAggregate  (cost=3458.59..7686528.38 rows=40000 width=120)
          Group Key: s.source_id, (max(tsrt.reference_time)), s.feature_id, tsv.lead, timescale.duration_ms, timescale.function_name, s.measurementunit_id, (count(*))
          ->  Incremental Sort  (cost=3458.59..7678600.85 rows=281101 width=60)
                Sort Key: s.source_id, (max(tsrt.reference_time)), s.feature_id, tsv.lead, timescale.duration_ms, timescale.function_name, s.measurementunit_id, (count(*))
                Presorted Key: s.source_id
                ->  Nested Loop  (cost=3274.68..7666864.97 rows=281101 width=60)
                      ->  Nested Loop  (cost=3274.10..15222.39 rows=34134 width=52)
                            ->  GroupAggregate  (cost=3273.67..3600.01 rows=521 width=44)
                                  Group Key: s.source_id, timescale.duration_ms, timescale.function_name
                                  Filter: (((date_part('month'::text, max(tsrt.reference_time)) > '1'::double precision) OR ((date_part('month'::text, max(tsrt.reference_time)) = '1'::double precision) AND (date_part('day'::text, max(tsrt.reference_time)) >= '1'::double precision))) AND ((date_part('month'::text, max(tsrt.reference_time)) < '12'::double precision) OR ((date_part('month'::text, max(tsrt.reference_time)) = '12'::double precision) AND (date_part('day'::text, max(tsrt.reference_time)) <= '31'::double precision))))
                                  ->  Sort  (cost=3273.67..3285.33 rows=4662 width=36)
                                        Sort Key: s.source_id, timescale.duration_ms, timescale.function_name
                                        ->  Hash Left Join  (cost=2004.19..2989.60 rows=4662 width=36)
                                              Hash Cond: (s.timescale_id = timescale.timescale_id)
                                              ->  Hash Join  (cost=2002.99..2970.40 rows=4662 width=26)
                                                    Hash Cond: (tsrt.source_id = ps.source_id)
                                                    ->  Seq Scan on timeseriesreferencetime tsrt  (cost=0.00..753.39 rows=44639 width=16)
                                                    ->  Hash  (cost=1944.72..1944.72 rows=4662 width=26)
                                                          ->  Hash Join  (cost=528.21..1944.72 rows=4662 width=26)
                                                                Hash Cond: (s.source_id = ps.source_id)
                                                                ->  Index Scan using source_feature_index on source s  (cost=0.29..1296.07 rows=19762 width=18)
                                                                      Index Cond: (feature_id = '2083372'::bigint)
                                                                      Filter: ((variable_name)::text = 'SQIN'::text)
                                                                ->  Hash  (cost=396.29..396.29 rows=10530 width=8)
                                                                      ->  Index Scan using projectsource_project_index on projectsource ps  (cost=0.29..396.29 rows=10530 width=8)
                                                                            Index Cond: ((project_id = '5144'::bigint) AND ((member)::text = 'right'::text))
                                              ->  Hash  (cost=1.09..1.09 rows=9 width=18)
                                                    ->  Seq Scan on timescale  (cost=0.00..1.09 rows=9 width=18)
                            ->  Index Scan using timeseries_source_ensemble_unique_index on timeseries ts  (cost=0.43..21.64 rows=66 width=16)
                                  Index Cond: (source_id = s.source_id)
                      ->  Index Scan using timeseriesvalue_lead_above_150_timeseries_id_lead_idx on timeseriesvalue_lead_above_150 tsv  (cost=0.58..217.97 rows=619 width=16)
                            Index Cond: ((timeseries_id = ts.timeseries_id) AND (lead > '186840'::bigint) AND (lead <= '189720'::bigint))

Anyway, you all are probably better at understanding these logs.

Again, it seems like funky IT things may be happening on the machines over the weekend. But asking them what is going on almost never yields fruit... if they even yield a response!

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T14:58:26Z


I'll cancel the job the old-fashioned way (find the container and kill it from within) and then continue to search for what happened over the weekend.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T14:59:32Z


Odd cycle.

2023-11-11 10:09:55.311 UTC 10.3.10.104(39830) [unknown] [9538] [unknown] 654f52f3.2542 LOG:  connection received: host=10.3.10.104 port=39830
2023-11-11 10:09:55.326 UTC 10.3.10.104(39830) wres8 [9538] wres_user8 654f52f3.2542 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:18:32.999 UTC 10.3.22.1(59966) [unknown] [10879] [unknown] 654f54f8.2a7f LOG:  connection received: host=10.3.22.1 port=59966
2023-11-11 10:18:33.003 UTC 10.3.22.1(59966) [unknown] [10879] [unknown] 654f54f8.2a7f LOG:  invalid length of startup packet
2023-11-11 10:18:33.015 UTC 10.3.22.1(59970) [unknown] [10882] [unknown] 654f54f9.2a82 LOG:  connection received: host=10.3.22.1 port=59970
2023-11-11 10:18:33.019 UTC 10.3.22.1(59970) [unknown] [10882] [unknown] 654f54f9.2a82 LOG:  invalid length of startup packet
2023-11-11 10:18:33.030 UTC 10.3.22.1(59976) [unknown] [10883] [unknown] 654f54f9.2a83 LOG:  connection received: host=10.3.22.1 port=59976
2023-11-11 10:18:33.034 UTC 10.3.22.1(59976) [unknown] [10883] [unknown] 654f54f9.2a83 LOG:  invalid length of startup packet
2023-11-11 10:18:33.041 UTC 10.3.22.1(59980) [unknown] [10884] [unknown] 654f54f9.2a84 LOG:  connection received: host=10.3.22.1 port=59980
2023-11-11 10:18:33.045 UTC 10.3.22.1(59980) [unknown] [10884] [unknown] 654f54f9.2a84 LOG:  invalid length of startup packet
2023-11-11 10:18:33.057 UTC 10.3.22.1(59984) [unknown] [10885] [unknown] 654f54f9.2a85 LOG:  connection received: host=10.3.22.1 port=59984
2023-11-11 10:18:33.061 UTC 10.3.22.1(59984) [unknown] [10885] [unknown] 654f54f9.2a85 LOG:  invalid length of startup packet
2023-11-11 10:18:33.072 UTC 10.3.22.1(59990) [unknown] [10886] [unknown] 654f54f9.2a86 LOG:  connection received: host=10.3.22.1 port=59990
2023-11-11 10:18:33.077 UTC 10.3.22.1(59990) [unknown] [10886] [unknown] 654f54f9.2a86 LOG:  invalid length of startup packet
2023-11-11 10:18:33.088 UTC 10.3.22.1(59994) [unknown] [10887] [unknown] 654f54f9.2a87 LOG:  connection received: host=10.3.22.1 port=59994
2023-11-11 10:18:33.095 UTC 10.3.22.1(59994) [unknown] [10887] [unknown] 654f54f9.2a87 LOG:  invalid length of startup packet
2023-11-11 10:18:33.098 UTC 10.3.22.1(59998) [unknown] [10888] [unknown] 654f54f9.2a88 LOG:  connection received: host=10.3.22.1 port=59998
2023-11-11 10:18:33.104 UTC 10.3.22.1(59998) [unknown] [10888] [unknown] 654f54f9.2a88 LOG:  invalid length of startup packet
2023-11-11 10:18:33.107 UTC 10.3.22.1(60002) [unknown] [10889] [unknown] 654f54f9.2a89 LOG:  connection received: host=10.3.22.1 port=60002
2023-11-11 10:18:33.110 UTC 10.3.22.1(60002) [unknown] [10889] [unknown] 654f54f9.2a89 LOG:  invalid length of startup packet
2023-11-11 10:18:33.113 UTC 10.3.22.1(60006) [unknown] [10890] [unknown] 654f54f9.2a8a LOG:  connection received: host=10.3.22.1 port=60006
2023-11-11 10:18:34.567 UTC 10.3.22.1(60006) [unknown] [10890] [unknown] 654f54f9.2a8a LOG:  invalid length of startup packet
2023-11-11 10:18:44.332 UTC 10.3.22.1(60204) [unknown] [10902] [unknown] 654f5504.2a96 LOG:  connection received: host=10.3.22.1 port=60204
2023-11-11 10:18:44.335 UTC 10.3.22.1(60204) [unknown] [10902] [unknown] 654f5504.2a96 LOG:  invalid length of startup packet
2023-11-11 10:18:44.348 UTC 10.3.22.1(60208) [unknown] [10904] [unknown] 654f5504.2a98 LOG:  connection received: host=10.3.22.1 port=60208
2023-11-11 10:18:44.348 UTC 10.3.22.1(60208) [unknown] [10904] [unknown] 654f5504.2a98 LOG:  invalid length of startup packet
2023-11-11 10:18:44.349 UTC 10.3.22.1(60210) [unknown] [10905] [unknown] 654f5504.2a99 LOG:  connection received: host=10.3.22.1 port=60210
2023-11-11 10:18:44.350 UTC 10.3.22.1(60210) nessus [10905] postgresql_detect.nasl 654f5504.2a99 FATAL:  no pg_hba.conf entry for host "10.3.22.1", user "postgresql_detect.nasl", database "nessus", SSL off
2023-11-11 10:18:44.353 UTC 10.3.22.1(60212) [unknown] [10906] [unknown] 654f5504.2a9a LOG:  connection received: host=10.3.22.1 port=60212
2023-11-11 10:18:44.357 UTC 10.3.22.1(60212) [unknown] [10906] [unknown] 654f5504.2a9a LOG:  could not accept SSL connection: unknown protocol
2023-11-11 10:18:44.357 UTC 10.3.22.1(60212) [unknown] [10906] [unknown] 654f5504.2a9a HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.2.
2023-11-11 10:18:44.364 UTC 10.3.22.1(60214) [unknown] [10907] [unknown] 654f5504.2a9b LOG:  connection received: host=10.3.22.1 port=60214
2023-11-11 10:18:44.365 UTC 10.3.22.1(60214) [unknown] [10907] [unknown] 654f5504.2a9b LOG:  could not accept SSL connection: no shared cipher
2023-11-11 10:18:44.368 UTC 10.3.22.1(60216) [unknown] [10908] [unknown] 654f5504.2a9c LOG:  connection received: host=10.3.22.1 port=60216
2023-11-11 10:18:44.368 UTC 10.3.22.1(60216) [unknown] [10908] [unknown] 654f5504.2a9c LOG:  could not accept SSL connection: unknown protocol
2023-11-11 10:18:44.368 UTC 10.3.22.1(60216) [unknown] [10908] [unknown] 654f5504.2a9c HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.2.
2023-11-11 10:18:44.373 UTC 10.3.22.1(60218) [unknown] [10909] [unknown] 654f5504.2a9d LOG:  connection received: host=10.3.22.1 port=60218
2023-11-11 10:18:44.373 UTC 10.3.22.1(60218) [unknown] [10909] [unknown] 654f5504.2a9d LOG:  could not accept SSL connection: unknown protocol
2023-11-11 10:18:44.373 UTC 10.3.22.1(60218) [unknown] [10909] [unknown] 654f5504.2a9d HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.2.
2023-11-11 10:18:44.377 UTC 10.3.22.1(60220) [unknown] [10910] [unknown] 654f5504.2a9e LOG:  connection received: host=10.3.22.1 port=60220
2023-11-11 10:18:44.378 UTC 10.3.22.1(60220) [unknown] [10910] [unknown] 654f5504.2a9e LOG:  could not accept SSL connection: unknown protocol
2023-11-11 10:18:44.378 UTC 10.3.22.1(60220) [unknown] [10910] [unknown] 654f5504.2a9e HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.2.
2023-11-11 10:18:44.382 UTC 10.3.22.1(60222) [unknown] [10911] [unknown] 654f5504.2a9f LOG:  connection received: host=10.3.22.1 port=60222
2023-11-11 10:18:44.383 UTC 10.3.22.1(60222) [unknown] [10911] [unknown] 654f5504.2a9f LOG:  could not accept SSL connection: unknown protocol
2023-11-11 10:18:44.383 UTC 10.3.22.1(60222) [unknown] [10911] [unknown] 654f5504.2a9f HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.2.
2023-11-11 10:18:44.388 UTC 10.3.22.1(60224) [unknown] [10912] [unknown] 654f5504.2aa0 LOG:  connection received: host=10.3.22.1 port=60224
2023-11-11 10:18:44.388 UTC 10.3.22.1(60224) [unknown] [10912] [unknown] 654f5504.2aa0 LOG:  could not accept SSL connection: unknown protocol
2023-11-11 10:18:44.388 UTC 10.3.22.1(60224) [unknown] [10912] [unknown] 654f5504.2aa0 HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.2.
2023-11-11 10:18:44.393 UTC 10.3.22.1(60226) [unknown] [10913] [unknown] 654f5504.2aa1 LOG:  connection received: host=10.3.22.1 port=60226
2023-11-11 10:18:44.393 UTC 10.3.22.1(60226) [unknown] [10913] [unknown] 654f5504.2aa1 LOG:  could not accept SSL connection: unknown protocol
2023-11-11 10:18:44.393 UTC 10.3.22.1(60226) [unknown] [10913] [unknown] 654f5504.2aa1 HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.2.
2023-11-11 10:18:44.398 UTC 10.3.22.1(60228) [unknown] [10914] [unknown] 654f5504.2aa2 LOG:  connection received: host=10.3.22.1 port=60228
2023-11-11 10:18:44.398 UTC 10.3.22.1(60228) [unknown] [10914] [unknown] 654f5504.2aa2 LOG:  could not accept SSL connection: unknown protocol
2023-11-11 10:18:44.398 UTC 10.3.22.1(60228) [unknown] [10914] [unknown] 654f5504.2aa2 HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.2.
2023-11-11 10:18:44.403 UTC 10.3.22.1(60230) [unknown] [10915] [unknown] 654f5504.2aa3 LOG:  connection received: host=10.3.22.1 port=60230
2023-11-11 10:18:44.404 UTC 10.3.22.1(60230) [unknown] [10915] [unknown] 654f5504.2aa3 LOG:  could not accept SSL connection: unknown protocol
2023-11-11 10:18:44.404 UTC 10.3.22.1(60230) [unknown] [10915] [unknown] 654f5504.2aa3 HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.2.
2023-11-11 10:18:44.409 UTC 10.3.22.1(60232) [unknown] [10916] [unknown] 654f5504.2aa4 LOG:  connection received: host=10.3.22.1 port=60232
2023-11-11 10:18:45.411 UTC 10.3.22.1(60232) [unknown] [10916] [unknown] 654f5504.2aa4 LOG:  could not accept SSL connection: Connection reset by peer
2023-11-11 10:18:45.415 UTC 10.3.22.1(60236) [unknown] [10917] [unknown] 654f5505.2aa5 LOG:  connection received: host=10.3.22.1 port=60236
2023-11-11 10:18:46.438 UTC 10.3.22.1(60236) [unknown] [10917] [unknown] 654f5505.2aa5 LOG:  could not accept SSL connection: Connection reset by peer
2023-11-11 10:25:07.279 UTC 10.3.10.104(39734) wres8 [7402] wres_user8 654f4f9a.1cea LOG:  disconnection: session time: 0:29:28.669 user=wres_user8 database=wres8 host=10.3.10.104 port=39734
2023-11-11 10:25:07.283 UTC 10.3.10.104(39920) [unknown] [11997] [unknown] 654f5683.2edd LOG:  connection received: host=10.3.10.104 port=39920
2023-11-11 10:25:07.300 UTC 10.3.10.104(39920) wres8 [11997] wres_user8 654f5683.2edd LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:25:21.493 UTC 10.3.10.104(39730) wres8 [7401] wres_user8 654f4f8c.1ce9 LOG:  disconnection: session time: 0:29:56.729 user=wres_user8 database=wres8 host=10.3.10.104 port=39730
2023-11-11 10:25:21.495 UTC 10.3.10.104(39926) [unknown] [11998] [unknown] 654f5691.2ede LOG:  connection received: host=10.3.10.104 port=39926
2023-11-11 10:25:21.510 UTC 10.3.10.104(39926) wres8 [11998] wres_user8 654f5691.2ede LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:27:20.983 UTC 10.3.10.104(39746) wres8 [7768] wres_user8 654f502a.1e58 LOG:  disconnection: session time: 0:29:18.403 user=wres_user8 database=wres8 host=10.3.10.104 port=39746
2023-11-11 10:27:20.987 UTC 10.3.10.104(39936) [unknown] [12250] [unknown] 654f5708.2fda LOG:  connection received: host=10.3.10.104 port=39936
2023-11-11 10:27:21.003 UTC 10.3.10.104(39936) wres8 [12250] wres_user8 654f5708.2fda LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:27:23.398 UTC 10.3.10.104(39742) wres8 [7762] wres_user8 654f5008.1e52 LOG:  disconnection: session time: 0:29:54.864 user=wres_user8 database=wres8 host=10.3.10.104 port=39742
2023-11-11 10:27:23.400 UTC 10.3.10.104(39938) [unknown] [12251] [unknown] 654f570b.2fdb LOG:  connection received: host=10.3.10.104 port=39938
2023-11-11 10:27:23.415 UTC 10.3.10.104(39938) wres8 [12251] wres_user8 654f570b.2fdb LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:27:50.397 UTC 10.3.10.104(39752) wres8 [7889] wres_user8 654f503a.1ed1 LOG:  disconnection: session time: 0:29:32.336 user=wres_user8 database=wres8 host=10.3.10.104 port=39752
2023-11-11 10:27:50.399 UTC 10.3.10.104(39942) [unknown] [12360] [unknown] 654f5726.3048 LOG:  connection received: host=10.3.10.104 port=39942
2023-11-11 10:27:50.413 UTC 10.3.10.104(39942) wres8 [12360] wres_user8 654f5726.3048 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:39:13.487 UTC 10.3.10.104(39820) wres8 [9533] wres_user8 654f52e8.253d LOG:  disconnection: session time: 0:29:29.296 user=wres_user8 database=wres8 host=10.3.10.104 port=39820
2023-11-11 10:39:13.491 UTC 10.3.10.104(40010) [unknown] [14110] [unknown] 654f59d1.371e LOG:  connection received: host=10.3.10.104 port=40010
2023-11-11 10:39:13.507 UTC 10.3.10.104(40010) wres8 [14110] wres_user8 654f59d1.371e LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:39:20.998 UTC 10.3.10.104(39824) wres8 [9535] wres_user8 654f52ed.253f LOG:  disconnection: session time: 0:29:31.787 user=wres_user8 database=wres8 host=10.3.10.104 port=39824
2023-11-11 10:39:21.000 UTC 10.3.10.104(40014) [unknown] [14111] [unknown] 654f59d8.371f LOG:  connection received: host=10.3.10.104 port=40014
2023-11-11 10:39:21.015 UTC 10.3.10.104(40014) wres8 [14111] wres_user8 654f59d8.371f LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:39:22.306 UTC 10.3.10.104(39816) wres8 [9530] wres_user8 654f52da.253a LOG:  disconnection: session time: 0:29:51.995 user=wres_user8 database=wres8 host=10.3.10.104 port=39816
2023-11-11 10:39:22.307 UTC 10.3.10.104(40016) [unknown] [14112] [unknown] 654f59da.3720 LOG:  connection received: host=10.3.10.104 port=40016
2023-11-11 10:39:22.322 UTC 10.3.10.104(40016) wres8 [14112] wres_user8 654f59da.3720 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:39:24.936 UTC 10.3.10.104(39826) wres8 [9536] wres_user8 654f52ee.2540 LOG:  disconnection: session time: 0:29:34.090 user=wres_user8 database=wres8 host=10.3.10.104 port=39826
2023-11-11 10:39:24.938 UTC 10.3.10.104(40018) [unknown] [14114] [unknown] 654f59dc.3722 LOG:  connection received: host=10.3.10.104 port=40018
2023-11-11 10:39:24.953 UTC 10.3.10.104(40018) wres8 [14114] wres_user8 654f59dc.3722 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:39:39.524 UTC 10.3.10.104(39828) wres8 [9537] wres_user8 654f52f1.2541 LOG:  disconnection: session time: 0:29:46.037 user=wres_user8 database=wres8 host=10.3.10.104 port=39828
2023-11-11 10:39:39.527 UTC 10.3.10.104(40020) [unknown] [14115] [unknown] 654f59eb.3723 LOG:  connection received: host=10.3.10.104 port=40020
2023-11-11 10:39:39.543 UTC 10.3.10.104(40020) wres8 [14115] wres_user8 654f59eb.3723 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:39:46.890 UTC 10.3.10.104(39822) wres8 [9534] wres_user8 654f52ec.253e LOG:  disconnection: session time: 0:29:58.252 user=wres_user8 database=wres8 host=10.3.10.104 port=39822
2023-11-11 10:39:46.893 UTC 10.3.10.104(40024) [unknown] [14117] [unknown] 654f59f2.3725 LOG:  connection received: host=10.3.10.104 port=40024
2023-11-11 10:39:46.909 UTC 10.3.10.104(40024) wres8 [14117] wres_user8 654f59f2.3725 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:39:52.771 UTC 10.3.10.104(39830) wres8 [9538] wres_user8 654f52f3.2542 LOG:  disconnection: session time: 0:29:57.460 user=wres_user8 database=wres8 host=10.3.10.104 port=39830
2023-11-11 10:39:52.773 UTC 10.3.10.104(40026) [unknown] [14118] [unknown] 654f59f8.3726 LOG:  connection received: host=10.3.10.104 port=40026
2023-11-11 10:39:52.788 UTC 10.3.10.104(40026) wres8 [14118] wres_user8 654f59f8.3726 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:54:52.771 UTC 10.3.10.104(39920) wres8 [11997] wres_user8 654f5683.2edd LOG:  disconnection: session time: 0:29:45.488 user=wres_user8 database=wres8 host=10.3.10.104 port=39920
2023-11-11 10:54:52.775 UTC 10.3.10.104(40112) [unknown] [16313] [unknown] 654f5d7c.3fb9 LOG:  connection received: host=10.3.10.104 port=40112
2023-11-11 10:54:52.791 UTC 10.3.10.104(40112) wres8 [16313] wres_user8 654f5d7c.3fb9 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:55:17.458 UTC 10.3.10.104(39926) wres8 [11998] wres_user8 654f5691.2ede LOG:  disconnection: session time: 0:29:55.963 user=wres_user8 database=wres8 host=10.3.10.104 port=39926
2023-11-11 10:55:17.460 UTC 10.3.10.104(40118) [unknown] [16435] [unknown] 654f5d95.4033 LOG:  connection received: host=10.3.10.104 port=40118
2023-11-11 10:55:17.475 UTC 10.3.10.104(40118) wres8 [16435] wres_user8 654f5d95.4033 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:57:07.281 UTC 10.3.10.104(39936) wres8 [12250] wres_user8 654f5708.2fda LOG:  disconnection: session time: 0:29:46.294 user=wres_user8 database=wres8 host=10.3.10.104 port=39936
2023-11-11 10:57:07.285 UTC 10.3.10.104(40130) [unknown] [16576] [unknown] 654f5e03.40c0 LOG:  connection received: host=10.3.10.104 port=40130
2023-11-11 10:57:07.301 UTC 10.3.10.104(40130) wres8 [16576] wres_user8 654f5e03.40c0 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:57:19.092 UTC 10.3.10.104(39938) wres8 [12251] wres_user8 654f570b.2fdb LOG:  disconnection: session time: 0:29:55.693 user=wres_user8 database=wres8 host=10.3.10.104 port=39938
2023-11-11 10:57:19.094 UTC 10.3.10.104(40134) [unknown] [16686] [unknown] 654f5e0f.412e LOG:  connection received: host=10.3.10.104 port=40134
2023-11-11 10:57:19.110 UTC 10.3.10.104(40134) wres8 [16686] wres_user8 654f5e0f.412e LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 10:57:30.743 UTC 10.3.10.104(39942) wres8 [12360] wres_user8 654f5726.3048 LOG:  disconnection: session time: 0:29:40.344 user=wres_user8 database=wres8 host=10.3.10.104 port=39942
2023-11-11 10:57:30.745 UTC 10.3.10.104(40136) [unknown] [16794] [unknown] 654f5e1a.419a LOG:  connection received: host=10.3.10.104 port=40136
2023-11-11 10:57:30.759 UTC 10.3.10.104(40136) wres8 [16794] wres_user8 654f5e1a.419a LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 11:08:37.403 UTC 10.3.10.104(40014) wres8 [14111] wres_user8 654f59d8.371f LOG:  disconnection: session time: 0:29:16.403 user=wres_user8 database=wres8 host=10.3.10.104 port=40014
2023-11-11 11:08:37.408 UTC 10.3.10.104(40202) [unknown] [18439] [unknown] 654f60b5.4807 LOG:  connection received: host=10.3.10.104 port=40202
2023-11-11 11:08:37.424 UTC 10.3.10.104(40202) wres8 [18439] wres_user8 654f60b5.4807 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 11:08:56.321 UTC 10.3.10.104(40010) wres8 [14110] wres_user8 654f59d1.371e LOG:  disconnection: session time: 0:29:42.831 user=wres_user8 database=wres8 host=10.3.10.104 port=40010
2023-11-11 11:08:56.323 UTC 10.3.10.104(40206) [unknown] [18441] [unknown] 654f60c8.4809 LOG:  connection received: host=10.3.10.104 port=40206
2023-11-11 11:08:56.338 UTC 10.3.10.104(40206) wres8 [18441] wres_user8 654f60c8.4809 LOG:  connection authorized: user=wres_user8 database=wres8 SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES128-GCM-SHA256, bits=128, compression=off)
2023-11-11 11:09:05.970 UTC 10.3.10.104(40016) wres8 [14112] wres_user8 654f59da.3720 LOG:  disconnection: session time: 0:29:43.663 user=wres_user8 database=wres8 host=10.3.10.104 port=40016

Etc.

Attempts from host @10.3.22.1@ are weird.

2023-11-11 10:18:44.350 UTC 10.3.22.1(60210) nessus [10905] postgresql_detect.nasl 654f5504.2a99 FATAL:  no pg_hba.conf entry for host "10.3.22.1", user "postgresql_detect.nasl", database "nessus", SSL off

Then a bunch of successes and disconnects from host @10.3.10.104@.

No idea.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T15:03:21Z


The evaluation is now in a failed state.

I'll send Haksu (cc Mark) an email letting him know that the evaluation failed and offer to post it again for him. He might find it easier to repost it himself, since it's all scripted and the outputs are obtained automatically.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T15:04:20Z


Hank wrote:

Here is a nasty one later in the log:

Ah, OK, that one is potentially interesting, as it's longer than the query timeout and shortly before the time the connections were dropped.

2023-11-11 08:33:09.767 UTC

Reducing the query timeout might be a possible avenue to reproduce this, then. What does the postgres log say shortly after that query, though? It appears to have succeeded and has not been timed out.
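For reference, this is only a sketch of the generic JDBC mechanism for a per-statement timeout, with purely illustrative names; how the WRES query timeout is actually configured is a separate question:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class QueryTimeoutSketch
{
    /**
     * Runs a query that the JDBC driver will attempt to cancel if it runs
     * longer than the supplied timeout, surfacing an SQLException instead
     * of blocking indefinitely.
     */
    static void runWithTimeout( Connection connection, String sql, int timeoutSeconds )
            throws SQLException
    {
        try ( Statement statement = connection.createStatement() )
        {
            statement.setQueryTimeout( timeoutSeconds );

            try ( ResultSet results = statement.executeQuery( sql ) )
            {
                while ( results.next() )
                {
                    // Consume the rows; a long-running query would be cancelled
                    // by the driver once the timeout elapses.
                }
            }
        }
    }
}
```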

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T15:11:04Z


Attempts from host 10.3.22.1 are weird.

That's Tenable, apparently, based on a @dig -x@. Security scans. Could that be the issue?

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T15:12:38Z


The other IP address, #.#.#.104, is -prod03.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T15:13:05Z


Oh, right, makes sense. No, I don't think so. Postgres is rejecting those connections.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T15:16:52Z


The other connects/disconnects may be nominal too. Anyway, not personally willing to sink a ton of time into this currently. Will probably be a hard one to reproduce. That said, I don't think infinite retries are a good idea, as evidenced here.

Worth bearing in mind, though, that once we have a long-running worker server, db connection retries will change from a concern about a single evaluation to one about the worker server itself, so it may be worth performing some testing in that context.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-11-13T15:24:37Z


James wrote:

The other connects/disconnects may be nominal too. Anyway, not personally willing to sink a ton of time into this currently. Will probably be a hard one to reproduce. That said, I don't think infinite retries are a good idea, as evidenced here.

Worth bearing in mind, though, that once we have a long-running worker server, db connection retries will change from a concern about a single evaluation to one about the worker server itself, so it may be worth performing some testing in that context.

I think that we could put something like @database.getRawConnection()@ inside the heartbeat that Docker performs, to test the connection to the database. That way, if we lose the database connection, we will restart the container and terminate the connection. Although, based on the conversation above, it looks like we may have needed to shut down the database, and not just terminate the connection, to fix this.
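Something like the following sketch is what I have in mind; the class and method names are purely illustrative, not actual WRES code, and it assumes the result would be surfaced to the Docker healthcheck in some way (e.g., mapping @isAlive()@ to a zero/non-zero exit status):

```java
import java.sql.Connection;
import javax.sql.DataSource;

/**
 * Illustrative only: a liveness probe that validates a database connection,
 * which a container healthcheck could call so that Docker restarts the
 * worker when the database becomes unreachable.
 */
final class DatabaseLivenessProbe
{
    private final DataSource dataSource;

    DatabaseLivenessProbe( DataSource dataSource )
    {
        this.dataSource = dataSource;
    }

    /** Returns true when a connection can be obtained and validated quickly. */
    boolean isAlive()
    {
        try ( Connection connection = this.dataSource.getConnection() )
        {
            // isValid performs a lightweight round trip, bounded by the timeout in seconds.
            return connection.isValid( 5 );
        }
        catch ( Exception e )
        {
            return false;
        }
    }
}
```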

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T15:37:47Z


Lowering the priority. We still don't know why the raw connections could not be obtained, but we did seemingly spot an issue in the code that needs addressing. It would be good if this could be done for 6.17.

Time for me to focus on my upcoming meetings,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T15:45:51Z


Evan wrote:

James wrote:

The other connects/disconnects may be nominal too. Anyway, not personally willing to sink a ton of time into this currently. Will probably be a hard one to reproduce. That said, I don't think infinite retries are a good idea, as evidenced here.

Worth bearing in mind, though, that once we have a long-running worker server, db connection retries will change from a concern about a single evaluation to one about the worker server itself, so it may be worth performing some testing in that context.

I think that we could put something like @database.getRawConnection()@ inside the heartbeat that Docker performs, to test the connection to the database. That way, if we lose the database connection, we will restart the container and terminate the connection. Although, based on the conversation above, it looks like we may have needed to shut down the database, and not just terminate the connection, to fix this.

I don't think we can/should do that, because the database is an inner concern of the app. In fact, there is no requirement to use a database at all. What I meant was that, in the context of a long-running worker server, the database is instantiated by the worker server and not the core app, so the choreography is slightly different. That said, I'm not sure it makes any difference or creates any new routes for testing, because the lock manager is created and called by the core app, except when acquiring an exclusive lock to clean, which the worker server will be doing.

Either way, we need to ensure that retries follow a sane, finite loop.
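To illustrate, a minimal sketch of a bounded retry with backoff (hypothetical names, attempt limits, and timings; not the actual WRES connection code):

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

final class BoundedConnectionRetry
{
    private static final int MAX_ATTEMPTS = 5;
    private static final long INITIAL_BACKOFF_MILLIS = 1_000L;

    /**
     * Attempts to obtain a connection a bounded number of times with
     * exponential backoff, then fails instead of looping forever.
     */
    static Connection getConnectionWithRetry( DataSource dataSource ) throws SQLException
    {
        SQLException lastError = null;
        long backoff = INITIAL_BACKOFF_MILLIS;

        for ( int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++ )
        {
            try
            {
                return dataSource.getConnection();
            }
            catch ( SQLException e )
            {
                lastError = e;

                try
                {
                    Thread.sleep( backoff );
                }
                catch ( InterruptedException interrupted )
                {
                    Thread.currentThread().interrupt();
                    throw new SQLException( "Interrupted while retrying the connection.", interrupted );
                }

                backoff *= 2; // Exponential backoff between attempts.
            }
        }

        throw new SQLException( "Could not obtain a database connection after "
                                + MAX_ATTEMPTS + " attempts.", lastError );
    }
}
```

The point is simply that the attempt count and backoff are bounded, so an unresponsive database eventually yields a failed evaluation rather than a hung one.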

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T18:26:43Z


I think the cause for the failed evaluation this past weekend may be related to #122322. I'll post more in that ticket,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T18:32:48Z


Actually, since that is user-facing, I'll post what ITSG is reporting in ITSG-1315 here before formulating how we respond to the HEFS team in #122322.

Nikki just told me (and said he would post it to the IT ticket soon) that it was due to system backups. Full backups occur once a week on Saturdays apparently. That must bring the system to a (near) halt, which, on Nov 4, resulted in a 6-hour evaluation taking 20 hours (see #122322), and, on Nov 11, resulted in an evaluation failing to talk to the database and eventually freezing. At least, that's my theory.

I need to go to a doctor and am then taking the rest of the day off, but I wanted to post the explanation to give y'all something to think about.

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T18:36:15Z


Interesting, but that doesn't get us off the hook regarding the behavior reported here, i.e., an infinite loop of failed connections while the db is up and accepting connections (that is two issues, one of which we understand, one of which we don't).

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-13T18:37:52Z


( And it's certainly good to share this new info from IT with our HEFS users by way of partial explanation. )

epag commented 2 months ago

Original Redmine Comment Author Name: Mark (Mark) Original Date: 2023-11-13T19:02:20Z


WRES Team/Hank, Thanks for looking into our issue(s).
Hank, re. "Nikki just told me (and said he would post it to the IT ticket soon) that it was due to system backups. Full backups occur once a week on Saturdays apparently." Is that "Full backup" for all OWP VMs, or just a subset, which would include WRES, e.g. 'operational stack'?

Mark

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-13T20:25:24Z


(added Mark as a watcher, since I'm not sure he would see this response otherwise.)

Mark:

The backups are "full" in that the disks are backed up in full. This is to delineate from the incremental backups they perform other days, that I guess are only to pick up changes since the last full backup. I know WRES -prod and DMZ resources are included in that backup.

Nikki tells me, "Dev and TI are not backed up at all." However, I had thought you all had established backups for the CEE a while ago. Am I wrong on that? If so, you may want to double check with Nikki/ITSG if the CEE is backed up. Nikki could have taken my question as referring to only WRES -dev and -ti.

And the fact that -ti is not backed up is a bit concerning. We may need to look into that.

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-11-14T14:54:53Z


Has this infinite connection loop been addressed yet? Or should we move this to 6.18?

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-14T15:46:33Z


I don't believe so.

We can delay it to 6.18 and repurpose this ticket. We understand why the clean-and-switch wasn't working: an evaluation was stuck. That evaluation was stuck due to the infinite loop, which occurred because backups interfered with the responsiveness of the database and the WRES did not react well.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-12-15T16:31:46Z


Slipping to 6.19.

This is probably low-hanging fruit, right?

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-02-16T15:26:25Z


Delaying again. Should probably just place on the backlog.