NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, when I run an evaluation through the COWRES that completes successfully, I expect to be able to see output through the front-end (persister bottleneck) #148

Open epag opened 3 weeks ago

epag commented 3 weeks ago

Author Name: Hank (Hank) Original Redmine Issue: 122858, https://vlab.noaa.gov/redmine/issues/122858 Original Date: 2023-11-21 Original Assignee: Hank


The evaluation is another CNRFC D-Store evaluation: 8078728001050088409. The job succeeded, and I can see the output under @/mnt/wres_share/evaluations/wres_job_8078728001050088409@, but the front-end does not return any list in @.../job/8078728001050088409/output@.
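For reference, the output listing can be requested directly from the tasker REST endpoint mentioned above. A minimal sketch using Java's built-in HTTP client (the host below is a placeholder, not the actual COWRES address):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckJobOutput
{
    public static void main( String[] args ) throws Exception
    {
        // Placeholder host; substitute the actual COWRES tasker address.
        String jobId = "8078728001050088409";
        URI uri = URI.create( "https://cowres-host.example/job/" + jobId + "/output" );

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder( uri ).GET().build();
        HttpResponse<String> response = client.send( request, HttpResponse.BodyHandlers.ofString() );

        // An empty listing here, despite output existing on disk, is the symptom reported in this ticket.
        System.out.println( "Status: " + response.statusCode() );
        System.out.println( response.body() );
    }
}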

This ticket can be resolved once we determine why the output is not there and whether or not a fix is required, noting that the behavior may change with the worker-server deployment in 6.17.

Investigation is immediate; the fix is TBD. Guessing 6.18. Thanks,

Hank


Redmine related issue(s): 123199


epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T12:17:10Z


First, there are telltale signs of an evaluation that was interrupted. There are two subdirectories under @/mnt/wres_share/evaluations/wres_job_8078728001050088409@:

[Hank@nwcal-wres-prod02 wres_job_8078728001050088409]$ ls -lrt
total 6780
drwxrws---. 2       498 wres       6 Nov 17 15:28 wres_evaluation_GSYH12k2aSgRCY6PrESfl1dYRFc
drwxrws---. 2       498 wres   24576 Nov 18 07:37 wres_evaluation_ZqE_-C1moTBKiaIGjPC53FdXllQ
-rw-rw-r--. 1 Hank wres 6910657 Nov 21 11:57 wres_job_8078728001050088409.tgz

The tar is something I prepared for Anna in order to send her the output; I should have made that clear in the description. Because the user already has her output, I chose to report this as a developer ticket rather than in user support.

The other sign of two runs: the stdout logging shows the usual overwrite effect. What is odd is that the second run appears to have generated less stdout logging than the first.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T12:19:33Z


That oddity allowed me to see why the first run failed: there was a database connection issue, probably because of weekend backups. I forgot to warn Anna about that. The portion of the exception that I can see in the stdout logging is below.

Next step: figure out why the output list was not obtained. I'm wondering if it has to do with the output queue watcher. I'll look in the tasker log,

Hank

====================================================

    ... 8 common frames omitted
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:554)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-17T22:46:18.881+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@2267ea8
2023-11-17T22:46:28.892+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-17T22:46:29.893+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@2267ea8
2023-11-17T22:46:39.902+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-17T22:46:40.903+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@2267ea8
2023-11-17T22:46:50.906+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-17T22:46:51.907+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@2267ea8
2023-11-17T22:46:55.644+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: Connection to nwcal-wresdb-prod01.[host]:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:342)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:554)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-17T22:46:56.644+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@2267ea8
2023-11-17T22:46:56.645+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: Connection to nwcal-wresdb-prod01.[host]:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:342)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:554)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
    at org.postgresql.core.PGStream.<init>(PGStream.java:98)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    ... 16 common frames omitted
2023-11-17T22:46:57.646+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@2267ea8
2023-11-17T22:46:57.669+0000 WARN DatabaseLockManagerPostgres Exception while managing connections: 
java.lang.IllegalStateException: Could not get a raw database connection.
    at wres.io.database.Database.getRawConnection(Database.java:186)
    at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
    at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: FATAL: the database system is starting up
    at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
    at org.postgresql.Driver.makeConnection(Driver.java:443)
    at org.postgresql.Driver.connect(Driver.java:297)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
    at wres.io.database.Database.getRawConnection(Database.java:331)
    at wres.io.database.Database.getRawConnection(Database.java:182)
    ... 8 common frames omitted
epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T12:36:35Z


Here is the start of the job:

2023-11-17T15:28:08.047+0000 [qtp1777381620-13166] INFO wres.tasker.WresJob - ==========> REQUEST POSTED: verb = 'execute'; postInput = false; additional arguments = '[]'.
2023-11-17T15:28:08.061+0000 [JobResults Thread 204] INFO wres.tasker.JobResults - Watching the queue job.8078728001050088409.STDOUT for STDOUT logging.
2023-11-17T15:28:08.062+0000 [JobResults Thread 202] INFO wres.tasker.JobOutputWatcher - Watching the queue job.8078728001050088409.output for job output.
2023-11-17T15:28:08.070+0000 [JobResults Thread 205] INFO wres.tasker.JobStatusWatcher - Watching the queue job.8078728001050088409.status for status information on the evaluation.
2023-11-17T15:28:08.071+0000 [JobResults Thread 205] INFO wres.tasker.JobStatusWatcher - Now waiting for messages in the queue job.8078728001050088409.status.
2023-11-17T15:28:08.073+0000 [JobResults Thread 201] INFO wres.tasker.JobResults - Watching queue job.8078728001050088409.exitCode for the exit code result of the evaluation.
2023-11-17T15:28:08.073+0000 [JobResults Thread 201] INFO wres.tasker.JobResults - Looking for exit code on topic job.8078728001050088409.exitCode
2023-11-17T15:28:08.106+0000 [JobResults Thread 203] INFO wres.tasker.JobResults - Watching the queue job.8078728001050088409.STDERR for STDERR logging.
2023-11-17T15:28:08.107+0000 [qtp1777381620-13166] INFO wres.tasker.WresJob - Sent a message to queue 'wres.job' with properties '#contentHeader<basic>(content-type=null, content-encoding=null, headers=null, delivery-mode=2, priority=0, correlation-id=8078728001050088409, reply-to=wres.job.status, expiration=null, message-id=null, timestamp=null, type=null, user-id=null, app-id=null, cluster-id=null)'
2023-11-17T15:28:08.110+0000 [qtp1777381620-13166] INFO wres.tasker.WresJob - For verb EXECUTE, the declaration message was sent with job id 8078728001050088409, priority 0, and database host='nwcal-wresdb-prod01.[host]', port='', and name='wres8'. There 0 jobs preceding it in the queue.
2023-11-17T15:28:08.110+0000 [qtp1777381620-13166] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "POST /job HTTP/1.1" 201 172 "-" "python-requests/2.31.0"
2023-11-17T15:28:08.329+0000 [qtp1777381620-13166] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "GET /job/8078728001050088409/stdout HTTP/1.1" 200 20 "-" "python-requests/2.31.0"
2023-11-17T15:28:08.340+0000 [qtp1777381620-13166] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "GET /job/8078728001050088409/status HTTP/1.1" 200 31 "-" "python-requests/2.31.0"
... SNIP ...

Then something broke down around 23:24Z on Nov 17:

...
2023-11-17T23:24:28.355+0000 [RabbitMQ Error On Write Thread] WARN com.rabbitmq.client.impl.ForgivingExceptionHandler - An unexpected connection driver error occurred (Exception message: Connection reset by peer)
2023-11-17T23:24:28.361+0000 [RabbitMQ Error On Write Thread] WARN com.rabbitmq.client.impl.ForgivingExceptionHandler - An unexpected connection driver error occurred (Exception message: Connection reset by peer)
Stopping the wres-tasker. Bye bye...
Starting the wres-tasker...
2023-11-17T23:29:00.330+0000 [main] INFO wres.tasker.WresJob - Admin token read from system properties and hashed successfully.
2023-11-17T23:29:00.473+0000 [main] WARN wres.messages.BrokerHelper - Trusting alternative Certificate Authority at '/wres_secrets/trustedCertificateAuthorities.jks'
2023-11-17T23:29:00.576+0000 [main] INFO wres.tasker.WresJob - Redis host specified: persister, using redis at redis://persister:6379
2023-11-17T23:29:02.610+0000 [main] INFO org.redisson.Version - Redisson 3.18.0
2023-11-17T23:29:02.891+0000 [redisson-netty-2-7] INFO org.redisson.connection.pool.MasterPubSubConnectionPool - 1 connections initialized for persister/172.19.254.137:6379
2023-11-17T23:29:02.976+0000 [redisson-netty-2-51] INFO org.redisson.connection.pool.MasterConnectionPool - 24 connections initialized for persister/172.19.254.137:6379
2023-11-17T23:29:06.996+0000 [main] INFO wres.tasker.JobResults - Found job 8078728001050088409 in the redis map, which has status IN_PROGRESS.  Setting up watchers to watch it.
2023-11-17T23:29:07.821+0000 [main] WARN wres.tasker.JobResults - Timed out while waiting for job feedback watchers to bind for job 8078728001050088409; aborting watching that job.
java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:547)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
    at com.rabbitmq.client.impl.SocketFrameHandlerFactory.create(SocketFrameHandlerFactory.java:61)
    at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:69)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.init(AutorecoveringConnection.java:165)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1242)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1198)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1156)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1321)
    at wres.tasker.JobResults.getConnection(JobResults.java:282)
    at wres.tasker.JobResults.watchForJobFeedback(JobResults.java:912)
    at wres.tasker.JobResults.<init>(JobResults.java:230)
    at wres.tasker.WresJob.<clinit>(WresJob.java:227)
    at wres.tasker.Tasker.main(Tasker.java:134)

The tasker is then shut down and restarted, but it continues to have issues connecting to the broker, so it shuts down and restarts again:

2023-11-17T23:29:07.917+0000 [main] ERROR wres.tasker.Tasker - Connectivity failure. Shutting down and exiting.
wres.tasker.WresJob$ConnectivityException: Failed to connect to broker at broker:5671
    at wres.tasker.WresJob.getWresJob(WresJob.java:251)
    at wres.tasker.Tasker.main(Tasker.java:138)
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:547)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
    at com.rabbitmq.client.impl.SocketFrameHandlerFactory.create(SocketFrameHandlerFactory.java:61)
    at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:69)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.init(AutorecoveringConnection.java:165)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1242)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1198)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1156)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1321)
    at wres.tasker.WresJob.getWresJob(WresJob.java:238)
    ... 1 common frames omitted
Starting the wres-tasker...
2023-11-17T23:29:10.141+0000 [main] INFO wres.tasker.WresJob - Admin token read from system properties and hashed successfully.
2023-11-17T23:29:10.225+0000 [main] WARN wres.messages.BrokerHelper - Trusting alternative Certificate Authority at '/wres_secrets/trustedCertificateAuthorities.jks'
2023-11-17T23:29:10.235+0000 [main] INFO wres.tasker.WresJob - Redis host specified: persister, using redis at redis://persister:6379
2023-11-17T23:29:10.394+0000 [main] INFO org.redisson.Version - Redisson 3.18.0
2023-11-17T23:29:10.610+0000 [redisson-netty-2-6] INFO org.redisson.connection.pool.MasterPubSubConnectionPool - 1 connections initialized for persister/172.19.254.137:6379
2023-11-17T23:29:10.674+0000 [redisson-netty-2-52] INFO org.redisson.connection.pool.MasterConnectionPool - 24 connections initialized for persister/172.19.254.137:6379
2023-11-17T23:29:11.527+0000 [main] INFO wres.tasker.JobResults - Found job 8078728001050088409 in the redis map, which has status IN_PROGRESS.  Setting up watchers to watch it.
2023-11-17T23:29:12.002+0000 [main] WARN wres.tasker.JobResults - Timed out while waiting for job feedback watchers to bind for job 8078728001050088409; aborting watching that job.
java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:547)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
    at com.rabbitmq.client.impl.SocketFrameHandlerFactory.create(SocketFrameHandlerFactory.java:61)
    at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:69)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.init(AutorecoveringConnection.java:165)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1242)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1198)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1156)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1321)
    at wres.tasker.JobResults.getConnection(JobResults.java:282)
    at wres.tasker.JobResults.watchForJobFeedback(JobResults.java:912)
    at wres.tasker.JobResults.<init>(JobResults.java:230)
    at wres.tasker.WresJob.<clinit>(WresJob.java:227)
    at wres.tasker.Tasker.main(Tasker.java:134)
2023-11-17T23:29:12.199+0000 [main] ERROR wres.tasker.Tasker - Connectivity failure. Shutting down and exiting.
wres.tasker.WresJob$ConnectivityException: Failed to connect to broker at broker:5671
...

I'm assuming those shutdowns and restarts are some sort of Docker healthcheck result? Anyway, this continues a few more times until the tasker is eventually able to reconnect to the broker:

...
Starting the wres-tasker...
2023-11-17T23:30:28.892+0000 [main] INFO wres.tasker.WresJob - Admin token read from system properties and hashed successfully.
2023-11-17T23:30:28.975+0000 [main] WARN wres.messages.BrokerHelper - Trusting alternative Certificate Authority at '/wres_secrets/trustedCertificateAuthorities.jks'
2023-11-17T23:30:28.985+0000 [main] INFO wres.tasker.WresJob - Redis host specified: persister, using redis at redis://persister:6379
2023-11-17T23:30:29.141+0000 [main] INFO org.redisson.Version - Redisson 3.18.0
2023-11-17T23:30:29.353+0000 [redisson-netty-2-6] INFO org.redisson.connection.pool.MasterPubSubConnectionPool - 1 connections initialized for persister/172.19.254.137:6379
2023-11-17T23:30:29.417+0000 [redisson-netty-2-52] INFO org.redisson.connection.pool.MasterConnectionPool - 24 connections initialized for persister/172.19.254.137:6379
2023-11-17T23:30:30.249+0000 [main] INFO wres.tasker.JobResults - Found job 8078728001050088409 in the redis map, which has status IN_PROGRESS.  Setting up watchers to watch it.
2023-11-17T23:30:30.998+0000 [JobResults Thread 3] INFO wres.tasker.JobOutputWatcher - Watching the queue job.8078728001050088409.output for job output.
2023-11-17T23:30:30.998+0000 [JobResults Thread 2] INFO wres.tasker.JobResults - Watching the queue job.8078728001050088409.STDERR for STDERR logging.
2023-11-17T23:30:30.999+0000 [JobResults Thread 4] INFO wres.tasker.JobStatusWatcher - Watching the queue job.8078728001050088409.status for status information on the evaluation.
2023-11-17T23:30:30.999+0000 [JobResults Thread 5] INFO wres.tasker.JobResults - Watching queue job.8078728001050088409.exitCode for the exit code result of the evaluation.
2023-11-17T23:30:31.000+0000 [JobResults Thread 1] INFO wres.tasker.JobResults - Watching the queue job.8078728001050088409.STDOUT for STDOUT logging.
2023-11-17T23:30:31.028+0000 [JobResults Thread 4] INFO wres.tasker.JobStatusWatcher - Now waiting for messages in the queue job.8078728001050088409.status.
2023-11-17T23:30:31.033+0000 [JobResults Thread 5] INFO wres.tasker.JobResults - Looking for exit code on topic job.8078728001050088409.exitCode
2023-11-17T23:30:31.149+0000 [main] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.132:5671
...

But note that this log message indicates the tasker can watch the output queue again:

2023-11-17T23:30:30.998+0000 [JobResults Thread 3] INFO wres.tasker.JobOutputWatcher - Watching the queue job.8078728001050088409.output for job output.

Log messages later on confirm that the job was restarted and had to overwrite the earlier stdout entries (I've never noticed these before, but it's kinda cool to see confirmation of the stdout overwrites):

2023-11-17T23:31:22.846+0000 [JobResults Thread 1] WARN wres.tasker.JobMetadata - Overwrote stdout index=3 old value=2023-11-17T15:28:10.120+0000 INFO Main WRES version 20231030-dd12c27 with new value=2023-11-17T23:31:22.841+0000 INFO Main WRES version 20231030-dd12c27
2023-11-17T23:31:22.867+0000 [JobResults Thread 1] WARN wres.tasker.JobMetadata - Overwrote stdout index=4 old value=2023-11-17T15:28:10.133+0000 INFO Main Processors: 8; Max Memory: 8192MiB; Free Memory: 8110MiB; Total Memory: 8192MiB; WRES System Settings: SystemSettings[databaseConfiguration=DatabaseSettings[jdbcUrl=<null>,host=nwcal-wresdb-prod01.[host],username=wres_user8,port=5432,databaseName=wres8,databaseType=postgresql,certificateFileToTrust=/wres_secrets/DODSWCA_66.pem,maxPoolSize=13,maxHighPriorityPoolSize=5,maxIdleTime=80,useSSL=true,validateSSL=true,queryTimeout=18000,connectionTimeoutMs=18000000],maximumThreadCount=7,poolObjectLifespan=30000,fetchSize=50000,maximumCopies=10000,netcdfCachePeriod=90,minimumCachedNetcdf=100,maximumCachedNetcdf=200,hardNetcdfCacheLimit=0,netcdfStorePath=systests/data/,maximumReadThreads=7,maximumIngestThreads=10,maximumArchiveThreads=5,maximumWebClientThreads=6,maximumNwmIngestThreads=6,maximumPoolThreads=6,maximumThresholdThreads=6,maximumMetricThreads=3,maximumProductThreads=5,featureBatchThreshold=10,featureBatchSize=1]; Java System Properties: com.redhat.fips; file.encoding=UTF-8; java.class.version=61.0; java.home=/usr/lib/jvm/java-17-openjdk-17.0.9.0.9-2.el8.x86_64; java.io.tmpdir=/mnt/wres_share/evaluations; java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib; java.runtime.name=OpenJDK Runtime Environment; java.runtime.version=17.0.9+9-LTS; java.util.logging.config.class=wres.system.logging.JavaUtilLoggingRedirector; java.util.logging.config.file=logging.properties; java.version=17.0.9; java.version.date=2023-10-17; java.vm.info=mixed mode, sharing; java.vm.name=OpenJDK 64-Bit Server VM; java.vm.vendor=Red Hat, Inc.; java.vm.version=17.0.9+9-LTS; jdk.debug=release; jdk.jfr.repository=/mnt/wres_share/heap_dumps/wres/1d6a2d333a40/2023_11_17_15_28_09_28220; native.encoding=UTF-8; org.jboss.logging.provider=slf4j; os.arch=amd64; os.name=Linux; os.version=3.10.0-1160.102.1.el7.x86_64; ucar.unidata.io.http.httpBufferSize=200000; ucar.unidata.io.http.maxReadCacheSize=200000; user.dir=/opt/wres-worker-20231024-c3ef398; user.home=/container_home; user.language=en; user.name=wres_docker; user.timezone=UTC; wres.attemptToMigrate=false; wres.certificateFileToTrust=/wres_secrets/DODSWCA_66.pem; wres.databaseHost=nwcal-wresdb-prod01.[host]; wres.databaseName=wres8; wres.eventsBrokerAddress=eventsbroker; wres.eventsBrokerPort=5673; wres.externalGraphics=true; wres.featureBatchSize=1; wres.jobId=wres_job_8078728001050088409; wres.maxPoolSize=13; wres.startBroker=false; wres.useDatabase=true; wres.username=wres_user8; wres.wrdsCertificateFileToTrust=/wres_secrets/DODSWCA_60.pem with new value=2023-11-17T23:31:22.863+0000 INFO Main Processors: 8; Max Memory: 8192MiB; Free Memory: 8110MiB; Total Memory: 8192MiB; WRES System Settings: 
SystemSettings[databaseConfiguration=DatabaseSettings[jdbcUrl=<null>,host=nwcal-wresdb-prod01.[host],username=wres_user8,port=5432,databaseName=wres8,databaseType=postgresql,certificateFileToTrust=/wres_secrets/DODSWCA_66.pem,maxPoolSize=13,maxHighPriorityPoolSize=5,maxIdleTime=80,useSSL=true,validateSSL=true,queryTimeout=18000,connectionTimeoutMs=18000000],maximumThreadCount=7,poolObjectLifespan=30000,fetchSize=50000,maximumCopies=10000,netcdfCachePeriod=90,minimumCachedNetcdf=100,maximumCachedNetcdf=200,hardNetcdfCacheLimit=0,netcdfStorePath=systests/data/,maximumReadThreads=7,maximumIngestThreads=10,maximumArchiveThreads=5,maximumWebClientThreads=6,maximumNwmIngestThreads=6,maximumPoolThreads=6,maximumThresholdThreads=6,maximumMetricThreads=3,maximumProductThreads=5,featureBatchThreshold=10,featureBatchSize=1]; Java System Properties: com.redhat.fips; file.encoding=UTF-8; java.class.version=61.0; java.home=/usr/lib/jvm/java-17-openjdk-17.0.9.0.9-2.el8.x86_64; java.io.tmpdir=/mnt/wres_share/evaluations; java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib; java.runtime.name=OpenJDK Runtime Environment; java.runtime.version=17.0.9+9-LTS; java.util.logging.config.class=wres.system.logging.JavaUtilLoggingRedirector; java.util.logging.config.file=logging.properties; java.version=17.0.9; java.version.date=2023-10-17; java.vm.info=mixed mode, sharing; java.vm.name=OpenJDK 64-Bit Server VM; java.vm.vendor=Red Hat, Inc.; java.vm.version=17.0.9+9-LTS; jdk.debug=release; jdk.jfr.repository=/mnt/wres_share/heap_dumps/wres/ea9f33e18801/2023_11_17_23_31_04_58; native.encoding=UTF-8; org.jboss.logging.provider=slf4j; os.arch=amd64; os.name=Linux; os.version=3.10.0-1160.102.1.el7.x86_64; ucar.unidata.io.http.httpBufferSize=200000; ucar.unidata.io.http.maxReadCacheSize=200000; user.dir=/opt/wres-worker-20231024-c3ef398; user.home=/container_home; user.language=en; user.name=wres_docker; user.timezone=UTC; wres.attemptToMigrate=false; wres.certificateFileToTrust=/wres_secrets/DODSWCA_66.pem; wres.databaseHost=nwcal-wresdb-prod01.[host]; wres.databaseName=wres8; wres.eventsBrokerAddress=eventsbroker; wres.eventsBrokerPort=5673; wres.externalGraphics=true; wres.featureBatchSize=1; wres.jobId=wres_job_8078728001050088409; wres.maxPoolSize=13; wres.startBroker=false; wres.useDatabase=true; wres.username=wres_user8; wres.wrdsCertificateFileToTrust=/wres_secrets/DODSWCA_60.pem
2023-11-17T23:31:22.894+0000 [JobResults Thread 1] WARN wres.tasker.JobMetadata - Overwrote stdout index=5 old value=2023-11-17T15:28:10.158+0000 INFO Main Running function 'execute' with arguments '[label: CNRFC - NWM vs RSA Streamflow...]'. with new value=2023-11-17T23:31:22.889+0000 INFO Main Running function 'execute' with arguments '[label: CNRFC - NWM vs RSA Streamflow...]'.
2023-11-17T23:31:22.895+0000 [JobResults Thread 1] WARN wres.tasker.JobMetadata - Overwrote stdout index=6 old value=2023-11-17T15:28:10.161+0000 INFO Main Process: 28220 with new value=2023-11-17T23:31:22.892+0000 INFO Main Process: 58
2023-11-17T23:31:22.901+0000 [JobResults Thread 1] WARN wres.tasker.JobMetadata - Overwrote stdout index=7 old value=2023-11-17T15:28:10.162+0000 INFO SystemSettings Creating a database connection pool with 13 connections... with new value=2023-11-17T23:31:22.899+0000 INFO SystemSettings Creating a database connection pool with 13 connections...
2023-11-17T23:31:22.906+0000 [JobResults Thread 1] WARN wres.tasker.JobMetadata - Overwrote stdout index=8 old value=2023-11-17T15:28:10.168+0000 INFO HikariDataSource HikariPool-1 - Starting... with new value=2023-11-17T23:31:22.904+0000 INFO HikariDataSource HikariPool-1 - Starting...
...
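For context, those "Overwrote stdout index=N" warnings are what you would expect from an index-keyed stdout map being repopulated when the job restarts from index zero. A minimal sketch of that behavior (hypothetical names; not the actual wres.tasker.JobMetadata code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StdoutIndexSketch
{
    // Hypothetical stand-in for the persisted, index-keyed stdout map.
    private final Map<Integer, String> stdoutByIndex = new ConcurrentHashMap<>();

    public void offer( int index, String line )
    {
        String old = stdoutByIndex.put( index, line );

        if ( old != null && !old.equals( line ) )
        {
            // A restarted job logs from index zero again, so each line from the
            // second run replaces the first run's line at the same index.
            System.err.printf( "Overwrote stdout index=%d old value=%s with new value=%s%n",
                               index, old, line );
        }
    }
}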

Later on, the output queue is deleted prematurely after the watcher receives a @WriteRedisConnectionException@:

...
2023-11-18T03:39:32.246+0000 [JobResults Thread 1] INFO wres.tasker.JobMessageHelper - Still waiting for job 8078728001050088409 STDOUT
2023-11-18T03:39:33.439+0000 [JobResults Thread 3] INFO wres.tasker.JobMessageHelper - Still waiting for job 8078728001050088409 output
2023-11-18T03:39:33.454+0000 [JobResults Thread 2] INFO wres.tasker.JobMessageHelper - Still waiting for job 8078728001050088409 STDERR
2023-11-18T03:40:18.939+0000 [pool-1-thread-5] WARN wres.tasker.JobOutputConsumer - Failed to offer job output message [8, -55, 116, 18, 12, 8, -124, -28, -32, -86, 6, 16, -94, -76, -46, -65, 3, 24, 1] after PT30S, gave up.
2023-11-18T03:40:29.746+0000 [JobResults Thread 3] INFO wres.tasker.JobOutputWatcher - Deleting the queue job.8078728001050088409.output
2023-11-18T03:40:29.846+0000 [JobResults Thread 2] INFO wres.tasker.JobResults - Deleting the queue job.8078728001050088409.STDERR
2023-11-18T03:40:30.986+0000 [JobResults Thread 3] WARN wres.tasker.JobResults - A throwable was received while watching for messages in a queue.
org.redisson.client.WriteRedisConnectionException: Unable to write command into connection! Increase connection pool size. Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@449914835 [redisClient=[addr=redis://persister:6379], channel=[id: 0x4e760d89, L:/172.19.254.130:56620 ! R:persister/172.19.254.137:6379], currentCommand=null, usage=1], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); if value == false then return nil; end; local t, val = struct.unpack('dLc0', value); local expireDate = 92233720368547758; local expireDateScore = redis.call('zscore', KEYS[2], ARGV[2]); if expireDateScore ~= false then expireDate = tonumber(expireDateScore) end; if t ~= 0 then local expireIdle = redis.call('zscore', KEYS[3], ARGV[2]); if expireIdle ~= false then if tonumber(expireIdle) > tonumber(ARGV[1]) then redis.call('zadd', KEYS[3], t + tonumber(ARGV[1]), ARGV[2]); end; expireDate = math.min(expireDate, tonumber(expireIdle)) end; end; if expireDate <= tonumber(ARGV[1]) then return nil; end; local maxSize = tonumber(redis.call('hget', KEYS[5], 'max-size')); if maxSize ~= nil and maxSize ~= 0 then local mode = redis.call('hget', KEYS[5], 'mode'); if mode == false or mode == 'LRU' then redis.call('zadd', KEYS[4], tonumber(ARGV[1]), ARGV[2]); else redis.call('zincrby', KEYS[4], 1, ARGV[2]); end; end; return val; , 5, jobMetadataById, redisson__timeout__set:{jobMetadataById}, redisson__idle__set:{jobMetadataById}, redisson__map_cache__last_access__set:{jobMetadataById}, {jobMetadataById}:redisson_options, 1700278788439, PooledUnsafeDirectByteBuf(ridx: 0, widx: 22, cap: 256)] after 9 retry attempts
    at org.redisson.command.RedisExecutor.checkWriteFuture(RedisExecutor.java:325)
    at org.redisson.command.RedisExecutor.lambda$execute$2(RedisExecutor.java:169)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:877)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:940)
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1247)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: io.netty.channel.StacklessClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
2023-11-18T03:40:30.989+0000 [JobResults Thread 2] WARN wres.tasker.JobResults - A throwable was received while watching for messages in a queue.
org.redisson.client.WriteRedisConnectionException: Unable to write command into connection! Increase connection pool size. Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@1357698633 [redisClient=[addr=redis://persister:6379], channel=[id: 0x5b3843bf, L:/172.19.254.130:56614 ! R:persister/172.19.254.137:6379], currentCommand=null, usage=1], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); if value == false then return nil; end; local t, val = struct.unpack('dLc0', value); local expireDate = 92233720368547758; local expireDateScore = redis.call('zscore', KEYS[2], ARGV[2]); if expireDateScore ~= false then expireDate = tonumber(expireDateScore) end; if t ~= 0 then local expireIdle = redis.call('zscore', KEYS[3], ARGV[2]); if expireIdle ~= false then if tonumber(expireIdle) > tonumber(ARGV[1]) then redis.call('zadd', KEYS[3], t + tonumber(ARGV[1]), ARGV[2]); end; expireDate = math.min(expireDate, tonumber(expireIdle)) end; end; if expireDate <= tonumber(ARGV[1]) then return nil; end; local maxSize = tonumber(redis.call('hget', KEYS[5], 'max-size')); if maxSize ~= nil and maxSize ~= 0 then local mode = redis.call('hget', KEYS[5], 'mode'); if mode == false or mode == 'LRU' then redis.call('zadd', KEYS[4], tonumber(ARGV[1]), ARGV[2]); else redis.call('zincrby', KEYS[4], 1, ARGV[2]); end; end; return val; , 5, jobMetadataById, redisson__timeout__set:{jobMetadataById}, redisson__idle__set:{jobMetadataById}, redisson__map_cache__last_access__set:{jobMetadataById}, {jobMetadataById}:redisson_options, 1700278788454, PooledUnsafeDirectByteBuf(ridx: 0, widx: 22, cap: 256)] after 9 retry attempts
    at org.redisson.command.RedisExecutor.checkWriteFuture(RedisExecutor.java:325)
    at org.redisson.command.RedisExecutor.lambda$execute$2(RedisExecutor.java:169)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:877)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:940)
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1247)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: io.netty.channel.StacklessClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
2023-11-18T03:40:30.989+0000 [JobResults Thread 2] WARN wres.tasker.JobResults - A throwable was received while watching for messages in a queue.
org.redisson.client.WriteRedisConnectionException: Unable to write command into connection! Increase connection pool size. Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@1357698633 [redisClient=[addr=redis://persister:6379], channel=[id: 0x5b3843bf, L:/172.19.254.130:56614 ! R:persister/172.19.254.137:6379], currentCommand=null, usage=1], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); if value == false then return nil; end; local t, val = struct.unpack('dLc0', value); local expireDate = 92233720368547758; local expireDateScore = redis.call('zscore', KEYS[2], ARGV[2]); if expireDateScore ~= false then expireDate = tonumber(expireDateScore) end; if t ~= 0 then local expireIdle = redis.call('zscore', KEYS[3], ARGV[2]); if expireIdle ~= false then if tonumber(expireIdle) > tonumber(ARGV[1]) then redis.call('zadd', KEYS[3], t + tonumber(ARGV[1]), ARGV[2]); end; expireDate = math.min(expireDate, tonumber(expireIdle)) end; end; if expireDate <= tonumber(ARGV[1]) then return nil; end; local maxSize = tonumber(redis.call('hget', KEYS[5], 'max-size')); if maxSize ~= nil and maxSize ~= 0 then local mode = redis.call('hget', KEYS[5], 'mode'); if mode == false or mode == 'LRU' then redis.call('zadd', KEYS[4], tonumber(ARGV[1]), ARGV[2]); else redis.call('zincrby', KEYS[4], 1, ARGV[2]); end; end; return val; , 5, jobMetadataById, redisson__timeout__set:{jobMetadataById}, redisson__idle__set:{jobMetadataById}, redisson__map_cache__last_access__set:{jobMetadataById}, {jobMetadataById}:redisson_options, 1700278788454, PooledUnsafeDirectByteBuf(ridx: 0, widx: 22, cap: 256)] after 9 retry attempts
    at org.redisson.command.RedisExecutor.checkWriteFuture(RedisExecutor.java:325)
    at org.redisson.command.RedisExecutor.lambda$execute$2(RedisExecutor.java:169)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:877)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:940)
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1247)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: io.netty.channel.StacklessClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)

So there is a problem writing information to the persister? I think that's what it means. Let me look at the persister logs,
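The exception text itself points at the Redisson client's connection pool and retry settings for the persister. A minimal sketch of where such settings live, assuming Redisson's single-server configuration (the values are illustrative, not the tasker's actual configuration):

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class PersisterClientSketch
{
    public static void main( String[] args )
    {
        Config config = new Config();
        config.useSingleServer()
              .setAddress( "redis://persister:6379" )
              // Illustrative values only; the real tasker settings may differ.
              .setConnectionPoolSize( 64 )
              .setConnectionMinimumIdleSize( 24 )
              .setRetryAttempts( 9 )
              .setRetryInterval( 1500 )
              .setTimeout( 10000 );

        RedissonClient redisson = Redisson.create( config );
        redisson.shutdown();
    }
}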

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T12:42:05Z


Note the "Asynchronous AOF fsync is taking too long" message in this sequence after the persister is restarted (presumably for the same reason the tasker was restarted):

1:M 17 Nov 2023 23:24:27.969 # Redis is now ready to exit, bye bye...
1:C 17 Nov 2023 23:28:31.472 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 17 Nov 2023 23:28:31.472 # Redis version=6.2.14, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 17 Nov 2023 23:28:31.472 # Configuration loaded
1:M 17 Nov 2023 23:28:31.474 * monotonic clock: POSIX clock_gettime
1:M 17 Nov 2023 23:28:31.476 * Running mode=standalone, port=6379.
1:M 17 Nov 2023 23:28:31.476 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 17 Nov 2023 23:28:31.476 # Server initialized
1:M 17 Nov 2023 23:28:31.476 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 17 Nov 2023 23:28:31.521 * Reading RDB preamble from AOF file...
1:M 17 Nov 2023 23:28:31.521 * Loading RDB produced by version 6.2.14
1:M 17 Nov 2023 23:28:31.521 * RDB age 125986 seconds
1:M 17 Nov 2023 23:28:31.521 * RDB memory usage when created 1191.65 Mb
1:M 17 Nov 2023 23:28:31.521 * RDB has an AOF tail
1:M 17 Nov 2023 23:28:43.430 # Done loading RDB, keys loaded: 689, keys expired: 0.
1:M 17 Nov 2023 23:28:43.430 * Reading the remaining AOF tail...
1:M 17 Nov 2023 23:28:48.305 * DB loaded from append only file: 16.829 seconds
1:M 17 Nov 2023 23:28:48.305 * Ready to accept connections
1:M 18 Nov 2023 01:21:07.045 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
1:M 20 Nov 2023 22:44:18.024 * Starting automatic rewriting of AOF on 15% growth
1:M 20 Nov 2023 22:44:18.045 * Background append only file rewriting started by pid 76970
1:M 20 Nov 2023 22:44:29.363 * AOF rewrite child asks to stop sending diffs.
76970:C 20 Nov 2023 22:44:29.363 * Parent agreed to stop sending diffs. Finalizing AOF...
76970:C 20 Nov 2023 22:44:29.363 * Concatenating 0.71 MB of AOF diff received from parent.
76970:C 20 Nov 2023 22:44:29.374 * SYNC append only file rewrite performed
76970:C 20 Nov 2023 22:44:29.388 * AOF rewrite: 8 MB of memory used by copy-on-write
1:M 20 Nov 2023 22:44:29.437 * Background AOF rewrite terminated with success
1:M 20 Nov 2023 22:44:29.439 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
1:M 20 Nov 2023 22:44:29.442 * Background AOF rewrite finished successfully

Maybe the backups on the application machines slowed things down to the point where they broke down? The timing appears to be early Saturday morning.

I'm going to post a chat message to Nikki from ITSG to see if he can confirm the backup timing for the @nwcal-wres-prod02@.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T14:17:59Z


I contacted two folks in ITSG whom Nikki pointed me to, asking when the backups occurred.

Two problems occurred during the long run:

1. The tasker lost connection with the broker and went through a restart cycle until it was able to connect again. That starts with these messages:

2023-11-17T23:29:07.821+0000 [main] WARN wres.tasker.JobResults - Timed out while waiting for job feedback watchers to bind for job 8078728001050088409; aborting watching that job.
2023-11-17T23:29:07.917+0000 [main] ERROR wres.tasker.Tasker - Connectivity failure. Shutting down and exiting.
wres.tasker.WresJob$ConnectivityException: Failed to connect to broker at broker:5671

The restart cycle ends at

2023-11-17T23:30:31.149+0000 [main] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.132:5671

Based on those timestamps, that's roughly 80 seconds of not being able to talk to the broker.

2. The output queue watcher received an exception and the output queue was prematurely removed. That occurs hours later:

2023-11-18T03:40:18.939+0000 [pool-1-thread-5] WARN wres.tasker.JobOutputConsumer - Failed to offer job output message [8, -55, 116, 18, 12, 8, -124, -28, -32, -86, 6, 16, -94, -76, -46, -65, 3, 24, 1] after PT30S, gave up.
2023-11-18T03:40:29.746+0000 [JobResults Thread 3] INFO wres.tasker.JobOutputWatcher - Deleting the queue job.8078728001050088409.output
2023-11-18T03:40:29.846+0000 [JobResults Thread 2] INFO wres.tasker.JobResults - Deleting the queue job.8078728001050088409.STDERR
2023-11-18T03:40:30.986+0000 [JobResults Thread 3] WARN wres.tasker.JobResults - A throwable was received while watching for messages in a queue.

Nothing is reported in the persister log for the time of the output queue deletion.

I don't have an explanation for either occurrence, and focusing on the backup timing may be barking up the wrong tree.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T14:27:20Z


I confirmed that all containers lost their connection to the broker at about the same time, and all containers were restarted. Here is @deployment_worker_1@ on the entry machine:

2023-11-17T23:28:43.478+0000 [main] INFO wres.worker.Worker - Using broker at host 'broker', vhost 'wres', port '5671'
2023-11-17T23:28:48.020+0000 [main] WARN wres.messages.BrokerHelper - Trusting alternative Certificate Authority at '/wres_secrets/trustedCertificateAuthorities.jks'
2023-11-17T23:28:49.043+0000 [main] ERROR wres.worker.Worker - Checked exception while talking to the broker
java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:547)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
    at com.rabbitmq.client.impl.SocketFrameHandlerFactory.create(SocketFrameHandlerFactory.java:61)
    at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:69)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.init(AutorecoveringConnection.java:165)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1242)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1198)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1156)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1321)
    at wres.worker.Worker.main(Worker.java:90)
Exception in thread "main" java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:547)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
    at com.rabbitmq.client.impl.SocketFrameHandlerFactory.create(SocketFrameHandlerFactory.java:61)
    at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:69)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.init(AutorecoveringConnection.java:165)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1242)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1198)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1156)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:1321)
    at wres.worker.Worker.main(Worker.java:90)
Starting the wres-worker client...
2023-11-17T23:28:53.356+0000 [main] INFO wres.worker.Worker - Using broker at host 'broker', vhost 'wres', port '5671'
2023-11-17T23:28:53.610+0000 [main] WARN wres.messages.BrokerHelper - Trusting alternative Certificate Authority at '/wres_secrets/trustedCertificateAuthorities.jks'
2023-11-17T23:28:53.677+0000 [main] ERROR wres.worker.Worker - Checked exception while talking to the broker

Checking the broker logs,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T14:34:20Z


The broker received a SIGTERM:

...
2023-11-17 23:20:11.283178+00:00 [info] <0.24916.7> connection <0.24916.7> (172.19.254.134:38502 -> 172.19.254.136:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2023-11-17 23:20:11.283711+00:00 [info] <0.24916.7> closing AMQP connection <0.24916.7> (172.19.254.134:38502 -> 172.19.254.136:5671, vhost: 'wres', user: 'wres-tasker')
2023-11-17 23:24:27.912540+00:00 [notice] <0.61.0> SIGTERM received - shutting down
2023-11-17 23:24:27.912540+00:00 [notice] <0.61.0>
2023-11-17 23:24:28.086184+00:00 [warning] <0.815.0> closing AMQP connection <0.815.0> (10.3.10.104:35310 -> 172.19.254.136:5671, vhost: 'wres', user: 'wres-worker'):
2023-11-17 23:24:28.086184+00:00 [warning] <0.815.0> client unexpectedly closed TCP connection
2023-11-17 23:24:28.120742+00:00 [warning] <0.588.0> HTTP listener registry could not find context rabbitmq_prometheus_tls
2023-11-17 23:24:28.131157+00:00 [warning] <0.588.0> HTTP listener registry could not find context rabbitmq_management_tcp
2023-11-17 23:24:28.146541+00:00 [info] <0.230.0> Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping unregistration.
2023-11-17 23:24:28.146719+00:00 [info] <0.747.0> stopped TLS (SSL) listener on [::]:5671
2023-11-17 23:24:28.147532+00:00 [error] <0.806.0> Error on AMQP connection <0.806.0> (10.3.10.104:38502 -> 172.19.254.136:5671, vhost: 'wres', user: 'wres-worker', state: running), channel 0:
2023-11-17 23:24:28.147532+00:00 [error] <0.806.0>  operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2023-11-17 23:24:28.147528+00:00 [error] <0.780.0> Error on AMQP connection <0.780.0> (172.19.254.137:39366 -> 172.19.254.136:5671, vhost: 'wres', user: 'wres-worker', state: running), channel 0:
2023-11-17 23:24:28.147528+00:00 [error] <0.780.0>  operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2023-11-17 23:24:28.147518+00:00 [error] <0.774.0> Error on AMQP connection <0.774.0> (172.19.254.135:49968 -> 172.19.254.136:5671, vhost: 'wres', user: 'wres-worker', state: running), channel 0:
2023-11-17 23:24:28.147518+00:00 [error] <0.774.0>  operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2023-11-17 23:24:28.147658+00:00 [error] <0.2275.0> Error on AMQP connection <0.2275.0> (172.19.254.134:34802 -> 172.19.254.136:5671, vhost: 'wres', user: 'wres-tasker', state: running), channel 0:
2023-11-17 23:24:28.147658+00:00 [error] <0.2275.0>  operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2023-11-17 23:24:28.147613+00:00 [error] <0.818.0> Error on AMQP connection <0.818.0> (10.3.10.104:48722 -> 172.19.254.136:5671, vhost: 'wres', user: 'wres-worker', state: running), channel 0:
2023-11-17 23:24:28.147613+00:00 [error] <0.818.0>  operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2023-11-17 23:24:28.147653+00:00 [error] <0.2252.0> Error on AMQP connection <0.2252.0> (172.19.254.134:34800 -> 172.19.254.136:5671, vhost: 'wres', user: 'wres-tasker', state: running), channel 0:
2023-11-17 23:24:28.147653+00:00 [error] <0.2252.0>  operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2023-11-17 23:24:28.175278+00:00 [info] <0.27017.7> Closing all connections in vhost 'wres' on node 'rabbit@localhost' because the vhost is stopping
2023-11-17 23:24:28.175327+00:00 [info] <0.27018.7> Closing all connections in vhost '/' on node 'rabbit@localhost' because the vhost is stopping
...

And eventually it was restarted:

  Starting broker...2023-11-17 23:29:38.915847+00:00 [info] <0.230.0>

So there was a breakdown at around 2023-11-17 23:24 Z. What would have sent a SIGTERM signal to the broker?

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T14:39:23Z


The @kill@ command sends a SIGTERM by default. Not sure what else does.

So at about 6:24pm EST on Friday, a kill signal or something equivalent was sent to the broker. Why?

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T14:40:52Z


Anyway, that was for item 1 in #122858-5. Let me look more closely at item 2.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-21T14:41:47Z


This will be the docker stop command issued on an unhealthy container, which produces a sigterm. A sigterm is the default friendly request from one process to another to terminate, unlike a request from a user (sigint) or kernel (e.g., sigkill).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-21T14:42:32Z


( The broker has a docker healthcheck enabled. )

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-21T14:42:55Z


One of these healthcheck conditions will not have been met:

rabbitmqctl eval '
{ true, rabbit_app_booted_and_running } = { rabbit:is_booted(node()), rabbit_app_booted_and_running },
{ [], no_alarms } = { rabbit:alarms(), no_alarms },
[] /= rabbit_networking:active_listeners(),
rabbitmq_node_is_healthy.
' || exit 1
epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T14:49:37Z


James wrote:

This will be the docker stop command issued on an unhealthy container, which produces a sigterm. A sigterm is the default friendly request from one process to another to terminate, unlike a request from a user (sigint) or kernel (e.g., sigkill).

Oh. So a health check on the broker failed. I was starting to wonder if I had hopped on in a daze last Friday and ran @kill@ on it somehow. :)

It appears Docker, itself, was shut down and restarted; this is from @sudo /usr/bin/journalctl -u docker@:

-- Logs begin at Fri 2023-11-17 23:26:28 UTC, end at Tue 2023-11-21 14:44:29 UTC. --
Nov 17 23:27:39 nwcal-wres-prod02.[host] systemd[1]: Starting Docker Application Container Engine...
Nov 17 23:27:42 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:27:42.570285471Z" level=info msg="Starting up"
Nov 17 23:27:45 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:27:45.580695683Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Nov 17 23:28:11 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:11.237747403Z" level=info msg="Loading containers: start."
Nov 17 23:28:11 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:11.691019219Z" level=info msg="Firewalld: docker zone already exists, returning"
Nov 17 23:28:13 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:13.270881632Z" level=info msg="Firewalld: interface br-990d08bf2e8d already part of docker zone, returning"
Nov 17 23:28:13 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:13.487186965Z" level=info msg="Firewalld: interface br-b7f97a3b02d8 already part of docker zone, returning"
Nov 17 23:28:14 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:14.116061002Z" level=info msg="Firewalld: interface br-d203097415d7 already part of docker zone, returning"
Nov 17 23:28:14 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:14.345217804Z" level=info msg="Firewalld: interface docker0 already part of docker zone, returning"
Nov 17 23:28:15 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:15.043053521Z" level=info msg="Firewalld: interface docker0 already part of docker zone, returning"
Nov 17 23:28:29 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:29.265790335Z" level=info msg="Loading containers: done."
Nov 17 23:28:29 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:29.408569455Z" level=info msg="Docker daemon" commit=311b9ff graphdriver=overlay2 version=24.0.7
Nov 17 23:28:29 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:29.408920932Z" level=info msg="Daemon has completed initialization"
Nov 17 23:28:29 nwcal-wres-prod02.[host] dockerd[2311]: time="2023-11-17T23:28:29.839594986Z" level=info msg="API listen on /run/docker.sock

Well, look at that:

[Hank@nwcal-wres-prod02 ~]$ last reboot
reboot   system boot  3.10.0-1160.102. Fri Nov 17 23:26 - 14:47 (3+15:20)   
reboot   system boot  3.10.0-1160.102. Wed Nov  8 22:30 - 14:47 (12+16:16)  
reboot   system boot  3.10.0-1160.102. Wed Nov  1 21:49 - 14:47 (19+16:57)  

wtmp begins Wed Nov  1 06:55:34 2023

Per the email last week:

Good Afternoon, ITSG will be migrating patching days this week & next week. Patching will be done on Friday 11/17 for (Dev and Test) & Monday 11/20 for (Prod and Infra). The weekly detailed list will be sent out as normal.

Production was not supposed to be rebooted. I just spent a couple of hours of my time investigating a problem that was not a problem, but a premature system reboot. Thanks, ITSG!

Ugh,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-21T15:02:20Z


HEFS meeting...

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T15:07:53Z


Turns out the reboots were buried in the Linux Change History document attached to the 11/17 maintenance email. I ignored that, because I figured the production reboots would not happen until Monday. Looks like I was wrong.

In the HEFS meeting, but will continue to investigate issue 2 later.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T15:21:49Z


Reflecting reality,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T15:29:07Z


For item 2, the problem reported is related to the connection pool size:

org.redisson.client.WriteRedisConnectionException: Unable to write command into connection! Increase connection pool size. Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@449914835 [redisClient=[addr=redis://persister:6379], channel=[id: 0x4e760d89, L:/172.19.254.130:56620 ! R:persister/172.19.254.137:6379], currentCommand=null, usage=1], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); if value == false then return nil; end; local t, val = struct.unpack('dLc0', value); local expireDate = 92233720368547758; local expireDateScore = redis.call('zscore', KEYS[2], ARGV[2]); if expireDateScore ~= false then expireDate = tonumber(expireDateScore) end; if t ~= 0 then local expireIdle = redis.call('zscore', KEYS[3], ARGV[2]); if expireIdle ~= false then if tonumber(expireIdle) > tonumber(ARGV[1]) then redis.call('zadd', KEYS[3], t + tonumber(ARGV[1]), ARGV[2]); end; expireDate = math.min(expireDate, tonumber(expireIdle)) end; end; if expireDate <= tonumber(ARGV[1]) then return nil; end; local maxSize = tonumber(redis.call('hget', KEYS[5], 'max-size')); if maxSize ~= nil and maxSize ~= 0 then local mode = redis.call('hget', KEYS[5], 'mode'); if mode == false or mode == 'LRU' then redis.call('zadd', KEYS[4], tonumber(ARGV[1]), ARGV[2]); else redis.call('zincrby', KEYS[4], 1, ARGV[2]); end; end; return val; , 5, jobMetadataById, redisson__timeout__set:{jobMetadataById}, redisson__idle__set:{jobMetadataById}, redisson__map_cache__last_access__set:{jobMetadataById}, {jobMetadataById}:redisson_options, 1700278788439, PooledUnsafeDirectByteBuf(ridx: 0, widx: 22, cap: 256)] after 9 retry attempts

Where is the connection pool size controlled for Redis? Looking,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T15:50:15Z


I don't really have an explanation for the @WriteRedisConnectionException@ and connection pool size issue. We appear to be using the default connection pool size of 64. That appears to have been the only evaluation ongoing at the time of the issue. None of the other container logs reveal any issues.
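
For reference, here is a minimal sketch of where that knob lives when Redisson is configured programmatically; the class name and values below are illustrative, not the actual tasker wiring:

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

// Illustrative only: shows where Redisson's pool size is set, not the tasker's actual configuration.
public class PersisterClientSketch
{
    static RedissonClient createClient()
    {
        Config config = new Config();
        config.useSingleServer()
              .setAddress( "redis://persister:6379" ) // the persister container
              .setConnectionPoolSize( 64 )            // Redisson's default of 64, per the note above
              .setRetryAttempts( 9 );                 // guess: the logged exception gave up "after 9 retry attempts"
        return Redisson.create( config );
    }
}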

Anyone else have ideas of where to look?

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-21T16:00:13Z


I think the connection pool size is probably a red herring, but I also recall that Jesse looked into this in the past (I don't recall the tickets or magic phrases).

I think this all concerns robustness of the service under sad paths, such as system reboots, right? If so, it is not unimportant, but it is not as important as robustness under normal operation/happy paths, which might be reflected in the priority.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T16:03:36Z


Well, it does not appear to be tied to the reboots, which occurred hours earlier. The lack of detailed logging makes it very hard for me to see what is going on. Specifically, we see a complaint in the tasker log when trying to write to the persister, but the persister includes no meaningful messages at that time.

But, yes, something was not quite right here, and I can't tell what. And while it may happen again, it appears to be a rare occurrence. So there is that.

FYI... Anna posted her evaluation again when she saw that the output was unavailable for her earlier run. That was before I shared the output with her. After lunch, I'll see if I can find that evaluation and see if there are issues with it.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-21T16:08:18Z


Hmmm, OK. Some more logging, then? Unsure what options may be available in the @redis.conf@.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T19:20:25Z


Anna's second posting of the evaluation encountered the same problem. The job is 6999219676122400010. The stdout logging is partial. The status is success.

I have student-teacher conferences coming up, but hope to do some basic research after that. I also have some GUI testing to do.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T19:38:26Z


The same failure to persist the outputs occurs:

2023-11-20T23:07:04.480+0000 [JobResults Thread 131] INFO wres.tasker.JobMessageHelper - Still waiting for job 6999219676122400010 STDOUT
2023-11-20T23:07:04.801+0000 [JobResults Thread 126] INFO wres.tasker.JobMessageHelper - Still waiting for job 96485800192253237 STDOUT
2023-11-20T23:07:42.014+0000 [pool-1-thread-11] WARN wres.tasker.JobOutputConsumer - Failed to offer job output message [8, -1, 36, 18, 11, 8, -96, -51, -17, -86, 6, 16, -92, -64, -87, 6, 24, 1] after PT30S, gave up.
2023-11-20T23:07:42.213+0000 [pool-1-thread-3] WARN wres.tasker.JobOutputConsumer - Failed to offer job output message [8, -128, 37, 18, 12, 8, -97, -51, -17, -86, 6, 16, -29, -15, -83, -86, 3, 24, 1] after PT30S, gave up.
2023-11-20T23:07:55.845+0000 [JobResults Thread 127] INFO wres.tasker.JobResults - Deleting the queue job.96485800192253237.STDERR
2023-11-20T23:07:55.845+0000 [JobResults Thread 128] INFO wres.tasker.JobOutputWatcher - Deleting the queue job.96485800192253237.output
2023-11-20T23:07:55.945+0000 [JobResults Thread 126] INFO wres.tasker.JobResults - Deleting the queue job.96485800192253237.STDOUT
2023-11-20T23:07:56.144+0000 [JobResults Thread 131] INFO wres.tasker.JobResults - Deleting the queue job.6999219676122400010.STDOUT
2023-11-20T23:07:56.445+0000 [JobResults Thread 134] INFO wres.tasker.JobStatusWatcher - Deleting the queue job.6999219676122400010.status
2023-11-20T23:07:56.645+0000 [JobResults Thread 129] INFO wres.tasker.JobStatusWatcher - Deleting the queue job.96485800192253237.status
2023-11-20T23:07:56.745+0000 [JobResults Thread 133] INFO wres.tasker.JobOutputWatcher - Deleting the queue job.6999219676122400010.output
2023-11-20T23:07:56.745+0000 [JobResults Thread 132] INFO wres.tasker.JobResults - Deleting the queue job.6999219676122400010.STDERR
2023-11-20T23:08:12.014+0000 [pool-1-thread-11] WARN wres.tasker.JobOutputConsumer - Failed to offer job output message [8, -128, 37, 18, 11, 8, -95, -51, -17, -86, 6, 16, -88, -122, -73, 6, 24, 1] after PT30S, gave up.
2023-11-20T23:08:12.213+0000 [pool-1-thread-3] WARN wres.tasker.JobOutputConsumer - Failed to offer job output message [8, -127, 37, 18, 12, 8, -96, -51, -17, -86, 6, 16, -85, -1, -71, -86, 3, 24, 1] after PT30S, gave up.
2023-11-20T23:08:31.186+0000 [JobResults Thread 134] WARN wres.tasker.JobResults - A throwable was received while watching for messages in a queue.
org.redisson.client.WriteRedisConnectionException: Unable to write command into connection! Increase connection pool size. Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@2139219676 [redisClient=[addr=redis://persister:6379], channel=[id: 0xfc168d54, L:/172.19.254.130:49370 ! R:persister/172.19.254.137:6379], currentCommand=null, usage=1], command: (HGET), params: [redisson_live_object:{043e1336393939323139363736313232343030303130}:wres.tasker.JobMetadata, PooledUnsafeDirectByteBuf(ridx: 0, widx: 11, cap: 256)] after 9 retry attempts
    at org.redisson.command.RedisExecutor.checkWriteFuture(RedisExecutor.java:325)
    at org.redisson.command.RedisExecutor.lambda$execute$2(RedisExecutor.java:169)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:877)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:940)
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1247)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: io.netty.channel.StacklessClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
2023-11-20T23:08:31.186+0000 [JobResults Thread 131] WARN wres.tasker.JobResults - A throwable was received while watching for messages in a queue.

Why are her evaluations consistently running into this issue, but not the smoke tests?

Regardless, this points to production being in a bad state right now, at least for Anna's evaluations. More in a bit,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T20:58:31Z


Before I present my theory, note that the CNRFC D-Store evaluations generate tons of messages because they span an entire water year, with NWM v2.2 available before 9/20 and NWM v3.0 available after 8/17. Both the 2.2 and 3.0 D-Store directories are parsed for all dates in the year, so every time a file is looked for in the 3.0 D-Store prior to 8/17, it becomes part of a warning; the same is true for 2.2 after 9/20. A typical warning looks like this:

2023-11-20T23:06:34.471+0000 WARN NwmTimeSeries Skipping NWM TimeSeries (not found) with reference datetime 2023-02-09T18:00:00Z and profile NwmProfile[blobCount=240,memberCount=1,durationBetweenValidDatetimes=PT1H,isVector=true,timeLabel=f,nwmConfiguration=medium_range,nwmOutputType=channel_rt,nwmSubdirectoryPrefix=medium_range,nwmLocationLabel=conus,isEnsembleLike=true,durationPastMidnight=PT0S,durationBetweenReferenceDatetimes=PT6H] from https://nwcal-dstore.[domain]/nwm/3.0/. The following resources were inspected: [https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f001.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f002.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f003.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f004.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f005.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f006.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f007.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f008.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f009.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f010.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f011.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f012.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f013.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f014.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f015.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f016.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f017.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f018.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f019.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f020.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f021.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f022.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f023.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f024.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f025.conus.nc, 
https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f026.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f027.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f028.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f029.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f030.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f031.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f032.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f033.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f034.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f035.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f036.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f037.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f038.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f039.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f040.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f041.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f042.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f043.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f044.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f045.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f046.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f047.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f048.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f049.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f050.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f051.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f052.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f053.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f054.conus.nc, 
https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f055.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f056.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f057.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f058.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f059.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f060.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f061.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f062.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f063.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f064.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f065.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f066.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f067.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f068.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f069.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f070.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f071.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f072.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f073.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f074.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f075.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f076.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f077.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f078.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f079.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f080.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f081.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f082.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f083.conus.nc, 
https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f084.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f085.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f086.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f087.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f088.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f089.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f090.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f091.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f092.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f093.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f094.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f095.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f096.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f097.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f098.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f099.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f100.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f101.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f102.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f103.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f104.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f105.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f106.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f107.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f108.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f109.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f110.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f111.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f112.conus.nc, 
https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f113.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f114.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f115.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f116.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f117.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f118.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f119.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f120.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f121.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f122.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f123.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f124.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f125.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f126.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f127.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f128.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f129.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f130.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f131.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f132.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f133.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f134.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f135.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f136.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f137.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f138.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f139.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f140.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f141.conus.nc, 
https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f142.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f143.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f144.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f145.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f146.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f147.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f148.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f149.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f150.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f151.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f152.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f153.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f154.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f155.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f156.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f157.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f158.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f159.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f160.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f161.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f162.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f163.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f164.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f165.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f166.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f167.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f168.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f169.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f170.conus.nc, 
https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f171.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f172.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f173.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f174.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f175.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f176.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f177.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f178.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f179.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f180.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f181.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f182.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f183.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f184.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f185.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f186.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f187.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f188.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f189.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f190.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f191.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f192.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f193.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f194.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f195.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f196.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f197.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f198.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f199.conus.nc, 
https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f200.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f201.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f202.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f203.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f204.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f205.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f206.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f207.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f208.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f209.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f210.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f211.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f212.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f213.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f214.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f215.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f216.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f217.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f218.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f219.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f220.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f221.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f222.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f223.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f224.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f225.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f226.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f227.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f228.conus.nc, 
https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f229.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f230.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f231.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f232.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f233.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f234.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f235.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f236.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f237.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f238.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f239.conus.nc, https://nwcal-dstore.[domain]/nwm/3.0/nwm.20230209/medium_range_mem1/nwm.t18z.medium_range.channel_rt_1.f240.conus.nc].

There were two CNRFC evaluations running at the same time: 96485800192253237 and 6999219676122400010.

Bottom line: there are a ton of these messages being stored in the persister; each of those warnings lists up to 240 inspected URLs, so a year's worth of misses adds up to tens of megabytes of stdout. Then one of the evaluations completes, and it sends the output info back through the broker. The @JobOutputConsumer@ gets the messages and attempts to queue them in its @ArrayBlockingQueue@ with a size of 100 (@private static final int LOCAL_Q_SIZE = 100@ in @JobOutputWatcher@). As the queue grows, the @JobOutputSharer@ wants to "share" the output data with the persister, but it can't because of the pool size: there are so many stdout messages waiting to be processed that the output cannot be stored in the persister.
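
As a rough sketch of that mechanism (illustrative only, not the actual @JobOutputConsumer@/@JobOutputSharer@ code), a bounded queue plus a timed offer produces exactly the "gave up" warnings above once the draining side stalls:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Rough sketch of the local queueing between the broker consumer and the persister sharer.
public class OutputQueueSketch
{
    private static final int LOCAL_Q_SIZE = 100;          // mirrors LOCAL_Q_SIZE in JobOutputWatcher
    private static final long OFFER_TIMEOUT_SECONDS = 30; // mirrors the PT30S in the warnings

    private final BlockingQueue<byte[]> localQueue = new ArrayBlockingQueue<>( LOCAL_Q_SIZE );

    // Called for each output message delivered from the broker.
    void onOutputMessage( byte[] message ) throws InterruptedException
    {
        // If the sharer draining this queue (writing to the persister) is stuck,
        // the queue stays full and the offer times out after 30 seconds.
        boolean accepted = localQueue.offer( message, OFFER_TIMEOUT_SECONDS, TimeUnit.SECONDS );
        if ( !accepted )
        {
            System.err.println( "Failed to offer job output message after PT30S, gave up." );
        }
    }
}

The timed offer acts as a back-pressure valve: when the persister cannot keep up, messages are dropped rather than blocking the broker consumer indefinitely.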

Exceptions start flying (e.g., @WriteRedisConnectionException@), the watchers hit those exceptions and clean up in their @finally@ clauses, and suddenly there are no longer any queues to receive the messages.

Does this seem plausible? That is, that messages going into the persister were outpacing the rate at which the persister could store them, leading eventually to exceptions and queues being deleted prematurely.

If so, were there any significant changes to how stdout logging is shared back through the broker as part of 6.16 (not 6.17)?

Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T20:59:14Z


A mitigation for this specific case would be to reduce the amount of logging associated with the D-Store. Perhaps we don't need a warning every time a file is looked for but not found.
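
One possible shape for that change, purely illustrative (the class and method names below are hypothetical, not the existing @NwmTimeSeries@ code): log individual misses at debug and emit an occasional summary at warn.

import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch: summarize D-Store misses instead of one large WARN per missing time-series.
public class MissingSeriesLogSketch
{
    private static final Logger LOGGER = LoggerFactory.getLogger( MissingSeriesLogSketch.class );
    private final AtomicLong missingCount = new AtomicLong();

    void recordMissingSeries( String referenceDatetime )
    {
        // Individual miss (with the long URL list, if desired) only at debug level.
        LOGGER.debug( "Skipping NWM TimeSeries (not found) with reference datetime {}.", referenceDatetime );

        // One compact WARN per 500 misses rather than one huge WARN per miss.
        long count = missingCount.incrementAndGet();
        if ( count % 500 == 0 )
        {
            LOGGER.warn( "{} NWM time-series were not found in the D-Store so far; enable debug logging for details.", count );
        }
    }
}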

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-21T21:15:29Z


One more interesting note... In the previously shared chunk of the tasker log, 8 queues are closed in quick succession:

2023-11-20T23:07:55.845+0000 [JobResults Thread 127] INFO wres.tasker.JobResults - Deleting the queue job.96485800192253237.STDERR
2023-11-20T23:07:55.845+0000 [JobResults Thread 128] INFO wres.tasker.JobOutputWatcher - Deleting the queue job.96485800192253237.output
2023-11-20T23:07:55.945+0000 [JobResults Thread 126] INFO wres.tasker.JobResults - Deleting the queue job.96485800192253237.STDOUT
2023-11-20T23:07:56.144+0000 [JobResults Thread 131] INFO wres.tasker.JobResults - Deleting the queue job.6999219676122400010.STDOUT
2023-11-20T23:07:56.445+0000 [JobResults Thread 134] INFO wres.tasker.JobStatusWatcher - Deleting the queue job.6999219676122400010.status
2023-11-20T23:07:56.645+0000 [JobResults Thread 129] INFO wres.tasker.JobStatusWatcher - Deleting the queue job.96485800192253237.status
2023-11-20T23:07:56.745+0000 [JobResults Thread 133] INFO wres.tasker.JobOutputWatcher - Deleting the queue job.6999219676122400010.output
2023-11-20T23:07:56.745+0000 [JobResults Thread 132] INFO wres.tasker.JobResults - Deleting the queue job.6999219676122400010.STDERR

That occurs in the @finally@ portion of the tasker queue watchers.

Those log messages are followed by 8 @WriteRedisConnectionException@ messages that are dumped to the tasker log. Note that that type of exception is not caught in the watchers; only @IOException@ and @InterruptedException@ are caught.

The implication: The @WriteRedisConnectionException@ occurs for 8 queues at about the same time, causing each to be deleted. The exception is not caught, so each exception is dumped to stdout and displayed in the tasker logs.
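
A minimal Java sketch of that control flow, not the actual watcher code: only @IOException@ and @InterruptedException@ are handled, an unchecked exception stands in for @WriteRedisConnectionException@, and the @finally@ block still deletes the queue on the way out. Class and method names are hypothetical.

import java.io.IOException;

public class WatcherFinallySketch
{
    public static void main( String[] args )
    {
        try
        {
            doWatch(); // declared to throw the two exceptions the watcher handles
        }
        catch ( IOException | InterruptedException e )
        {
            System.err.println( "Handled by the watcher: " + e );
        }
        finally
        {
            // Runs whether or not the exception above was caught, so the queue
            // disappears even though the failure was never handled.
            System.out.println( "Deleting the queue job.<id>.output" );
        }
        // The unchecked stand-in escapes here and is dumped to stdout/stderr,
        // just like the WriteRedisConnectionException in the tasker log.
    }

    private static void doWatch() throws IOException, InterruptedException
    {
        // Simulates the persister write failing mid-watch with an unchecked exception.
        throw new IllegalStateException( "stand-in for WriteRedisConnectionException" );
    }
}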

In the persister logs there was an AOF rewrite about 23 minutes before the failures here:

1:M 20 Nov 2023 22:44:18.024 * Starting automatic rewriting of AOF on 15% growth
1:M 20 Nov 2023 22:44:18.045 * Background append only file rewriting started by pid 76970
1:M 20 Nov 2023 22:44:29.363 * AOF rewrite child asks to stop sending diffs.
76970:C 20 Nov 2023 22:44:29.363 * Parent agreed to stop sending diffs. Finalizing AOF...
76970:C 20 Nov 2023 22:44:29.363 * Concatenating 0.71 MB of AOF diff received from parent.
76970:C 20 Nov 2023 22:44:29.374 * SYNC append only file rewrite performed
76970:C 20 Nov 2023 22:44:29.388 * AOF rewrite: 8 MB of memory used by copy-on-write
1:M 20 Nov 2023 22:44:29.437 * Background AOF rewrite terminated with success
1:M 20 Nov 2023 22:44:29.439 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
1:M 20 Nov 2023 22:44:29.442 * Background AOF rewrite finished successfully

I doubt that is related, since the time differential is pretty big, but I'm not sure.

Anyway, my workday is done. I'll continue to look at this tomorrow. My gut tells me it's an overloaded Redis persister, with the easiest mitigation being a reduction in D-Store logging, but perhaps there are other solutions that would be more permanent, instead of just pushing the problem down the road until someone comes up with another evaluation that yields tons of logging.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-22T12:13:08Z


We have another CNRFC evaluation to look at:

 CNRFC - NWM vs RSA Streamflow Groups       | c7e2664bc4bf8611a9d3893f2734a418 | execute lab | f      | 2023-11-21 19:33:59.69038  | 2023-11-22 03:32:16.811216 |  478.28534726699195

The stdout is complete. No signs of it being dropped and restarted (there were no reboots). The output can be obtained. FYI: the stdout is about 40 MB uncompressed.

My working theory is still that the persister just got too busy, ran out of connections, and Redisson started throwing exceptions.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-22T12:18:33Z


Sounds right, and I am sure this has happened in the past. I am pretty sure that increasing the number of connections was a mitigation that Jesse considered, but I think that only works to a point. You would need to search the historical tickets, and Redmine doesn't have the best search engine ever.
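
For reference, a sketch of what the "more connections" mitigation might look like with Redisson's single-server configuration. The address, pool size, and timeout below are illustrative assumptions, not the COWRES settings, and, as noted above, raising the pool size only helps up to a point.

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

// Sketch only: build a Redisson client with a larger connection pool so that
// bursts of stdout/output messages are less likely to exhaust connections.
public class PersisterClientSketch
{
    public static RedissonClient create()
    {
        Config config = new Config();
        config.useSingleServer()
              .setAddress( "redis://persister:6379" ) // hypothetical persister host and port
              .setConnectionPoolSize( 128 )           // Redisson's default is 64
              .setConnectionMinimumIdleSize( 32 )
              .setTimeout( 10_000 );                  // command response timeout, in ms

        return Redisson.create( config );
    }
}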

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-22T12:24:33Z


I'm going to grab some of these CNRFC declarations for testing after 6.17 goes out. If my theory is correct and the broker-to-persister process is becoming bottlenecked at some point, causing Redis connections to be requested faster than they can be released, then we'll need to be able to reproduce the problem in staging in order to identify the cause of the bottleneck and address it.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-22T12:25:13Z


James wrote:

Sounds right, and I am sure this has happened in the past. I am pretty sure that increasing the number of connections was a mitigation that Jesse considered, but I think that only works to a point. You would need to search the historical tickets, and Redmine doesn't have the best search engine ever.

Thanks, James. I'll probably start doing that research next week. I'll focus on preparing declarations to support reproducing it today and then move on to other work I've been delaying.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-22T12:30:16Z


The logs from the second incident yesterday, and the declarations for the two CNRFC jobs being processed at the same time, are in @/home/ISED/wres/wresTestData/issue122858@. I'm going to scan the tasker log to see if there were any other jobs running at about the same time and collect those declarations as well. These are 8+ hour jobs, so the window for other evaluations can be big. If there are a lot, I'll just grab a subset.

By the way, it looks like Anna posted the same evaluation twice, because the two declarations are identical.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-22T12:33:37Z


There were no other jobs posted while those two were running. So, to reproduce, post the two declarations back to back and wait to see if the persister becomes bogged down and exceptions start occurring.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-22T12:36:18Z


I'm leaving this on-hold.

The time estimate allows for a few hours reproducing the problem in staging and investigating it to identify the bottleneck, a few hours relieving the bottleneck, and some buffer time in case it is not that easy.

I'm leaving this in 6.18 since it impacted a user.

Moving on to other things,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-22T15:00:51Z


Related to tickets wherein the same exception occurred. For #121603, I think we pointed fingers at an IT issue causing a breakdown in the evaluation. #94134 was never addressed.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-30T17:23:13Z


Hank:

See if you can reproduce the issue in staging by posting the declaration twice.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-30T18:40:33Z


Using production job 96485800192253237 to obtain the YAML, I posted these two evaluations to staging:

2416545165501633932 4487409139391289243

They were posted at about the same time. Let's see what happens,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-30T21:26:05Z


Evaluations are still in progress. I'll look at them tomorrow.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-12-01T13:07:04Z


The evaluations both succeeded with the output list available and (seemingly) full stdout logging.

I'm moving this ticket to the backlog and lowering the priority. Yes, the persister can still become a bottleneck in theory, but it's not easy to reproduce what happened in production with 6.16 using 6.17 in staging. That being said, I'm still going to look for that D-Store logging ticket and move it to 6.18 in the hopes that someone can address it soon.

Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-12-01T13:24:37Z


Modified the title to include the words "persister bottleneck", hopefully making this ticket easier to find in the future.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-01-19T13:59:58Z


Evan:

I'm flagging this ticket because it's noted in the 6.19 planning ticket, and because you added a dependency on the persister to the worker/worker-shim. In a comment above, we noted that IT issues could cause delays or otherwise hinder communications with the persister. What would be the impact of IT issues like that on your changes for a persistent cache? Have you hardened your changes against artificial delays in communication created by questionable IT? Could such delays lead to failures, or would the worker/worker-shim just fight through them?

More generally, I wonder whether the changes made to support the worker-server mode and the persistent cache impact this ticket or the related #94134. I tend to think not, because this issue and #94134 are more about tasker-persister communications, and your changes are mostly tasker-broker-worker-shim and worker-shim-persister communications. Let me know what you think.

Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-01-19T14:05:44Z


So, with the persistent cache changes, something important to note is that they will only affect the workflow of a job if we don't get to it normally (like if a job were to complete before a thread redirect could be established). We are always going to want to default to using just the default stream of information being produced by the job. If the IT slowness is bad enough that it causes that to fail as well, then things are still going to break, but the persistent cache is more of a backup (we don't rely on it for function; we use it in case the default flow is incorrect, or you want to check a finished job).

To summarize the above: we are not reliant on the persistent cache; it only adds to our resiliency.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-01-19T14:10:37Z


Also, regarding ticket #94134, I don't think this will have any effect, in the sense that if the tasker-broker-persister chain breaks, the information still isn't going to be presented to the end user.

As of the last deploy, we now persist logs from the core, which makes sure we keep full logs, but from the perspective of that ticket nothing has changed that makes a user more guaranteed to get them.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-01-19T14:20:47Z


Thanks. So the persistent cache just adds to resiliency, and problems related to tasker-persister communications, caused in the past by IT issues, would still be problems. Your code won't help that, and it won't make it worse.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-01-19T14:34:27Z


That is correct. I believe that thread allocation is distinct for each Redisson client we connect to the persister, so having the server write data won't interfere with the tasker's resources. If we were being bottlenecked by, say, the disk speed of the persister, then this could cause more issues, but from my understanding it has been more network related than anything else.