Open epag opened 3 weeks ago
Original Redmine Comment Author Name: James (James) Original Date: 2023-11-08T14:34:58Z
Hank wrote:
Is there any reasonable way for us to adjust the WRES code to handle this situation? If necessary, I can start tracking the amount of time between the different machine reboots, in case its just a matter of increasing some sort of retry-cycle to wait longer for Postgres to come back.
Within the core application, I don't think so. Ultimately, database connections can be lost for a variety of reasons and, once retries have been exhausted, the context doesn't matter.
Likewise, if a postgres server returns an error code that could appear in a variety of contexts, some of which might be recoverable from a service-level or higher perspective and others not, then I don't think there's anything we can do in the core app, we need to look higher up.
Of course, we can always propagate error codes and the service layer has complete control over what it retries.
In theory, we could implement a rule above the service level (but respected by the service) that says "any evaluation that fails during this datetime interval should be retried by default". It would be a bit of a tank-at-knife-fight approach, but I don't see much joy in inspecting error codes from postgres etc. because they likely don't clarify machine reboots as distinct from other server problems.
Original Redmine Comment Author Name: James (James) Original Date: 2023-11-08T14:47:19Z
These are the error codes we have to work with in terms of adding retry resilience:
https://www.postgresql.org/docs/current/errcodes-appendix.html
However, connection behavior is largely managed through Hikari connection pools, other than test connections at server start-up (which means per evaluation in the standalone).
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-09T17:36:12Z
Notes from meeting:
Do we have examples of it occurring in production that we can refer to? If not, perhaps do an experiment next week where I post a long evaluation just prior to the reboots. Report on stdout logging the next day. Is the error code included?
Discussed what kind of failure and will occur and how it will be logged. May need to be orchestrated in production.
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-09T17:42:04Z
I found a complete exception message for a failed evaluation due to reboots with job id 8953870426895545144.
Hank
=================================
2023-11-08T21:38:48.190+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@14a83e56
2023-11-08T21:38:48.202+0000 WARN DatabaseLockManagerPostgres Exception while managing connections:
java.lang.IllegalStateException: Could not get a raw database connection.
at wres.io.database.Database.getRawConnection(Database.java:186)
at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: FATAL: the database system is shutting down
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
at wres.io.database.Database.getRawConnection(Database.java:331)
at wres.io.database.Database.getRawConnection(Database.java:182)
... 8 common frames omitted
2023-11-08T21:38:49.203+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@14a83e56
2023-11-08T21:38:49.215+0000 WARN DatabaseLockManagerPostgres Exception while managing connections:
java.lang.IllegalStateException: Could not get a raw database connection.
at wres.io.database.Database.getRawConnection(Database.java:186)
at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: FATAL: the database system is shutting down
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
at wres.io.database.Database.getRawConnection(Database.java:331)
at wres.io.database.Database.getRawConnection(Database.java:182)
... 8 common frames omitted
2023-11-08T21:38:50.216+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@14a83e56
2023-11-08T21:38:50.228+0000 WARN DatabaseLockManagerPostgres Exception while managing connections:
java.lang.IllegalStateException: Could not get a raw database connection.
at wres.io.database.Database.getRawConnection(Database.java:186)
at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: FATAL: the database system is shutting down
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
at wres.io.database.Database.getRawConnection(Database.java:331)
at wres.io.database.Database.getRawConnection(Database.java:182)
... 8 common frames omitted
2023-11-08T21:38:51.228+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@14a83e56
2023-11-08T21:38:51.240+0000 WARN DatabaseLockManagerPostgres Exception while managing connections:
java.lang.IllegalStateException: Could not get a raw database connection.
at wres.io.database.Database.getRawConnection(Database.java:186)
at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: FATAL: the database system is shutting down
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
at wres.io.database.Database.getRawConnection(Database.java:331)
at wres.io.database.Database.getRawConnection(Database.java:182)
... 8 common frames omitted
2023-11-08T21:38:52.241+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (1), lost org.postgresql.jdbc.PgConnection@14a83e56
2023-11-08T21:38:52.253+0000 WARN DatabaseLockManagerPostgres Exception while managing connections:
java.lang.IllegalStateException: Could not get a raw database connection.
at wres.io.database.Database.getRawConnection(Database.java:186)
at wres.io.database.locking.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:823)
at wres.io.database.locking.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1056)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: FATAL: the database system is shutting down
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
at wres.io.database.Database.getRawConnection(Database.java:331)
at wres.io.database.Database.getRawConnection(Database.java:182)
... 8 common frames omitted
2023-11-08T21:38:52.527+0000 WARN DatabaseLockManagerPostgres Unable to unlock remaining database locks.
org.postgresql.util.PSQLException: This connection has been closed.
at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:993)
at org.postgresql.jdbc.PgConnection.createStatement(PgConnection.java:1819)
at org.postgresql.jdbc.PgConnection.createStatement(PgConnection.java:518)
at wres.io.database.locking.DatabaseLockManagerPostgres.releaseSingleLock(DatabaseLockManagerPostgres.java:1227)
at wres.io.database.locking.DatabaseLockManagerPostgres.pgUnlockShared(DatabaseLockManagerPostgres.java:757)
at wres.io.database.locking.DatabaseLockManagerPostgres.shutdown(DatabaseLockManagerPostgres.java:152)
at wres.pipeline.Evaluator.evaluate(Evaluator.java:316)
at wres.pipeline.Evaluator.evaluate(Evaluator.java:125)
at wres.Functions.execute(Functions.java:282)
at wres.Functions.call(Functions.java:151)
at wres.Main.completeExecution(Main.java:236)
at wres.Main.main(Main.java:136)
2023-11-08T21:38:52.527+0000 WARN DatabaseLockManagerPostgres WRES did not clean up by unlocking these shared locks: [1]
2023-11-08T21:38:52.528+0000 INFO Functions The function 'execute' took PT5H33M57.459932344S
2023-11-08T21:41:40.982+0000 ERROR Main Operation 'execute' completed unsuccessfully
wres.pipeline.InternalWresException: Could not complete project execution
at wres.pipeline.Evaluator.evaluate(Evaluator.java:297)
at wres.pipeline.Evaluator.evaluate(Evaluator.java:125)
at wres.Functions.execute(Functions.java:282)
at wres.Functions.call(Functions.java:151)
at wres.Main.completeExecution(Main.java:236)
at wres.Main.main(Main.java:136)
Caused by: wres.pipeline.WresProcessingException: Encountered an error while processing evaluation 'uDTOcmmq1EvLbUik71XXs9nvzg4':
at wres.pipeline.EvaluationUtilities.evaluate(EvaluationUtilities.java:306)
at wres.pipeline.Evaluator.evaluate(Evaluator.java:271)
... 5 common frames omitted
Caused by: wres.pipeline.WresProcessingException: Project failed to complete with the following error:
at wres.pipeline.EvaluationUtilities.evaluate(EvaluationUtilities.java:608)
at wres.pipeline.EvaluationUtilities.evaluate(EvaluationUtilities.java:250)
... 6 common frames omitted
Caused by: java.util.concurrent.CompletionException: wres.io.retrieving.DataAccessException: Failed to access the time-series data.
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: wres.io.retrieving.DataAccessException: Failed to access the time-series data.
at wres.io.retrieving.database.TimeSeriesRetriever.getTimeSeriesFromScript(TimeSeriesRetriever.java:516)
at wres.io.retrieving.database.EnsembleForecastRetriever.get(EnsembleForecastRetriever.java:184)
at wres.io.retrieving.database.EnsembleForecastRetriever.get(EnsembleForecastRetriever.java:30)
at wres.pipeline.pooling.PoolSupplier.createPool(PoolSupplier.java:231)
at wres.pipeline.pooling.PoolSupplier.get(PoolSupplier.java:201)
at wres.pipeline.pooling.PoolSupplier.get(PoolSupplier.java:79)
at wres.pipeline.pooling.PoolFactory$SupplierWithPoolRequest.get(PoolFactory.java:2398)
at wres.pipeline.pooling.PoolProcessor.get(PoolProcessor.java:130)
at wres.pipeline.pooling.PoolProcessor.get(PoolProcessor.java:56)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
... 3 common frames omitted
Caused by: java.sql.SQLException: The Query:
wres.io.database.Query@669ce252[script=SELECT
metadata.series_id AS series_id,
metadata.reference_time + INTERVAL '1' MINUTE * TSV.lead AS valid_time,
metadata.reference_time,
ARRAY_AGG(TSV.series_value ORDER BY TS.ensemble_id) AS ensemble_members,
ARRAY_AGG(TS.ensemble_id ORDER BY TS.ensemble_id) AS ensemble_ids,
metadata.measurementunit_id,
metadata.scale_period,
metadata.scale_function,
metadata.feature_id,
metadata.occurrences
FROM
(
SELECT
S.source_id AS series_id,
MAX( reference_time ) AS reference_time,
S.feature_id,
S.measurementunit_id,
TimeScale.duration_ms AS scale_period,
TimeScale.function_name AS scale_function,
COUNT(*) AS occurrences
FROM wres.Source S
INNER JOIN wres.ProjectSource PS
ON PS.source_id = S.source_id
INNER JOIN wres.TimeSeriesReferenceTime TSRT
ON TSRT.source_id = S.source_id
LEFT JOIN wres.TimeScale TimeScale
ON TimeScale.timescale_id = S.timescale_id
WHERE PS.project_id = ?
AND S.variable_name = ?
AND S.feature_id = ?
AND PS.member = ?
GROUP BY S.source_id,
S.measurementunit_id,
TimeScale.duration_ms,
TimeScale.function_name
) AS metadata
INNER JOIN wres.TimeSeries TS
ON TS.source_id = metadata.series_id
INNER JOIN wres.TimeSeriesValue TSV
ON TSV.timeseries_id = TS.timeseries_id
WHERE TSV.lead > ?
AND TSV.lead <= ?
AND ( EXTRACT( MONTH FROM metadata.reference_time ) > ? OR ( EXTRACT( MONTH FROM metadata.reference_time ) = ? AND EXTRACT( DAY FROM metadata.reference_time ) >= ? ) )
AND ( EXTRACT( MONTH FROM metadata.reference_time ) < ? OR ( EXTRACT( MONTH FROM metadata.reference_time ) = ? AND EXTRACT( DAY FROM metadata.reference_time ) <= ? ) )
GROUP BY metadata.series_id,metadata.reference_time, metadata.feature_id, TSV.lead, metadata.scale_period, metadata.scale_function, metadata.measurementunit_id,metadata.occurrences
,forceTransaction=false,useCursor=true,parameters={5139,SQIN,2081041,right,502200,505080,1,1,1,12,12,31},batchParameters=<null>,sqlStatesToRetry=[],insertedIds=[]]
Failed.
at wres.io.database.Query.call(Query.java:448)
at wres.io.database.Database.buffer(Database.java:276)
at wres.io.database.DataScripter.buffer(DataScripter.java:228)
at wres.io.retrieving.database.TimeSeriesRetriever.getTimeSeriesFromScript(TimeSeriesRetriever.java:466)
... 12 common frames omitted
Caused by: org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2713)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2401)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:134)
at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52)
at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java)
at com.github.marschall.jfr.jdbc.JfrPreparedStatement.executeQuery(JfrPreparedStatement.java:70)
at wres.io.database.Query.callWithParameters(Query.java:709)
at wres.io.database.Query.call(Query.java:410)
... 15 common frames omitted
2023-11-08T21:41:40.983+0000 INFO Main Closing the application...
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-11-09T17:45:22Z
Evan mentioned putting the job into a different state causing it to be requeued. However, James recommended coding in resilience at the app level, not service layer.
Hank
Original Redmine Comment Author Name: James (James) Original Date: 2024-06-28T11:55:38Z
Arguably should be phrased more generally, but there is a lot of NWCAL-centric commentary in this ticket. If we migrate to github, it will need to be rewritten.
Author Name: Hank (Hank) Original Redmine Issue: 122395, https://vlab.noaa.gov/redmine/issues/122395 Original Date: 2023-11-08
The problem: When a Wednesday reboot sequence occurs, the order of the machines going down and coming up is unpredictable. If the Postgres server for an evaluation goes down before the application servers, then on-going evaluations will fail. If the Postgres server comes back after the application servers, then on-going evaluations that were picked back up will fail. In both cases, its because the Postgres server communication is cutoff.
The WRES should be able to pick back up on-going evaluations if the application servers are brought down before Postgres is brought down, and Postgres is brought back up before the application servers are brought back up. I asked for that sequence to be put in place in ITSG-1250:
https://jira.nws.noaa.gov/servicedesk/customer/portal/1/ITSG-1250
However, ITSG reported,
Is there any reasonable way for us to adjust the WRES code to handle this situation? If necessary, I can start tracking the amount of time between the different machine reboots, in case its just a matter of increasing some sort of retry-cycle to wait longer for Postgres to come back.
This ticket can be resolved once an approach is identified and implemented. This ticket can be rejected if there is really nothing we can do.
There are problem old tickets related to this request; I'll find them and relate them later. Thanks,
Hank