As a user, I don't expect to see an exception indicating another WRES instance is performing a conflicting function when my instance is the only instance running - Githubissues

epag commented 3 weeks ago

Author Name: Hank (Hank) Original Redmine Issue: 93204, https://vlab.noaa.gov/redmine/issues/93204 Original Date: 2021-06-16 Original Assignee: Hank

Hesitant to report this, as it may be a product of my executing the software in stand-alone mode on the -ti01. Here is the declaration:

<?xml version='1.0' encoding='UTF-8'?>
<project name="BV Streamflow">

<inputs>
   <left featureDimension="nws_lid" label="OBS Streamflow">
      <type>observations</type>
      <source format="PI-XML">/home/ISED/wres/wresTestData/issue92087/inputs/NCKN6_QME.xml</source>
      <variable>QME</variable>
      <existingTimeScale>
         <function>mean</function>
         <period>24</period>
         <unit>hours</unit>
      </existingTimeScale>
   </left>
   <right label="HEFS Streamflow">
      <type>ensemble forecasts</type>
      <source format="PI-XML">/home/ISED/wres/wresTestData/issue92087/inputs/NCKN6HUD_Streamflow_HEFS_Baseline_Validation.tgz</source>
      <variable>SQIN</variable>
      <removeMemberByValidYear earliestDay="1" earliestMonth="1"/>
   </right>
   <baseline separateMetrics="true" label="ESP Streamflow">
      <type>ensemble forecasts</type>
      <source format="PI-XML">/home/ISED/wres/wresTestData/issue92087/inputs/NCKN6HUD_Streamflow_ESP_Baseline_Validation.tgz</source>
      <variable>SQIN</variable>
      <removeMemberByValidYear earliestDay="1" earliestMonth="1"/>
   </baseline>
</inputs>

<pair label="Daily streamflow">
   <unit>ft3/s</unit>
   <feature left="NCKN6" right="NCKN6HUD" baseline="NCKN6HUD"/>
   <leadHours minimum="18" maximum="720"/>
   <season>
      <earliestMonth>1</earliestMonth>
      <earliestDay>1</earliestDay>
      <latestMonth>12</latestMonth>
      <latestDay>31</latestDay>
   </season>
   <desiredTimeScale>
      <function>mean</function>
      <period>24</period>
      <unit>hours</unit>
      <frequency>24</frequency>
   </desiredTimeScale>
   <leadTimesPoolingWindow>
      <period>24</period>
      <frequency>24</frequency>
      <unit>hours</unit>
   </leadTimesPoolingWindow>
   <crossPair exact="true"/>
</pair>
<metrics>
   <thresholds>
      <type>probability</type>
      <applyTo>left</applyTo>
      <commaSeparatedValues>0.01,0.1,0.5,0.90,0.95,0.99,0.995,0.999</commaSeparatedValues>
      <operator>greater than</operator>
   </thresholds>
   <metric>
      <name>brier score</name>
   </metric>
   <metric>
      <name>brier skill score</name>
   </metric>
   <metric>
      <name>pearson correlation coefficient</name>
   </metric>
   <metric>
      <name>quantile quantile diagram</name>
   </metric>
   <metric>
      <name>bias fraction</name>
   </metric>
   <metric>
      <name>continuous ranked probability score</name>
   </metric>
   <metric>
      <name>continuous ranked probability skill score</name>
   </metric>
   <metric>
      <name>mean absolute error</name>
   </metric>
   <metric>
      <name>mean error</name>
   </metric>
   <metric>
      <name>root mean square error</name>
   </metric>
   <metric>
      <name>box plot of errors by forecast value</name>
   </metric>
   <metric>
      <name>box plot of errors by observed value</name>
   </metric>
   <metric>
      <name>rank histogram</name>
   </metric>
   <metric>
      <name>relative operating characteristic diagram</name>
   </metric>
   <metric>
      <name>relative operating characteristic score</name>
   </metric>
   <metric>
      <name>reliability diagram</name>
   </metric>
   <metric>
      <name>sample size</name>
   </metric>
</metrics>

<outputs durationFormat="hours">
   <destination type="csv2" decimalFormat="0.000">
      <outputType>lead threshold</outputType>
   </destination>
   <destination type="graphic">
      <outputType>lead threshold</outputType>
      <graphical width="800" height="600"/>
  </destination>
</outputs>

</project>
</code>

Here is the pertinent part of the log:

2021-06-16T15:54:13.338+0000 INFO SourceLoader Parsing the declared datasets. Depending on many factors (including dataset size, dataset design, data service implementation, service availability, network bandwidth, network latency, storage bandwidth, storage latency, concurrent evaluations on shared resources, concurrent computation on shared resources) this can take a while...
2021-06-16T15:59:40.341+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@48980632
2021-06-16T15:59:40.373+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.378+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.384+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.389+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.395+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.408+0000 WARN DatabaseLockManagerPostgres Exception while managing connections:
wres.system.DatabaseLockFailed: Another WRES instance is performing a conflicting function. Failed to lock|unlock with prefix=2, lockName=-3156226, operation=LOCK_EXCLUSIVE
        at wres.system.DatabaseLockManagerPostgres.testAndRefresh(DatabaseLockManagerPostgres.java:1045)
        at wres.system.DatabaseLockManagerPostgres$RefreshConnectionsTask.run(DatabaseLockManagerPostgres.java:1093)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

I'll post the complete stdout log in the first comment. I'm wondering if a hiccup database side caused this; I'm just not sure how to check. I'll see what I can find.

Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T16:56:20Z

Log is attached. I ran this on the -ti01 against the wres6 database on nwcal-wresdb-dev01. It was a run immediately after a clean-database.

I'm going to start by looking at Check_MK to see if it tells me anything (though I'm a Check_MK noob). Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-06-16T17:01:03Z

Hank, I come back to not using -ti01 for general testing, I suppose. It would be good to have a system test machine that is somewhat isolated. I don't know whether you're using the same db as the system tests (I think not, from what you've said before), but this would happen when two instances both need an exclusive lock for destructive changes, I think. Would have to look at that prefix code to be sure.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-06-16T17:03:54Z

But perhaps it is #74427. Looks like that was never adequately reproduced, so hard to be sure. Should be easy enough to confirm that no other instances were running, though (including that you didn't start two instances running).

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T17:04:14Z

This may be the first time (besides the original testing) where I've seen this message followed by, perhaps, a successful reconnection: @2021-06-16T15:59:40.341+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@48980632@

It looks like multiple threads attempted the re-acquisition of the lock on that source, which looks like a bug at first glance:

2021-06-16T15:59:40.373+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.378+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.384+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.389+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.395+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.

It should only be one.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T17:04:39Z

Can you paste the logfile which shows the thread names, at least for that portion?

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T17:08:59Z

FYI... The script that is running the HEFS evaluations moved on after the error was reported, so the database will not look now like what it looked when the problem occurred.

Jesse: What logs are you referring to? I've shared the WRES log pasted to stdout. Are you looking for the one written to a file under ~/wres_logs? If so, I thought that would look the same as stdout.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T17:11:34Z

My HEFS evaluation script is done. If you want to attempt the evaluation again to see if it happens, let me know. If you'd rather look at something in the database, first, let me know.

Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T17:12:12Z

Oh, and, James, this is run on the same server as the system tests, but uses a different database.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T18:11:45Z

I'm going to try to do the same run, again, to see what happens,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T18:20:29Z

2021-06-16T15:54:13.338+0000 44862 [main] INFO wres.io.reading.SourceLoader - Parsing the declared datasets. Depending on many factors (including dataset size, dataset design, data service implementation, service availability, network bandwidth, network latency, storage bandwidth, storage latency, concurrent evaluations on shared resources, concurrent computation on shared resources) this can take a while...
2021-06-16T15:59:40.341+0000  [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@48980632
2021-06-16T15:59:40.373+0000  [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.378+0000  [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.384+0000  [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.389+0000  [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.395+0000  [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.408+0000  [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Exception while managing connections:
wres.system.DatabaseLockFailed: Another WRES instance is performing a conflicting function. Failed to lock|unlock with prefix=2, lockName=-3156226, operation=LOCK_EXCLUSIVE

The log file has more information: the thread name and full package/class name.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T18:21:37Z

Got it. I'll have to extract the from the otherwise quite long log,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T18:24:10Z

Actually, its not that big, so I just attached the whole thing. Just look for "NCK". The end of the log has part of a run that is on-going.

Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T18:24:49Z

It tried 5 times, waiting 5ms between each try, to re-acquire the supposedly lost advisory lock on a new connection 2, but that failed. Could the original connection not have actually been lost? If it might have not actually been lost, before getting a new connection, calling close on the original might be prudent.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T18:25:16Z

Oh, I already got it from your logs (edit: and pasted the relevant portion) above, but yes, the full log -paste- attachment is helpful. Thanks!

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T18:30:41Z

My second attempt at that specific evaluation went through without a hitch. Really wish I had been able to stop the runs when the problem occurred. Unfortunately, it moved on to the next clean-execute sequence immediately. Oh well.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-17T16:03:17Z

From the dev call... It is likely a bug where something protecting the database from corruption could have been overzealous. It should have been able to recover, but didn't.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-20T21:10:51Z

Unfortunately this was able to be tested when I had VirtualBox and now it cannot be tested without, so any potential improvements will be speculative.

NOAA-OWP / wres

As a user, I don't expect to see an exception indicating another WRES instance is performing a conflicting function when my instance is the only instance running #261