epag opened 3 weeks ago
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T16:56:20Z
Log is attached. I ran this on the -ti01 against the wres6 database on nwcal-wresdb-dev01. It was a run immediately after a clean-database.
I'm going to start by looking at Check_MK to see if it tells me anything (though I'm a Check_MK noob). Thanks,
Hank
Original Redmine Comment Author Name: James (James) Original Date: 2021-06-16T17:01:03Z
Hank, I come back to not using -ti01 for general testing, I suppose. It would be good to have a system test machine that is somewhat isolated. I don't know whether you're using the same db as the system tests (I think not, from what you've said before), but this would happen when two instances both need an exclusive lock for destructive changes, I think. Would have to look at that prefix code to be sure.
Original Redmine Comment Author Name: James (James) Original Date: 2021-06-16T17:03:54Z
But perhaps it is #74427. Looks like that was never adequately reproduced, so hard to be sure. Should be easy enough to confirm that no other instances were running, though (including that you didn't start two instances running).
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T17:04:14Z
This may be the first time (besides the original testing) where I've seen this message followed by, perhaps, a successful reconnection:
2021-06-16T15:59:40.341+0000 WARN DatabaseLockManagerPostgres About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@48980632
It looks like multiple threads attempted the re-acquisition of the lock on that source, which looks like a bug at first glance:
2021-06-16T15:59:40.373+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.378+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.384+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.389+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.395+0000 WARN DatabaseLockManagerPostgres Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
There should be only one.
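If multiple threads really do race to restore the same lock, one common remedy is a compare-and-set guard. This is a hypothetical sketch, not the actual WRES code (the class and field names here are invented): a one-shot flag ensures only the first thread that observes the lost connection performs the re-acquisition, while the others skip it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: guard lock re-acquisition so exactly one thread
// performs it per lost-connection event.
public class ReacquisitionGuard
{
    private final AtomicBoolean restoring = new AtomicBoolean( false );
    private final AtomicInteger retriesStarted = new AtomicInteger( 0 );

    /** Called by any thread that notices the connection was lost. */
    public void onConnectionLost()
    {
        // compareAndSet succeeds for exactly one caller; the rest see
        // the flag already set and return immediately.
        if ( restoring.compareAndSet( false, true ) )
        {
            retriesStarted.incrementAndGet();
            // ... re-acquire the advisory lock on a new connection ...
        }
    }

    public int retriesStarted()
    {
        return retriesStarted.get();
    }
}
```

With this pattern, five concurrent callers would produce one re-acquisition attempt rather than the five WARN lines seen above.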
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T17:04:39Z
Can you paste the logfile which shows the thread names, at least for that portion?
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T17:08:59Z
FYI... The script that is running the HEFS evaluations moved on after the error was reported, so the database will no longer look the way it did when the problem occurred.
Jesse: What logs are you referring to? I've shared the WRES log pasted to stdout. Are you looking for the one written to a file under ~/wres_logs? If so, I thought that would look the same as stdout.
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T17:11:34Z
My HEFS evaluation script is done. If you want me to attempt the evaluation again to see if it recurs, let me know. If you'd rather look at something in the database first, let me know.
Thanks,
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T17:12:12Z
Oh, and, James, this is run on the same server as the system tests, but uses a different database.
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T18:11:45Z
I'm going to try the same run again to see what happens,
Hank
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T18:20:29Z
2021-06-16T15:54:13.338+0000 44862 [main] INFO wres.io.reading.SourceLoader - Parsing the declared datasets. Depending on many factors (including dataset size, dataset design, data service implementation, service availability, network bandwidth, network latency, storage bandwidth, storage latency, concurrent evaluations on shared resources, concurrent computation on shared resources) this can take a while...
2021-06-16T15:59:40.341+0000 [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - About to restore a connection to database (2), lost org.postgresql.jdbc.PgConnection@48980632
2021-06-16T15:59:40.373+0000 [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.378+0000 [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.384+0000 [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.389+0000 [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.395+0000 [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Re-attempting to acquire source lock 3156226 on connection 2 in 5ms.
2021-06-16T15:59:40.408+0000 [DatabaseLockManager 1] WARN wres.system.DatabaseLockManagerPostgres - Exception while managing connections:
wres.system.DatabaseLockFailed: Another WRES instance is performing a conflicting function. Failed to lock|unlock with prefix=2, lockName=-3156226, operation=LOCK_EXCLUSIVE
The log file has more information: the thread name and full package/class name.
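The prefix=2, lockName=-3156226 pair in that error message maps naturally onto PostgreSQL's two-key advisory lock functions, pg_try_advisory_lock(int, int) and pg_advisory_unlock(int, int), which scope a session-level lock by two 32-bit keys. Whether WRES builds its SQL exactly this way is an assumption; this sketch just illustrates the two-key form implied by the message.

```java
// Hypothetical sketch of the two-key advisory lock SQL implied by the
// error message ( prefix, lockName ). PostgreSQL holds these locks per
// session, which is why a conflicting session produces the failure above.
public class AdvisoryLockSql
{
    public static String tryLock( int prefix, int lockName )
    {
        return "SELECT pg_try_advisory_lock( " + prefix + ", " + lockName + " )";
    }

    public static String unlock( int prefix, int lockName )
    {
        return "SELECT pg_advisory_unlock( " + prefix + ", " + lockName + " )";
    }
}
```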
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T18:21:37Z
Got it. I'll have to extract that from the otherwise quite long log,
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T18:24:10Z
Actually, it's not that big, so I just attached the whole thing. Just look for "NCK". The end of the log has part of a run that is ongoing.
Thanks,
Hank
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T18:24:49Z
It tried 5 times, waiting 5ms between tries, to re-acquire the supposedly lost advisory lock on a new connection 2, but that failed. Could the original connection have not actually been lost? If it might not actually have been lost, calling close on the original before getting a new connection might be prudent.
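The suggestion above matters because PostgreSQL advisory locks are held per session: if the old connection is only half-dead, its session may still hold the lock and block re-acquisition on the new connection. This is a hypothetical sketch, not the actual WRES code (the class name is invented): verify the connection really is gone with JDBC's Connection.isValid, and close it either way before opening a replacement.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical sketch: retire a connection presumed lost, closing it
// so the server releases any session-level advisory locks it holds.
public class ConnectionRetirer
{
    /** Closes the old connection and reports whether it was actually lost. */
    public static boolean retire( Connection old )
    {
        boolean stillValid = false;
        try
        {
            // isValid pings the server, waiting up to 2 seconds.
            stillValid = old.isValid( 2 );
        }
        catch ( SQLException e )
        {
            // A failed validity check counts as lost.
        }
        finally
        {
            try
            {
                old.close(); // Releases any session-level advisory locks.
            }
            catch ( SQLException e )
            {
                // Best effort: the server cleans up on disconnect.
            }
        }
        return !stillValid;
    }
}
```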
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-06-16T18:25:16Z
Oh, I already got it from your logs (edit: and pasted the relevant portion) above, but yes, the full log ~~paste~~ attachment is helpful. Thanks!
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-16T18:30:41Z
My second attempt at that specific evaluation went through without a hitch. Really wish I had been able to stop the runs when the problem occurred. Unfortunately, it moved on to the next clean-execute sequence immediately. Oh well.
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-06-17T16:03:17Z
From the dev call... It is likely a bug: something protecting the database from corruption may have been overzealous. It should have been able to recover, but didn't.
Hank
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-20T21:10:51Z
Unfortunately, this could be tested when I had VirtualBox, but now it cannot be tested without it, so any potential improvements will be speculative.
Author Name: Hank (Hank) Original Redmine Issue: 93204, https://vlab.noaa.gov/redmine/issues/93204 Original Date: 2021-06-16 Original Assignee: Hank
Hesitant to report this, as it may be a product of my executing the software in stand-alone mode on the -ti01. Here is the declaration:
Here is the pertinent part of the log:
I'll post the complete stdout log in the first comment. I'm wondering if a hiccup on the database side caused this; I'm just not sure how to check. I'll see what I can find.
Thanks,
Hank