Open epag opened 3 months ago
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T15:33:18Z
Jesse wrote:
In @MetricProcessorByTimeEnsemblePairs.validateCategoricalState()@ in commit:9251a4c5ef3cdfb81422d9d4077417f0733e1820 there is the introduction of a loop, I wonder if some of those calls inside the loop are doing allocation.
Edit: this does not look like it shows up in any of the thread dumps or method profiles, probably ain't it.
That only happens once per evaluation (well, strictly, once per @metrics@ declaration edit: and perhaps the wpod evaluation has more than one @metrics@ declaration but, either way, a very small number of times, typically once).
In terms of raw cpu, there is some action in searching the map of thresholds, but it doesn't seem to factor w/r to fraction of samples taken, overall (in other words, when it happens it uses a lot of cpu, but it doesn't represent a high percentage of the total, so probably not an issue).
I think the issue here is memory and stop-the-world gc, that is clear from the heap and gc profile and all the screeching noises in the automated analysis results.
edit: and the main thread completing almost everything by the end, not the pooling threads.
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T15:34:15Z
Jesse wrote:
I'm looking at the latest complete jfr (not the total jfr) so my method profile list starts with @Object java.util.HashMap.putVal(int, Object, Object, boolean, boolean) 388 12.419974391805377 %@
I think it's better to look at the merged jfr to get a complete picture (edit: or at least a merged subsample of the evaluation phase). Any one chunk could be biased by one thing or another, although the chunk sizes are not extremely small.
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T15:37:54Z
One thing I cannot seem to get is a sample of live objects. I would like to see how the heap is composed. Still learning this whole jfr interface, not proficient yet.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T15:45:36Z
@ThresholdsByMetric@ appears frequently in the stack traces when looking over the top methods that are doing allocations e.g. @HashMap.putVal@, and I see the two methods @getMetrics()@ in @ThresholdsByMetric@ do allocations. So I'm leaning toward "it's a bunch of metrics times a bunch of thresholds" being one of the sources of heap pressure.
I also wonder if the netCDF reader objects were garbage collected at this point.
I also wonder how much allocation is going on in the netCDF writer, but on first glance it seems pretty minimal compared to all the other stuff.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T15:54:43Z
We don't have a JDK in the image, so we don't have a quick/easy way to get a heap dump from a running WRES process. I will either induce one by downloading a Zulu JDK directly into the container, or get one using the Zulu JDK after downloading it into the container.
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T15:56:14Z
There are some efficiencies to be had there (such as not rebuilding the -map- set every call), but CPU usage is mostly around 50% throughout the evaluation phase, much lower than the ingest phase.
Big picture, I think the issue is over-allocation due to the feature-batching, which is resulting in a bunch of pools completing very slowly and that work getting punted off to the main thread under the caller runs policy, resulting in an expected overall slow-down. And it never quite oomes.
I would like to see exactly what is on the heap at given instants, but I'm not sure how to get that info, seems like it isn't there.
I think it would also help to have a baseline without feature-batching to rule out other changes related to v5.16.
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T15:56:33Z
Jesse wrote:
We don't have a JDK in the image, so we don't have a quick/easy way to get a heap dump from a running WRES process. I will either induce one by downloading a Zulu JDK directly into the container, or get one using the Zulu JDK after downloading it into the container.
Good, thanks, that will help.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T16:03:16Z
Jesse wrote:
Hank wrote:
There is no jmc on the path in -ti03. Ugh. Do we have a wiki somewhere explaining how to make use of those .jfr files? Looking,
https://lmddgtfy.net/?q=how%20to%20open%20a%20jfr%20file
There are many ways to use 'em. I use Zulu Mission Control mostly. But VisualVM should work too. And others.
It was a joke, Hank. Yes, we should have pointers in the wiki to ZMC or VisualVM or whatever.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T16:07:31Z
Hmm, getting a read-only filesystem error despite being able to touch a file on that filesystem (and after downloading successfully and running successfully executables from that downloaded thing).
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T16:10:22Z
Also @./jps@ returns nothing, so perhaps there is also an incompatibility between the zulu jdk and the redhat jdk. Hmm.
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T16:12:31Z
Jesse wrote:
Jesse wrote:
Hank wrote:
There is no jmc on the path in -ti03. Ugh. Do we have a wiki somewhere explaining how to make use of those .jfr files? Looking,
https://lmddgtfy.net/?q=how%20to%20open%20a%20jfr%20file
There are many ways to use 'em. I use Zulu Mission Control mostly. But VisualVM should work too. And others.
It was a joke, Hank. Yes, we should have pointers in the wiki to ZMC or VisualVM or whatever.
Its hard to joke with me when I'm barely even paying attention. :) I saw you enter the conversation and used that as an opportunity to focus on other work.
Hank
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T16:15:44Z
Hank wrote:
Its hard to joke with me when I'm barely even paying attention. :) I saw you enter the conversation and used that as an opportunity to focus on other work.
Phew. Just making sure!
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T16:39:19Z
Regardless of tool and regardless of the target filesystem for the hprof, the attach mechanism is blocked by read-only filesystem on the root of the container (edit: or something like that), see https://hg.openjdk.java.net/jdk-updates/jdk11u/file/ebc6c83bfa9e/src/jdk.attach/linux/classes/sun/tools/attach/VirtualMachineImpl.java#l283
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T16:49:58Z
Ok, looks like I may have induced one anyway:
2021-12-14T14:40:36.405+0000 INFO PoolReporter [12419/76960] Completed statistics for a pool in feature group '05317200-4142704'. The time window was: TimeWindowOuter[earliestReferenceTime=2021-05-09T00:00:00Z,latestReferenceTime=2021-07-08T00:00:00Z,earliestValidTime=2021-05-09T00:00:00Z,latestValidTime=2021-07-08T00:00:00Z,earliestLeadDuration=PT108H,latestLeadDuration=PT111H].
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /mnt/wres_share/heap_dumps/wres/java_pid2313.hprof ...
2021-12-14T14:55:27.230+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m2s146ms863µs470ns).
Heap dump file created [4123125335 bytes in 91.479 secs]
#
1. java.lang.OutOfMemoryError: Java heap space
1. -XX:OnOutOfMemoryError="mv /mnt/wres_share/heap_dumps/wres/java_pid%p.hprof /mnt/wres_share/heap_dumps/wres/java_pid%p_`hostname`.hprof; chmod 775 /mnt/wres_share/heap_dumps/wres/java_pid%p_`hostname`.hprof"
1. Executing /bin/sh -c "mv /mnt/wres_share/heap_dumps/wres/java_pid2313.hprof /mnt/wres_share/heap_dumps/wres/java_pid2313_`hostname`.hprof"...
1. Executing /bin/sh -c "chmod 775 /mnt/wres_share/heap_dumps/wres/java_pid2313_`hostname`.hprof"...
Looking for it...
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T16:52:31Z
Yes, it probably happened naturally - I wondered if it would (before we lost patience).
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T16:55:52Z
While awaiting that, I am making some minor improvements to @ThresholdsByMetric@ to avoid repeatedly creating things that are already cached, else adding a cache on construction for the very small things (like the union of metrics). This can go into 5.17. Will connect to this ticket since the analysis was done here. Don't think it is a major part of what we are seeing, but it's worth fixing.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T17:03:27Z
Gzipped it, got it, uploading to the same google drive folder.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T17:14:46Z
!pid2313_top_objects.png!
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T17:26:41Z
Thanks, looking...
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T17:49:56Z
I only see a handful of netCDF reading classes (7) left on the heap, so that's good.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T17:54:06Z
I see most of the @TreeMap@ instances are for reference times. Perhaps a bunch of those can become @EnumMap@ for efficiency.
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T17:54:33Z
I think it's pretty clear that my boe calculation in #95867-39 is wrong somehow. I see around 260 MiB of @double[]@ alone on the heap, which is the most rudimentary abstraction of the time-series (ensemble) values. But, how it is wrong, I am not yet sure...
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T17:59:45Z
Jesse wrote:
I see most of the @TreeMap@ instances are for reference times. Perhaps a bunch of those can become @EnumMap@ for efficiency.
Yes, we should do that. This is usually flagged by sq when building a map explicitly for an enum type (edit: that isn't an @EnumMap@), so we are probably constructing it with some helper whose jdk implementation is not easily visible to sq, otherwise we probably would've fixed before now.
edit: an even if it's not being done consistently before being supplied to the @TimeSeriesMetadata@, we can ensure it is done upon construction of the metadata.
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T18:03:56Z
Dominators/retained sizes still calculating, but I am not seeing anything unexpected here, except for the amount of time-series data (all of the various abstractions of it) on the heap. So, again, my earlier calculation was off, but I am not sure how. edit: I mean, yes, there is a lot of fluff on top of the raw @double[]@, but that isn't what I mean. I was not expecting 260 MiB of @double[]@ on the heap, regardless of all the other crap that goes along with it.
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T18:09:45Z
There are 3,456,184 @double[]@ on the heap. So how do we get to that value? According to #95867-39, I estimated 1,512,000 / 7 = 216,000. Order of magnitude difference.
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T18:14:00Z
Hank, would you mind posting the declaration for #99040-223, absent the features edit: if they are listed explicitly (sorry if I missed it)?
Thanks!
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T18:15:32Z
My paste of screenshot includes non-live objects, so a lot of them may be eligible for gc, this could explain it.
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T18:41:17Z
The declaration:
<?xml version='1.0' encoding='UTF-8'?>
<project name="WPOD NWM Medium Range Ensemble Mean vs USGS">
<inputs>
<left label="USGS">
<type>observations</type>
<source interface="usgs_nwis">https://nwis.waterservices.usgs.gov/nwis/iv</source>
<variable>00060</variable>
</left>
<!-- <right label="NWM MR ensemble forecast">
<type>ensemble forecasts</type>
<source interface="wrds_nwm">https://nwcal-wrds.[host]/api/nwm2.1/v2.0/ops/medium_range</source>
<variable>streamflow</variable>
<urlParameter>
<name>lead_time</name>
<value>[1,204]</value>
</urlParameter>
</right>
-->
<right label="NWM Medium Range Ensemble">
<type>ensemble forecasts</type>
<source interface="nwm_medium_range_ensemble_channel_rt_conus">https://nwcal-dstore.[host]/nwm/2.1/</source>
<variable>streamflow</variable>
</right>
</inputs>
<pair label="Pairing for 60 days">
<unit>CFS</unit>
<feature left="01013500" right="724696"/>
...
<feature left="402114105350101" right="13584"/>
<leadHours minimum="0" maximum="240"/>
<dates earliest="2021-05-09T00:00:00Z" latest="2021-07-08T00:00:00Z" />
<issuedDates earliest="2021-05-09T00:00:00Z" latest="2021-07-08T00:00:00Z" />
<leadTimesPoolingWindow>
<period>3</period>
<unit>hours</unit>
</leadTimesPoolingWindow>
</pair>
<metrics>
<thresholds>
<type>value</type>
<applyTo>left</applyTo>
<source format="csv" missingValue="-999.0">/home/ISED/wres/Inputs/nwm_v2.0_1.5yr_recurrence_thresholds.csv</source>
<operator>greater than</operator>
</thresholds>
<thresholds>
<type>value</type>
<applyTo>left</applyTo>
<source format="csv" missingValue="-999.0">/home/ISED/wres/Inputs/thresholdsPercentile.csv</source>
<operator>greater than</operator>
</thresholds>
<thresholds>
<type>value</type>
<applyTo>left</applyTo>
<source format="csv" missingValue="-999.0">/home/ISED/wres/Inputs/thresholdsNWSAction.csv</source>
<operator>greater than</operator>
</thresholds>
<thresholds>
<type>value</type>
<applyTo>left</applyTo>
<source format="csv" missingValue="-999.0">/home/ISED/wres/Inputs/thresholdsNWSMinor.csv</source>
<operator>greater than</operator>
</thresholds>
<thresholds>
<type>value</type>
<applyTo>left</applyTo>
<source format="csv" missingValue="-999.0">/home/ISED/wres/Inputs/thresholdsNWSModerate.csv</source>
<operator>greater than</operator>
</thresholds>
<thresholds>
<type>value</type>
<applyTo>left</applyTo>
<source format="csv" missingValue="-999.0">/home/ISED/wres/Inputs/thresholdsNWSMajor.csv</source>
<operator>greater than</operator>
</thresholds>
<!-- try James threshold suggestion but just for 0.5 -->
<thresholds>
<type>probability classifier</type>
<commaSeparatedValues>0.5</commaSeparatedValues>
</thresholds>
<!-- Metrics are invidiually listed in order to avoid those that
can't be included in NetCDF and aren't output as images.
In other words, those that WRES "drops on the floor".-->
<metric><name>box plot of errors</name></metric>
<metric><name>box plot of percentage errors</name></metric>
<metric><name>bias fraction</name></metric>
<metric><name>brier score</name></metric>
<metric><name>brier skill score</name></metric>
<metric><name>coefficient of determination</name></metric>
<metric><name>contingency table</name></metric>
<metric><name>continuous ranked probability score</name></metric>
<metric><name>equitable threat score</name></metric>
<metric><name>frequency bias</name></metric>
<metric><name>index of agreement</name></metric>
<metric><name>kling gupta efficiency</name></metric>
<metric><name>mean absolute error</name></metric>
<metric><name>mean error</name></metric>
<metric><name>mean square error skill score normalized</name></metric>
<metric><name>mean square error skill score</name></metric>
<metric><name>mean square error</name></metric>
<metric><name>mean</name></metric>
<metric><name>median error</name></metric>
<metric><name>pearson correlation coefficient</name></metric>
<metric><name>peirce skill score</name></metric>
<metric><name>probability of detection</name></metric>
<metric><name>probability of false detection</name></metric>
<metric><name>relative operating characteristic score</name></metric>
<metric><name>root mean square error normalized</name></metric>
<metric><name>root mean square error</name></metric>
<metric><name>sample size</name></metric>
<metric><name>standard deviation</name></metric>
<metric><name>sum of square error</name></metric>
<metric><name>threat score</name></metric>
<metric><name>volumetric efficiency</name></metric>
</metrics>
<outputs durationFormat="hours">
<!-- Because they are not scores, the NetCDF will not include box plots.-->
<destination type="netcdf2">
<outputType>lead threshold</outputType>
<!-- <netcdf vectorVariable="feature_id"/> -->
</destination>
<!-- Only the box-plots are included as images.-->
<!-- Activate this XML when 4.1 is released by uncommenting this section. -->
<destination type="graphic">
<graphical>
<suppressMetric>bias fraction</suppressMetric>
<suppressMetric>brier score</suppressMetric>
<suppressMetric>brier skill score</suppressMetric>
<suppressMetric>coefficient of determination</suppressMetric>
<suppressMetric>contingency table</suppressMetric>
<suppressMetric>continuous ranked probability score</suppressMetric>
<suppressMetric>equitable threat score</suppressMetric>
<suppressMetric>frequency bias</suppressMetric>
<suppressMetric>index of agreement</suppressMetric>
<suppressMetric>kling gupta efficiency</suppressMetric>
<suppressMetric>mean absolute error</suppressMetric>
<suppressMetric>mean error</suppressMetric>
<!-- <suppressMetric>mean square error skill score normalized</suppressMetric> -->
<suppressMetric>mean square error skill score</suppressMetric>
<suppressMetric>mean square error</suppressMetric>
<suppressMetric>mean</suppressMetric>
<suppressMetric>median error</suppressMetric>
<suppressMetric>pearson correlation coefficient</suppressMetric>
<suppressMetric>peirce skill score</suppressMetric>
<suppressMetric>probability of detection</suppressMetric>
<suppressMetric>probability of false detection</suppressMetric>
<suppressMetric>relative operating characteristic score</suppressMetric>
<suppressMetric>root mean square error normalized</suppressMetric>
<suppressMetric>root mean square error</suppressMetric>
<!-- <suppressMetric>sample size</suppressMetric> -->
<suppressMetric>standard deviation</suppressMetric>
<suppressMetric>sum of square error</suppressMetric>
<suppressMetric>threat score</suppressMetric>
<suppressMetric>volumetric efficiency</suppressMetric>
</graphical>
</destination>
</outputs>
</project>
</code>
Hank
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T19:31:41Z
Thanks, Hank. edit: the declaration looks as I expected, no surprises.
A view of the objects by retained size helps.
!heap_retained_sizes.png!
I think one issue is as follows. For efficiency, we acquire the left-ish data once for all pools that belong to the same feature and then cache/re-use it. This applies whether or not we are using climatological data. There are 80 pools per feature in this evaluation and we process 50 features at a time across each of 6 pooling threads. Thus we have up to 50 (features) * 6 (threads) = 300 sets of cached observations and they hang around until all 80 pools are done for those 50 features, so 4,000 pools. Each of these cached set of observations will cover around 60 days. This is why there are so many caching suppliers on the heap that contain references to left-ish data. In other words, things snowball when you combine feature batching with caching of observations and there are a large number of pools per feature. The observations are @Double@, not @double@.
However, that doesn't explain everything. It doesn't explain why there are so many @double[]@ (ensemble forecasts) on the heap.
At this stage, I think it would be good to get a baseline with a feature batch size of 1 because that will take a while to run and we have a lot of instrumentation to help with the dead/oomed evaluation.
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T19:47:45Z
James,
I can try to make that happen, though I have a meeting at 3 that may interfere.
Can you provide instructions for the easiest way to get that batch-size-1 run going? Is there an alternative to taking down the service, adding a property, and restarting it?
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T19:51:53Z
Is it, @-Dwres.feature_batch_size=1@ that needs to be added?
Hank
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T19:54:11Z
Thanks. No (edit: in answer to your Q about whether you can avoid restarting), this is a service configuration option, either via the @wresconfig.xml@ or, more straightforwardly, by overriding that option in the compose file.
You would need to set the override as part of the @INNER_JAVA_OPTS@ on the worker service, like this: @-Dwres.featureBatchSize=1@.
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T19:56:29Z
Jesse:
I hate to push off my work, but can you make the change to the COWRES? I'll start the run immediately after my 3pm EST meeting.
Thanks,
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T20:04:39Z
Nevermind. The meeting has been cancelled. Let me see if I can make it before I leave in an hour,
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T20:09:33Z
I still see the original evaluation running. Is bringing down the worker containers a sufficiently clean way to kill an on-going evaluation?
Hank
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T20:10:09Z
Hank wrote:
I still see the original evaluation running. Is bring down the worker containers a sufficiently clean way to kill an on-going evaluation?
Hank
I think so. Though this is another issue: why didn't it die? Sigh.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T20:11:09Z
I can also make the change if you want, though I'm making some lunch and I have some gentlemen removing trees from our yard so I'm back and forth.
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T20:15:25Z
Jesse:
I can do it; I have nothing else to do until 4. Also, I think I can cycle the containers pretty quickly, so I can wait a few minutes if you want to take a look. Just let me know.
The job is https://nwcal-wres-ti.[host]/job/4585379827107895713. For posterity, here is the bottom of the stdout logging:
2021-12-14T14:40:36.405+0000 INFO PoolReporter [12419/76960] Completed statistics for a pool in feature group '05317200-4142704'. The time window was: TimeWindowOuter[earliestReferenceTime=2021-05-09T00:00:00Z,latestReferenceTime=2021-07-08T00:00:00Z,earliestValidTime=2021-05-09T00:00:00Z,latestValidTime=2021-07-08T00:00:00Z,earliestLeadDuration=PT108H,latestLeadDuration=PT111H].
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /mnt/wres_share/heap_dumps/wres/java_pid2313.hprof ...
2021-12-14T14:55:27.230+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m2s146ms863µs470ns).
Heap dump file created [4123125335 bytes in 91.479 secs]
#
1. java.lang.OutOfMemoryError: Java heap space
1. -XX:OnOutOfMemoryError="mv /mnt/wres_share/heap_dumps/wres/java_pid%p.hprof /mnt/wres_share/heap_dumps/wres/java_pid%p_`hostname`.hprof; chmod 775 /mnt/wres_share/heap_dumps/wres/java_pid%p_`hostname`.hprof"
1. Executing /bin/sh -c "mv /mnt/wres_share/heap_dumps/wres/java_pid2313.hprof /mnt/wres_share/heap_dumps/wres/java_pid2313_`hostname`.hprof"...
1. Executing /bin/sh -c "chmod 775 /mnt/wres_share/heap_dumps/wres/java_pid2313_`hostname`.hprof"...
2021-12-14T14:55:27.230+0000 WARN HikariPool HikariPool-2 - Thread starvation or clock leap detected (housekeeper delta=53s50ms80µs199ns).
2021-12-14T14:56:42.567+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=2m9s509ms611µs138ns).
2021-12-14T14:57:29.806+0000 WARN HikariPool HikariPool-2 - Thread starvation or clock leap detected (housekeeper delta=2m11s61ms136µs777ns).
2021-12-14T14:57:29.816+0000 ERROR Main The WRES encountered an uncaught exception in thread Thread[SubscriberInactivityTimer-1,5,main].
java.lang.OutOfMemoryError: Java heap space
at java.base/java.time.Instant.create(Instant.java:413)
at java.base/java.time.Instant.ofEpochSecond(Instant.java:330)
at java.base/java.time.Clock$SystemClock.instant(Clock.java:558)
at java.base/java.time.Instant.now(Instant.java:273)
at wres.events.EvaluationStatusTracker$1.run(EvaluationStatusTracker.java:603)
at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
at java.base/java.util.TimerThread.run(Timer.java:506)
2021-12-14T14:58:04.139+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m41s465ms159µs34ns).
2021-12-14T15:10:58.992+0000 INFO PoolReporter [12420/76960] Completed statistics for a pool in feature group '05508805-4867221'. The time window was: TimeWindowOuter[earliestReferenceTime=2021-05-09T00:00:00Z,latestReferenceTime=2021-07-08T00:00:00Z,earliestValidTime=2021-05-09T00:00:00Z,latestValidTime=2021-07-08T00:00:00Z,earliestLeadDuration=PT96H,latestLeadDuration=PT99H].
2021-12-14T15:26:03.064+0000 WARN ProxyConnection HikariPool-1 - Connection org.postgresql.jdbc.PgConnection@321a922b marked as broken because of SQLSTATE(08006), ErrorCode(0)
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
at org.postgresql.core.v3.QueryExecutorImpl.fetch(QueryExecutorImpl.java:2565)
at org.postgresql.jdbc.PgResultSet.next(PgResultSet.java:2077)
at com.zaxxer.hikari.pool.HikariProxyResultSet.next(HikariProxyResultSet.java)
at com.github.marschall.jfr.jdbc.JfrResultSet.next(JfrResultSet.java:82)
at wres.io.utilities.SQLDataProvider.next(SQLDataProvider.java:138)
at wres.io.retrieval.TimeSeriesRetriever.getTimeSeriesFromScript(TimeSeriesRetriever.java:254)
at wres.io.retrieval.ObservationRetriever.get(ObservationRetriever.java:122)
at wres.io.retrieval.ObservationRetriever.get(ObservationRetriever.java:23)
at wres.io.retrieval.CachingRetriever.lambda$new$0(CachingRetriever.java:56)
at wres.io.retrieval.CachingSupplier.get(CachingSupplier.java:51)
at wres.io.retrieval.CachingRetriever.get(CachingRetriever.java:41)
at wres.io.retrieval.CachingRetriever.get(CachingRetriever.java:18)
at wres.io.pooling.PoolSupplier.createPool(PoolSupplier.java:281)
at wres.io.pooling.PoolSupplier.get(PoolSupplier.java:254)
at wres.io.pooling.PoolSupplier.get(PoolSupplier.java:83)
at wres.io.retrieval.CachingSupplier.get(CachingSupplier.java:51)
at wres.io.pooling.PoolFactory.lambda$decompose$17(PoolFactory.java:1100)
at wres.pipeline.PoolProcessor.get(PoolProcessor.java:237)
at wres.pipeline.PoolProcessor.get(PoolProcessor.java:36)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Unexpected packet type: 9
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2516)
at org.postgresql.core.v3.QueryExecutorImpl.fetch(QueryExecutorImpl.java:2560)
... 22 common frames omitted
2021-12-14T15:26:03.072+0000 WARN SQLDataProvider Unable to set connection com.github.marschall.jfr.jdbc.JfrConnection@55ee4f41 to autocommit=true
java.sql.SQLException: Connection is closed
at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:515)
at com.sun.proxy.$Proxy44.setAutoCommit(Unknown Source)
at com.zaxxer.hikari.pool.ProxyConnection.setAutoCommit(ProxyConnection.java:414)
at com.zaxxer.hikari.pool.HikariProxyConnection.setAutoCommit(HikariProxyConnection.java)
at com.github.marschall.jfr.jdbc.JfrConnection.setAutoCommit(JfrConnection.java:275)
at wres.io.utilities.SQLDataProvider.close(SQLDataProvider.java:883)
at wres.io.retrieval.TimeSeriesRetriever.getTimeSeriesFromScript(TimeSeriesRetriever.java:242)
at wres.io.retrieval.ObservationRetriever.get(ObservationRetriever.java:122)
at wres.io.retrieval.ObservationRetriever.get(ObservationRetriever.java:23)
at wres.io.retrieval.CachingRetriever.lambda$new$0(CachingRetriever.java:56)
at wres.io.retrieval.CachingSupplier.get(CachingSupplier.java:51)
at wres.io.retrieval.CachingRetriever.get(CachingRetriever.java:41)
at wres.io.retrieval.CachingRetriever.get(CachingRetriever.java:18)
at wres.io.pooling.PoolSupplier.createPool(PoolSupplier.java:281)
at wres.io.pooling.PoolSupplier.get(PoolSupplier.java:254)
at wres.io.pooling.PoolSupplier.get(PoolSupplier.java:83)
at wres.io.retrieval.CachingSupplier.get(CachingSupplier.java:51)
at wres.io.pooling.PoolFactory.lambda$decompose$17(PoolFactory.java:1100)
at wres.pipeline.PoolProcessor.get(PoolProcessor.java:237)
at wres.pipeline.PoolProcessor.get(PoolProcessor.java:36)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
2021-12-14T15:37:39.599+0000 INFO PoolReporter [12421/76960] Completed statistics for a pool in feature group '05508805-4867221'. The time window was: TimeWindowOuter[earliestReferenceTime=2021-05-09T00:00:00Z,latestReferenceTime=2021-07-08T00:00:00Z,earliestValidTime=2021-05-09T00:00:00Z,latestValidTime=2021-07-08T00:00:00Z,earliestLeadDuration=PT102H,latestLeadDuration=PT105H].
2021-12-14T15:39:53.310+0000 INFO PoolReporter [12422/76960] Completed statistics for a pool in feature group '05508805-4867221'. The time window was: TimeWindowOuter[earliestReferenceTime=2021-05-09T00:00:00Z,latestReferenceTime=2021-07-08T00:00:00Z,earliestValidTime=2021-05-09T00:00:00Z,latestValidTime=2021-07-08T00:00:00Z,earliestLeadDuration=PT108H,latestLeadDuration=PT111H].
2021-12-14T19:12:28.369+0000 ERROR LoggingUncaughtExceptionHandler Uncaught exception in thread "IoRcvr-/172.19.254.199:42258-eventsbroker/172.19.254.194:5673"
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:12:28.370+0000 ERROR Main The WRES encountered an uncaught exception in thread Thread[IoRcvr-/172.19.254.199:42258-eventsbroker/172.19.254.194:5673,5,main].
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:12:21.288+0000 ERROR LoggingUncaughtExceptionHandler Uncaught exception in thread "IoRcvr-/172.19.254.199:42270-eventsbroker/172.19.254.194:5673"
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:12:36.766+0000 ERROR Main The WRES encountered an uncaught exception in thread Thread[IoRcvr-/172.19.254.199:42270-eventsbroker/172.19.254.194:5673,5,main].
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:12:43.447+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m34s877ms589µs274ns).
2021-12-14T19:12:54.679+0000 ERROR LoggingUncaughtExceptionHandler Uncaught exception in thread "Failover"
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:12:54.679+0000 ERROR Main The WRES encountered an uncaught exception in thread Thread[Failover,5,main].
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:16:37.867+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m12s827ms629µs285ns).
2021-12-14T19:16:37.865+0000 ERROR LoggingUncaughtExceptionHandler Uncaught exception in thread "IoRcvr-/172.19.254.199:50440-eventsbroker/172.19.254.194:5673"
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:16:44.309+0000 ERROR Main The WRES encountered an uncaught exception in thread Thread[IoRcvr-/172.19.254.199:50440-eventsbroker/172.19.254.194:5673,5,main].
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:16:58.625+0000 ERROR LoggingUncaughtExceptionHandler Uncaught exception in thread "IoRcvr-/172.19.254.199:42262-eventsbroker/172.19.254.194:5673"
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:17:05.730+0000 ERROR Main The WRES encountered an uncaught exception in thread Thread[IoRcvr-/172.19.254.199:42262-eventsbroker/172.19.254.194:5673,5,main].
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:18:21.515+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m42s332ms270µs613ns).
2021-12-14T19:18:41.670+0000 ERROR LoggingUncaughtExceptionHandler Uncaught exception in thread "IoRcvr-/172.19.254.199:42254-eventsbroker/172.19.254.194:5673"
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:19:01.910+0000 ERROR Main The WRES encountered an uncaught exception in thread Thread[IoRcvr-/172.19.254.199:42254-eventsbroker/172.19.254.194:5673,5,main].
java.lang.OutOfMemoryError: Java heap space
2021-12-14T19:19:01.910+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m35s921ms60µs584ns).
2021-12-14T19:20:59.077+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=48s319ms253µs907ns).
2021-12-14T19:22:11.355+0000 ERROR EvaluationSubscriber Message subscriber gTLSRyGqvr25RYrfgulGUTTtxLI has been flagged as failed without the possibility of recovery.
wres.events.subscribe.UnrecoverableSubscriberException: Encountered an error on connection sRfp9rKDGwQ3_26wTZGR7Dv4B-k owned by subscriber gTLSRyGqvr25RYrfgulGUTTtxLI. If a failover policy was configured on the connection factory (e.g., connection retries), then that policy was exhausted before this error was thrown. As such, the error is not recoverable and the subscriber will now stop.
at wres.events.subscribe.EvaluationSubscriber$ConnectionExceptionListener.onException(EvaluationSubscriber.java:1772)
at org.apache.qpid.client.AMQConnection$2.run(AMQConnection.java:1686)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: javax.jms.JMSException: Exception thrown against AMQConnection:
Host: eventsbroker
Port: 5673
Virtual Host:
Client ID: wres-core
Active session count: 1: org.apache.qpid.AMQDisconnectedException: Server closed connection and no failover was successful
at org.apache.qpid.client.AMQConnection.convertToJMSException(AMQConnection.java:1627)
at org.apache.qpid.client.AMQConnection.closed(AMQConnection.java:1639)
at org.apache.qpid.client.FailoverHandler.run(FailoverHandler.java:97)
at org.apache.qpid.client.AMQConnection.doWithAllLocks(AMQConnection.java:2023)
at org.apache.qpid.client.AMQConnection.doWithAllLocks(AMQConnection.java:2011)
at org.apache.qpid.client.AMQConnection.doWithAllLocks(AMQConnection.java:1990)
at org.apache.qpid.client.AMQProtocolHandler$1.run(AMQProtocolHandler.java:288)
... 1 common frames omitted
Caused by: org.apache.qpid.AMQDisconnectedException: Server closed connection and no failover was successful
... 6 common frames omitted
2021-12-14T19:22:26.543+0000 WARN EvaluationSubscriber While closing subscriber gTLSRyGqvr25RYrfgulGUTTtxLI, 1 open evaluations were discovered: [cmMUX7EKjwui91d6FvdCeBeDqe8]. These evaluations will not be notified complete. They should be notified before a subscriber is closed.
2021-12-14T19:23:12.582+0000 ERROR BasicMessageConsumer reNotification : Caught exception (dump follows) - ignoring...
javax.jms.IllegalStateException: Invalid Session
at org.apache.qpid.client.BasicMessageConsumer.checkPreConditions(BasicMessageConsumer.java:878)
at org.apache.qpid.client.BasicMessageConsumer.getMessageListener(BasicMessageConsumer.java:230)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
at java.base/java.lang.Thread.run(Thread.java:829)
2021-12-14T19:23:48.842+0000 WARN AMQConnection Error closing session: javax.jms.JMSException: Error closing session: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50436-eventsbroker/172.19.254.194:5673] [error code: 408(Request Timeout)]
2021-12-14T19:23:56.000+0000 WARN AMQConnection Error closing session: javax.jms.JMSException: Error closing session: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50436-eventsbroker/172.19.254.194:5673] [error code: 408(Request Timeout)]
2021-12-14T19:23:56.000+0000 WARN AMQConnection Error closing connection
javax.jms.JMSException: Error closing session: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50436-eventsbroker/172.19.254.194:5673] [error code: 408(Request Timeout)]
at org.apache.qpid.client.AMQSession.close(AMQSession.java:825)
at org.apache.qpid.client.AMQSession.close(AMQSession.java:787)
at org.apache.qpid.client.AMQConnection.closeAllSessions(AMQConnection.java:1333)
at org.apache.qpid.client.AMQConnection.doClose(AMQConnection.java:1265)
at org.apache.qpid.client.AMQConnection.doClose(AMQConnection.java:1252)
at org.apache.qpid.client.AMQConnection.doClose(AMQConnection.java:1252)
at org.apache.qpid.client.AMQConnection.close(AMQConnection.java:1230)
at org.apache.qpid.client.AMQConnection.close(AMQConnection.java:1209)
at wres.eventsbroker.BrokerConnectionFactory$WrappedConnection.close(BrokerConnectionFactory.java:1024)
at wres.events.subscribe.EvaluationSubscriber.close(EvaluationSubscriber.java:392)
at wres.events.subscribe.EvaluationSubscriber$1.run(EvaluationSubscriber.java:1466)
at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
at java.base/java.util.TimerThread.run(Timer.java:506)
Caused by: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50436-eventsbroker/172.19.254.194:5673]
at org.apache.qpid.client.util.BlockingWaiter.block(BlockingWaiter.java:175)
at org.apache.qpid.client.protocol.BlockingMethodFrameListener.blockForFrame(BlockingMethodFrameListener.java:118)
at org.apache.qpid.client.AMQProtocolHandler.writeCommandFrameAndWaitForReply(AMQProtocolHandler.java:654)
at org.apache.qpid.client.AMQProtocolHandler.syncWrite(AMQProtocolHandler.java:675)
at org.apache.qpid.client.AMQSession_0_8.sendClose(AMQSession_0_8.java:253)
at org.apache.qpid.client.AMQSession.close(AMQSession.java:819)
... 12 common frames omitted
2021-12-14T19:24:18.969+0000 WARN EvaluationSubscriber Encountered an error while attempting to close a broker connection within subscriber gTLSRyGqvr25RYrfgulGUTTtxLI.
javax.jms.JMSException: Error closing connection: javax.jms.JMSException: Error closing session: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50436-eventsbroker/172.19.254.194:5673] [error code: 408(Request Timeout)]
at org.apache.qpid.client.AMQConnection.doClose(AMQConnection.java:1270)
at org.apache.qpid.client.AMQConnection.doClose(AMQConnection.java:1252)
at org.apache.qpid.client.AMQConnection.doClose(AMQConnection.java:1252)
at org.apache.qpid.client.AMQConnection.close(AMQConnection.java:1230)
at org.apache.qpid.client.AMQConnection.close(AMQConnection.java:1209)
at wres.eventsbroker.BrokerConnectionFactory$WrappedConnection.close(BrokerConnectionFactory.java:1024)
at wres.events.subscribe.EvaluationSubscriber.close(EvaluationSubscriber.java:392)
at wres.events.subscribe.EvaluationSubscriber$1.run(EvaluationSubscriber.java:1466)
at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
at java.base/java.util.TimerThread.run(Timer.java:506)
Caused by: javax.jms.JMSException: Error closing session: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50436-eventsbroker/172.19.254.194:5673] [error code: 408(Request Timeout)]
at org.apache.qpid.client.AMQSession.close(AMQSession.java:825)
at org.apache.qpid.client.AMQSession.close(AMQSession.java:787)
at org.apache.qpid.client.AMQConnection.closeAllSessions(AMQConnection.java:1333)
at org.apache.qpid.client.AMQConnection.doClose(AMQConnection.java:1265)
... 9 common frames omitted
Caused by: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50436-eventsbroker/172.19.254.194:5673]
at org.apache.qpid.client.util.BlockingWaiter.block(BlockingWaiter.java:175)
at org.apache.qpid.client.protocol.BlockingMethodFrameListener.blockForFrame(BlockingMethodFrameListener.java:118)
at org.apache.qpid.client.AMQProtocolHandler.writeCommandFrameAndWaitForReply(AMQProtocolHandler.java:654)
at org.apache.qpid.client.AMQProtocolHandler.syncWrite(AMQProtocolHandler.java:675)
at org.apache.qpid.client.AMQSession_0_8.sendClose(AMQSession_0_8.java:253)
at org.apache.qpid.client.AMQSession.close(AMQSession.java:819)
... 12 common frames omitted
2021-12-14T19:24:26.076+0000 WARN AMQConnection Error closing session: javax.jms.JMSException: Error closing session: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50444-eventsbroker/172.19.254.194:5673] [error code: 408(Request Timeout)]
2021-12-14T19:24:26.076+0000 WARN AMQConnection Error closing connection
javax.jms.JMSException: Error closing session: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50444-eventsbroker/172.19.254.194:5673] [error code: 408(Request Timeout)]
at org.apache.qpid.client.AMQSession.close(AMQSession.java:825)
at org.apache.qpid.client.AMQSession.close(AMQSession.java:787)
at org.apache.qpid.client.AMQConnection.closeAllSessions(AMQConnection.java:1333)
at org.apache.qpid.client.AMQConnection.doClose(AMQConnection.java:1265)
at org.apache.qpid.client.AMQConnection.doClose(AMQConnection.java:1252)
at org.apache.qpid.client.AMQConnection.close(AMQConnection.java:1230)
at org.apache.qpid.client.AMQConnection.close(AMQConnection.java:1209)
at wres.eventsbroker.BrokerConnectionFactory$WrappedConnection.close(BrokerConnectionFactory.java:1024)
at wres.events.publish.MessagePublisher.close(MessagePublisher.java:259)
at wres.events.subscribe.EvaluationSubscriber.close(EvaluationSubscriber.java:412)
at wres.events.subscribe.EvaluationSubscriber$1.run(EvaluationSubscriber.java:1466)
at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
at java.base/java.util.TimerThread.run(Timer.java:506)
Caused by: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50444-eventsbroker/172.19.254.194:5673]
at org.apache.qpid.client.util.BlockingWaiter.block(BlockingWaiter.java:175)
at org.apache.qpid.client.protocol.BlockingMethodFrameListener.blockForFrame(BlockingMethodFrameListener.java:118)
at org.apache.qpid.client.AMQProtocolHandler.writeCommandFrameAndWaitForReply(AMQProtocolHandler.java:654)
at org.apache.qpid.client.AMQProtocolHandler.syncWrite(AMQProtocolHandler.java:675)
at org.apache.qpid.client.AMQSession_0_8.sendClose(AMQSession_0_8.java:253)
at org.apache.qpid.client.AMQSession.close(AMQSession.java:819)
... 12 common frames omitted
2021-12-14T19:24:26.076+0000 ERROR Main The WRES encountered an uncaught exception in thread Thread[Timer-1,5,main].
wres.events.EvaluationEventException: Failed to publish a message with metadata {JMSCorrelationID=cmMUX7EKjwui91d6FvdCeBeDqe8, JMSMessageID=ID:cmMUX7EKjwui91d6FvdCeBeDqe8-mPhsLA_qP-VzQkkLuArbCX_qt8CE, EvaluationJobID=wres_job_4585379827107895713, PNG=TbBnaXCczjPB-1qqE6qb1XaNwss, NETCDF=gTLSRyGqvr25RYrfgulGUTTtxLI} for evaluation cmMUX7EKjwui91d6FvdCeBeDqe8 to amq.topic/status
at wres.events.Evaluation.internalPublish(Evaluation.java:1518)
at wres.events.Evaluation.notifyAlive(Evaluation.java:1134)
at wres.events.Evaluation$1.run(Evaluation.java:1562)
at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
at java.base/java.util.TimerThread.run(Timer.java:506)
Caused by: wres.events.EvaluationEventException: Failed to publish a message.
at wres.events.publish.MessagePublisher.publish(MessagePublisher.java:315)
at wres.events.Evaluation.internalPublish(Evaluation.java:1504)
... 4 common frames omitted
Caused by: wres.events.publish.UnrecoverablePublisherException: Publisher dU2b8fuCrraxNpvh54YOSgSyUqs failed to send message ID:cmMUX7EKjwui91d6FvdCeBeDqe8-mPhsLA_qP-VzQkkLuArbCX_qt8CE to destination topic://amq.topic/status/tmp_172_19_254_199_42254_1?routingkey='status'&autodelete='true' after 4 retries.
at wres.events.publish.MessagePublisher.internalPublishWithRetriesAndExponentialBackoff(MessagePublisher.java:410)
at wres.events.publish.MessagePublisher.publish(MessagePublisher.java:311)
... 5 common frames omitted
Caused by: javax.jms.JMSException: Exception whilst sending:sender for socket /172.19.254.199:42258-eventsbroker/172.19.254.194:5673 is closed
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3787)
at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:559)
at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:371)
at wres.events.publish.MessagePublisher.internalPublishWithRetriesAndExponentialBackoff(MessagePublisher.java:384)
... 6 common frames omitted
Caused by: org.apache.qpid.transport.SenderClosedException: sender for socket /172.19.254.199:42258-eventsbroker/172.19.254.194:5673 is closed
at org.apache.qpid.transport.network.io.IoSender.checkNotAlreadyClosed(IoSender.java:377)
at org.apache.qpid.transport.network.io.IoSender.send(IoSender.java:122)
at org.apache.qpid.framing.AMQFrame.writePayload(AMQFrame.java:69)
at org.apache.qpid.framing.CompositeAMQDataBlock.writePayload(CompositeAMQDataBlock.java:58)
at org.apache.qpid.client.AMQProtocolHandler.writeFrame(AMQProtocolHandler.java:575)
at org.apache.qpid.client.AMQProtocolHandler.writeFrame(AMQProtocolHandler.java:568)
at org.apache.qpid.client.BasicMessageProducer_0_8.sendMessage(BasicMessageProducer_0_8.java:368)
at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:554)
... 8 common frames omitted
2021-12-14T19:24:26.076+0000 ERROR EvaluationStatusTracker While attempting to consume a message with identifier ID:2523a329-6b58-3274-9a00-0351360c9964 and correlation identifier cmMUX7EKjwui91d6FvdCeBeDqe8 in evaluation status tracker ENnUozMzvbKtmJIGsXravxSMjZw, encountered an error. The session will now attempt to recover. This is 1 of 4 allowed consumption failures before the status tracker will notify an unrecoverable failure for evaluation cmMUX7EKjwui91d6FvdCeBeDqe8, unless the status tracker is otherwise marked failed (in which case further retries may not occur).
javax.jms.JMSException: Exception while acknowledging message(s):sender for socket /172.19.254.199:42254-eventsbroker/172.19.254.194:5673 is closed
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3787)
at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:544)
at org.apache.qpid.client.message.AbstractAMQMessageDelegate.acknowledge(AbstractAMQMessageDelegate.java:219)
at org.apache.qpid.client.message.AbstractJMSMessage.acknowledge(AbstractJMSMessage.java:311)
at wres.events.EvaluationStatusTracker.lambda$registerListenerForConsumer$4(EvaluationStatusTracker.java:1126)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.qpid.transport.SenderClosedException: sender for socket /172.19.254.199:42254-eventsbroker/172.19.254.194:5673 is closed
at org.apache.qpid.transport.network.io.IoSender.checkNotAlreadyClosed(IoSender.java:377)
at org.apache.qpid.transport.network.io.IoSender.send(IoSender.java:122)
at org.apache.qpid.framing.AMQFrame.writePayload(AMQFrame.java:69)
at org.apache.qpid.client.AMQProtocolHandler.writeFrame(AMQProtocolHandler.java:575)
at org.apache.qpid.client.AMQSession_0_8.acknowledgeMessage(AMQSession_0_8.java:172)
at org.apache.qpid.client.AMQSession_0_8.acknowledgeImpl(AMQSession_0_8.java:144)
at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:539)
... 12 common frames omitted
2021-12-14T19:24:47.987+0000 WARN MessagePublisher Encountered an error while attempting to close a connection within message publisher MNcOi2a0tnE7i4buYu6dnGzPua8: Error closing connection: javax.jms.JMSException: Error closing session: org.apache.qpid.AMQTimeoutException: The server's response was not received within the time-out period of 2000 ms. Possible reasons include: the server may be too busy, the network may be overloaded, or this JVM itself may be too busy to process the response. [/172.19.254.199:50444-eventsbroker/172.19.254.194:5673] [error code: 408(Request Timeout)]
2021-12-14T19:25:59.798+0000 ERROR EvaluationStatusTracker While attempting to recover a session for evaluation cmMUX7EKjwui91d6FvdCeBeDqe8 in evaluation status tracker ENnUozMzvbKtmJIGsXravxSMjZw, encountered an error that prevented recovery.
javax.jms.JMSException: Recover failed: Fail-over interrupted suspend/unsuspend channel.
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3821)
at org.apache.qpid.client.AMQSession.recover(AMQSession.java:1914)
at wres.events.EvaluationStatusTracker.recover(EvaluationStatusTracker.java:1258)
at wres.events.EvaluationStatusTracker.lambda$registerListenerForConsumer$4(EvaluationStatusTracker.java:1137)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.qpid.QpidException: Fail-over interrupted suspend/unsuspend channel.
at org.apache.qpid.client.AMQSession.suspendChannel(AMQSession.java:3288)
at org.apache.qpid.client.AMQSession.recover(AMQSession.java:1884)
... 11 common frames omitted
Caused by: org.apache.qpid.client.failover.FailoverException: Failing over about to start
at org.apache.qpid.client.AMQProtocolHandler.notifyFailoverStarting(AMQProtocolHandler.java:437)
at org.apache.qpid.client.AMQProtocolHandler$1.run(AMQProtocolHandler.java:286)
... 1 common frames omitted
2021-12-14T19:26:21.057+0000 ERROR EvaluationStatusTracker While attempting to consume a message with identifier ID:449a17df-0a31-3925-b11a-cfcd4fa7cbcf and correlation identifier cmMUX7EKjwui91d6FvdCeBeDqe8 in evaluation status tracker ENnUozMzvbKtmJIGsXravxSMjZw, encountered an error. The session will now attempt to recover. This is 2 of 4 allowed consumption failures before the status tracker will notify an unrecoverable failure for evaluation cmMUX7EKjwui91d6FvdCeBeDqe8, unless the status tracker is otherwise marked failed (in which case further retries may not occur).
javax.jms.JMSException: Exception while acknowledging message(s):sender for socket /172.19.254.199:42254-eventsbroker/172.19.254.194:5673 is closed
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3787)
at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:544)
at org.apache.qpid.client.message.AbstractAMQMessageDelegate.acknowledge(AbstractAMQMessageDelegate.java:219)
at org.apache.qpid.client.message.AbstractJMSMessage.acknowledge(AbstractJMSMessage.java:311)
at wres.events.EvaluationStatusTracker.lambda$registerListenerForConsumer$4(EvaluationStatusTracker.java:1126)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.qpid.transport.SenderClosedException: sender for socket /172.19.254.199:42254-eventsbroker/172.19.254.194:5673 is closed
at org.apache.qpid.transport.network.io.IoSender.checkNotAlreadyClosed(IoSender.java:377)
at org.apache.qpid.transport.network.io.IoSender.send(IoSender.java:122)
at org.apache.qpid.framing.AMQFrame.writePayload(AMQFrame.java:69)
at org.apache.qpid.client.AMQProtocolHandler.writeFrame(AMQProtocolHandler.java:575)
at org.apache.qpid.client.AMQSession_0_8.acknowledgeMessage(AMQSession_0_8.java:172)
at org.apache.qpid.client.AMQSession_0_8.acknowledgeImpl(AMQSession_0_8.java:144)
at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:539)
... 12 common frames omitted
2021-12-14T19:26:41.568+0000 ERROR EvaluationStatusTracker While attempting to recover a session for evaluation cmMUX7EKjwui91d6FvdCeBeDqe8 in evaluation status tracker ENnUozMzvbKtmJIGsXravxSMjZw, encountered an error that prevented recovery.
javax.jms.JMSException: Recovery was interrupted by fail-over. Recovery status is not known.
at org.apache.qpid.client.AMQSession.recover(AMQSession.java:1918)
at wres.events.EvaluationStatusTracker.recover(EvaluationStatusTracker.java:1258)
at wres.events.EvaluationStatusTracker.lambda$registerListenerForConsumer$4(EvaluationStatusTracker.java:1137)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.qpid.client.failover.FailoverException: Failing over about to start
at org.apache.qpid.client.AMQProtocolHandler.notifyFailoverStarting(AMQProtocolHandler.java:437)
at org.apache.qpid.client.AMQProtocolHandler$1.run(AMQProtocolHandler.java:286)
... 1 common frames omitted
2021-12-14T19:28:46.876+0000 INFO EvaluationConsumer Consumer gTLSRyGqvr25RYrfgulGUTTtxLI closed evaluation cmMUX7EKjwui91d6FvdCeBeDqe8, which completed unsuccessfully.
2021-12-14T19:28:46.876+0000 ERROR AMQConnection Exception occurred in ExceptionListener
java.lang.IllegalArgumentException: Message publisher MNcOi2a0tnE7i4buYu6dnGzPua8 has been closed and cannot accept any further publication requests.
at wres.events.publish.MessagePublisher.publish(MessagePublisher.java:293)
at wres.events.subscribe.EvaluationConsumer.publishCompletionState(EvaluationConsumer.java:731)
at wres.events.subscribe.EvaluationConsumer.notifyFailure(EvaluationConsumer.java:620)
at wres.events.subscribe.EvaluationConsumer.markEvaluationFailedOnConsumption(EvaluationConsumer.java:291)
at wres.events.subscribe.EvaluationSubscriber.markSubscriberFailed(EvaluationSubscriber.java:1291)
at wres.events.subscribe.EvaluationSubscriber$ConnectionExceptionListener.onException(EvaluationSubscriber.java:1777)
at org.apache.qpid.client.AMQConnection$2.run(AMQConnection.java:1686)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
2021-12-14T19:27:32.601+0000 ERROR EvaluationStatusTracker While tracking evaluation cmMUX7EKjwui91d6FvdCeBeDqe8, encountered an error in a messaging client that prevented the evaluation from succeeding. The error message is:
client_id: "ENnUozMzvbKtmJIGsXravxSMjZw"
completion_status: EVALUATION_COMPLETE_REPORTED_FAILURE
status_events {
event_type: ERROR
event_message: "While completing evaluation cmMUX7EKjwui91d6FvdCeBeDqe8, encountered an error on a connection owned by the evaluation status tracker responsible for tracking this evaluation. If a failover policy was configured on the connection factory (e.g., connection retries), then that policy was exhausted before this error was thrown. As such, the error is not recoverable and the evaluation will now stop."
}
2021-12-14T19:29:23.919+0000 WARN Evaluation Unable to publish the completion status of evaluation cmMUX7EKjwui91d6FvdCeBeDqe8, which was EVALUATION_COMPLETE_REPORTED_FAILURE.
2021-12-14T19:37:48.125+0000 WARN HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=46s216ms214µs713ns).
2021-12-14T19:37:48.127+0000 WARN ProxyConnection HikariPool-1 - Connection org.postgresql.jdbc.PgConnection@b2a648 marked as broken because of SQLSTATE(08006), ErrorCode(0)
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:381)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:314)
at org.postgresql.jdbc.PgConnection.executeTransactionCommand(PgConnection.java:853)
at org.postgresql.jdbc.PgConnection.commit(PgConnection.java:875)
at org.postgresql.jdbc.PgConnection.setAutoCommit(PgConnection.java:817)
at com.zaxxer.hikari.pool.ProxyConnection.setAutoCommit(ProxyConnection.java:414)
at com.zaxxer.hikari.pool.HikariProxyConnection.setAutoCommit(HikariProxyConnection.java)
at com.github.marschall.jfr.jdbc.JfrConnection.setAutoCommit(JfrConnection.java:275)
at wres.io.utilities.SQLDataProvider.close(SQLDataProvider.java:883)
at wres.io.retrieval.TimeSeriesRetriever.getTimeSeriesFromScript(TimeSeriesRetriever.java:242)
at wres.io.retrieval.ObservationRetriever.get(ObservationRetriever.java:122)
at wres.io.retrieval.ObservationRetriever.get(ObservationRetriever.java:23)
at wres.io.retrieval.CachingRetriever.lambda$new$0(CachingRetriever.java:56)
at wres.io.retrieval.CachingSupplier.get(CachingSupplier.java:51)
at wres.io.retrieval.CachingRetriever.get(CachingRetriever.java:41)
at wres.io.retrieval.CachingRetriever.get(CachingRetriever.java:18)
at wres.io.pooling.PoolSupplier.createPool(PoolSupplier.java:281)
at wres.io.pooling.PoolSupplier.get(PoolSupplier.java:254)
at wres.io.pooling.PoolSupplier.get(PoolSupplier.java:83)
at wres.io.retrieval.CachingSupplier.get(CachingSupplier.java:51)
at wres.io.pooling.PoolFactory.lambda$decompose$17(PoolFactory.java:1100)
at wres.pipeline.PoolProcessor.get(PoolProcessor.java:237)
at wres.pipeline.PoolProcessor.get(PoolProcessor.java:36)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Unexpected packet type: 48
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2516)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:354)
... 26 common frames omitted
2021-12-14T19:37:48.128+0000 WARN SQLDataProvider Unable to set connection com.github.marschall.jfr.jdbc.JfrConnection@6b20934e to autocommit=true
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:381)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:314)
at org.postgresql.jdbc.PgConnection.executeTransactionCommand(PgConnection.java:853)
at org.postgresql.jdbc.PgConnection.commit(PgConnection.java:875)
at org.postgresql.jdbc.PgConnection.setAutoCommit(PgConnection.java:817)
at com.zaxxer.hikari.pool.ProxyConnection.setAutoCommit(ProxyConnection.java:414)
at com.zaxxer.hikari.pool.HikariProxyConnection.setAutoCommit(HikariProxyConnection.java)
at com.github.marschall.jfr.jdbc.JfrConnection.setAutoCommit(JfrConnection.java:275)
at wres.io.utilities.SQLDataProvider.close(SQLDataProvider.java:883)
at wres.io.retrieval.TimeSeriesRetriever.getTimeSeriesFromScript(TimeSeriesRetriever.java:242)
at wres.io.retrieval.ObservationRetriever.get(ObservationRetriever.java:122)
at wres.io.retrieval.ObservationRetriever.get(ObservationRetriever.java:23)
at wres.io.retrieval.CachingRetriever.lambda$new$0(CachingRetriever.java:56)
at wres.io.retrieval.CachingSupplier.get(CachingSupplier.java:51)
at wres.io.retrieval.CachingRetriever.get(CachingRetriever.java:41)
at wres.io.retrieval.CachingRetriever.get(CachingRetriever.java:18)
at wres.io.pooling.PoolSupplier.createPool(PoolSupplier.java:281)
at wres.io.pooling.PoolSupplier.get(PoolSupplier.java:254)
at wres.io.pooling.PoolSupplier.get(PoolSupplier.java:83)
at wres.io.retrieval.CachingSupplier.get(CachingSupplier.java:51)
at wres.io.pooling.PoolFactory.lambda$decompose$17(PoolFactory.java:1100)
at wres.pipeline.PoolProcessor.get(PoolProcessor.java:237)
at wres.pipeline.PoolProcessor.get(PoolProcessor.java:36)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Unexpected packet type: 48
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2516)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:354)
... 26 common frames omitted
Thanks,
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T20:19:56Z
I've created two *batc#yml files for my modified versions. The option modified:
- JAVA_OPTS=-Dwres.broker=${WRES_BROKER_HOST} -Dcom.redhat.fips=false -Djava.io.tmpdir=/mnt/wres_share/evaluations -Dwres.featureBatchSize=1 -XX:OnOutOfMemoryError='mv /mnt/wres_share/heap_dumps/
worker-shim/java_pid%p.hprof /mnt/wres_share/heap_dumps/worker-shim/java_pid%p_`hostname`.hprof; chmod 775 /mnt/wres_share/heap_dumps/worker-shim/java_pid%p_`hostname`.hprof'
I think that's safe to share here, right? Anyway, if that's wrong let me know. If I don't hear otherwise, I'll bring down the containers and bring them back up with this change at 3:30pm EST. I'll then push through the smoke test evaluation to confirm if the batch size change worked.
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T20:31:43Z
Bringing down the containers and bringing them back up with the new .yml,
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T20:36:46Z
I see no confirmation that my change took in the log. I was hoping to see some mention of it in this log message, but I don't:
2021-12-14T20:34:31.287+0000 INFO Main WRES version 20211213-85f9223
2021-12-14T20:34:31.306+0000 INFO Main Processors: 8; Max Memory: 2493MiB; Free Memory: 2408MiB; Total Memory: 2493MiB; awt.toolkit: sun.awt.X11.XToolkit; com.redhat.fips: false; file.encoding: UTF-8; java.awt.graphicsenv: sun.awt.X11GraphicsEnvironment; java.class.version: 55.0; java.home: /usr/lib/jvm/java-11-openjdk-11.0.13.0.8-3.el8_5.x86_64; java.io.tmpdir: /mnt/wres_share/evaluations; java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib; java.runtime.name: OpenJDK Runtime Environment; java.runtime.version: 11.0.13+8-LTS; java.util.logging.config.class: wres.system.logging.JavaUtilLoggingRedirector; java.version: 11.0.13; java.version.date: 2021-10-19; java.vm.compressedOopsMode: Zero based; java.vm.info: mixed mode, sharing; java.vm.name: OpenJDK 64-Bit Server VM; java.vm.vendor: Red Hat, Inc.; java.vm.version: 11.0.13+8-LTS; jdk.debug: release; os.arch: amd64; os.name: Linux; os.version: 3.10.0-1160.49.1.el7.x86_64; ucar.unidata.io.http.httpBufferSize: 200000; ucar.unidata.io.http.maxReadCacheSize: 200000; user.dir: /opt/wres-worker-20211213-85f9223; user.home: /container_home; user.language: en; user.name: wres_docker; user.timezone: UTC; wres.attemptToMigrate: false; wres.certificateFileToTrust: /wres_secrets/nwcal_ca_2021.pem; wres.databaseHost: nwcal-wresdb-ti01.[host]; wres.databaseName: wres8; wres.eventsBrokerAddress: eventsbroker; wres.eventsBrokerPort: 5673; wres.externalGraphics: true; wres.jobId: wres_job_4328313791723426573; wres.username: wres_user8
Should I?
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T20:42:53Z
I'm also seeing exceptions that I'm not sure about on both the entry and worker machines. For example:
eventsbroker_1 | 2021-12-14T20:38:09,553Z ERROR [Broker-Config] (o.a.q.s.m.AbstractConfiguredObject) - Failed to open object with name 'wres-truststore'. Object will be put into ERROR state.
eventsbroker_1 | org.apache.qpid.server.configuration.IllegalConfigurationException: Cannot load certificate(s) :java.security.KeyStoreException: sun.security.pkcs11.wrapper.PKCS11Exception: CKR_SESSION_READ_ONLY
eventsbroker_1 | at org.apache.qpid.server.security.NonJavaTrustStoreImpl.initialize(NonJavaTrustStoreImpl.java:159)
eventsbroker_1 | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
eventsbroker_1 | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
eventsbroker_1 | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
eventsbroker_1 | at java.base/java.lang.reflect.Method.invoke(Method.java:566)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.automatedSetValue(AbstractConfiguredObject.java:505)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.resolveAutomatedAttribute(AbstractConfiguredObject.java:1487)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.onResolve(AbstractConfiguredObject.java:1441)
...
and
eventsbroker_1 | 2021-12-14T20:38:10,068Z ERROR [Broker-Config] (o.a.q.s.m.AbstractConfiguredObject) - Failed to open object with name 'httpManagement'. Object will be put into ERROR state.
eventsbroker_1 | java.lang.IllegalStateException: Truststore TrustStore[id=a711f6b3-c46b-4229-b6ab-dbbba5afff95, name=wres-truststore, type=NonJavaTrustStore] defines no trust managers
eventsbroker_1 | at org.apache.qpid.server.security.NonJavaTrustStoreImpl.getTrustManagersInternal(NonJavaTrustStoreImpl.java:88)
eventsbroker_1 | at org.apache.qpid.server.security.AbstractTrustStore.getTrustManagers(AbstractTrustStore.java:360)
eventsbroker_1 | at org.apache.qpid.server.transport.network.security.ssl.SSLUtil.createSslContext(SSLUtil.java:943)
eventsbroker_1 | at org.apache.qpid.server.management.plugin.HttpManagement.createSslContext(HttpManagement.java:737)
eventsbroker_1 | at org.apache.qpid.server.management.plugin.HttpManagement.createSslContextFactory(HttpManagement.java:706)
eventsbroker_1 | at org.apache.qpid.server.management.plugin.HttpManagement.createConnector(HttpManagement.java:576)
eventsbroker_1 | at org.apache.qpid.server.management.plugin.HttpManagement.createServer(HttpManagement.java:310)
eventsbroker_1 | at org.apache.qpid.server.management.plugin.HttpManagement.doStart(HttpManagement.java:224)
eventsbroker_1 | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
eventsbroker_1 | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
eventsbroker_1 | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
eventsbroker_1 | at java.base/java.lang.reflect.Method.invoke(Method.java:566)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1536)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1515)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1081)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1075)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$25$1.run(AbstractConfiguredObject.java:2957)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$25$1.run(AbstractConfiguredObject.java:2953)
eventsbroker_1 | at java.base/java.security.AccessController.doPrivileged(Native Method)
eventsbroker_1 | at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$25.onSuccess(AbstractConfiguredObject.java:2952)
eventsbroker_1 | at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1080)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$ImmediateIfSameThreadExecutor.execute(TaskExecutorImpl.java:400)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl.execute(TaskExecutorImpl.java:183)
eventsbroker_1 | at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
eventsbroker_1 | at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:724)
eventsbroker_1 | at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:126)
...
Are those expected?
Hank
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T20:49:12Z
Sorry, Hank, I should've looked more closely at your change. You want the INNER options. It's always the inner options for the application.
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T20:52:35Z
Oh, I know that. I was just testing you. :)
Should have been paying more attention. Trying it again,
Hank
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T20:55:04Z
Those errors are NOT expected.
It might be related to the fips changes that Jesse was making, idk (edit: or rather the ubi8 image update or whatever caused the failure of the tasker to read the pkcs12 bundle).
Looks like it is probably limited to the http management console, though, which is a post-release gate in our release pipeline. So yet another thing to fix, but probably not fatal for the correct operation of the service.
Cannot load certificate(s) :java.security.KeyStoreException: sun.security.pkcs11.wrapper.PKCS11Exception: CKR_SESSION_READ_ONLY
edit: I note this one is about the protocol though, rather than the cert/key bundle.
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T21:03:50Z
Thanks, James. I've pasted a more complete exception below, though there are others reported. It has problems parsing the key.
I've brought up the service, confirmed in a smoke test that the batch size change was picked up, and have started the WPOD evaluation run (by itself; not as part of Test A):
https://nwcal-wres-ti.[host]/job/1812553047642649071
Its being processed on the -ti02 in container 6d12cc8221c7.
My work day is done. I'll report back on the run in the morning. If this should have been run in a full Test A, I can kill the job and start Test A. I'll check my emails in a half-hour, so just let me know.
Thanks,
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-12-14T21:05:00Z
Here is the exception,
Hank
====================================================================
eventsbroker_1 | 2021-12-14T20:56:32,862Z ERROR [Broker-Config] (o.a.q.s.m.AbstractConfiguredObject) - Failed to open object with name 'wres-keystore'. Object will be put into ERROR state.
eventsbroker_1 | org.apache.qpid.server.configuration.IllegalConfigurationException: Cannot load private key or certificate(s): java.security.spec.InvalidKeySpecException: Cannot parse the provided key as either PKCS#1 or PCKS#8 format
eventsbroker_1 | at org.apache.qpid.server.security.NonJavaKeyStoreImpl.updateKeyManagers(NonJavaKeyStoreImpl.java:263)
eventsbroker_1 | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
eventsbroker_1 | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
eventsbroker_1 | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
eventsbroker_1 | at java.base/java.lang.reflect.Method.invoke(Method.java:566)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.automatedSetValue(AbstractConfiguredObject.java:505)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.resolveAutomatedAttribute(AbstractConfiguredObject.java:1487)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.onResolve(AbstractConfiguredObject.java:1441)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.doResolution(AbstractConfiguredObject.java:1223)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$11.performAction(AbstractConfiguredObject.java:1236)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.applyToChildren(AbstractConfiguredObject.java:1323)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.doResolution(AbstractConfiguredObject.java:1225)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:596)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:586)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:647)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:640)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$TaskLoggingWrapper.execute(TaskExecutorImpl.java:248)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl.submitWrappedTask(TaskExecutorImpl.java:165)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl.submit(TaskExecutorImpl.java:153)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.doOnConfigThread(AbstractConfiguredObject.java:639)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.openAsync(AbstractConfiguredObject.java:585)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractSystemConfig.makeActive(AbstractSystemConfig.java:270)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractSystemConfig.activate(AbstractSystemConfig.java:246)
eventsbroker_1 | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
eventsbroker_1 | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
eventsbroker_1 | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
eventsbroker_1 | at java.base/java.lang.reflect.Method.invoke(Method.java:566)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1536)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1515)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1081)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1075)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$25$1.run(AbstractConfiguredObject.java:2957)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$25$1.run(AbstractConfiguredObject.java:2953)
eventsbroker_1 | at java.base/java.security.AccessController.doPrivileged(Native Method)
eventsbroker_1 | at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$25.onSuccess(AbstractConfiguredObject.java:2952)
eventsbroker_1 | at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1080)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$ImmediateIfSameThreadExecutor.execute(TaskExecutorImpl.java:400)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl.execute(TaskExecutorImpl.java:183)
eventsbroker_1 | at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
eventsbroker_1 | at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:724)
eventsbroker_1 | at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:126)
eventsbroker_1 | at com.google.common.util.concurrent.Futures.addCallback(Futures.java:1047)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.addFutureCallback(AbstractConfiguredObject.java:2947)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.doAttainState(AbstractConfiguredObject.java:1074)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject.access$600(AbstractConfiguredObject.java:105)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:599)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:586)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:647)
eventsbroker_1 | at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:640)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$TaskLoggingWrapper.execute(TaskExecutorImpl.java:248)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper$1.run(TaskExecutorImpl.java:320)
eventsbroker_1 | at java.base/java.security.AccessController.doPrivileged(Native Method)
eventsbroker_1 | at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
eventsbroker_1 | at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper.call(TaskExecutorImpl.java:313)
eventsbroker_1 | at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
eventsbroker_1 | at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
eventsbroker_1 | at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
eventsbroker_1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
eventsbroker_1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
eventsbroker_1 | at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
eventsbroker_1 | at java.base/java.lang.Thread.run(Thread.java:829)
eventsbroker_1 | Caused by: java.security.spec.InvalidKeySpecException: Cannot parse the provided key as either PKCS#1 or PCKS#8 format
eventsbroker_1 | at org.apache.qpid.server.transport.network.security.ssl.SSLUtil.readPrivateKey(SSLUtil.java:574)
eventsbroker_1 | at org.apache.qpid.server.transport.network.security.ssl.SSLUtil.readPrivateKey(SSLUtil.java:533)
eventsbroker_1 | at org.apache.qpid.server.transport.network.security.ssl.SSLUtil.readPrivateKey(SSLUtil.java:498)
eventsbroker_1 | at org.apache.qpid.server.security.NonJavaKeyStoreImpl.updateKeyManagers(NonJavaKeyStoreImpl.java:234)
eventsbroker_1 | ... 61 common frames omitted
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T21:34:33Z
James wrote:
Those errors are NOT expected.
It might be related to the fips changes that Jesse was making, idk (edit: or rather the ubi8 image update or whatever caused the failure of the tasker to read the pkcs12 bundle).
Looks like it is probably limited to the http management console, though, which is a post-release gate in our release pipeline. So yet another thing to fix, but probably not fatal for the correct operation of the service.
Cannot load certificate(s) :java.security.KeyStoreException: sun.security.pkcs11.wrapper.PKCS11Exception: CKR_SESSION_READ_ONLY
edit: I note this one is about the protocol though, rather than the cert/key bundle.
I did leave some differentish certificates on one of the test machines so if this happens on machine A but not machine B it is probably related. Looking.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-12-14T21:54:57Z
The different certs does not seem related because the eventsbroker has a different certificate, right?
Original Redmine Comment Author Name: James (James) Original Date: 2021-12-14T21:56:23Z
It has its own cert, yes.
"keystores": [
{
"name": "wres-keystore",
"type": "NonJavaKeyStore",
"privateKeyUrl": "${wres.secrets_dir}${file.separator}wres-eventsbroker_server_private_rsa_key.pem",
"certificateUrl": "${wres.secrets_dir}${file.separator}wres-eventsbroker_server_x509_cert.pem"
}
],
"truststores": [
{
"name": "wres-truststore",
"type": "NonJavaTrustStore",
"certificatesUrl": "${wres.secrets_dir}${file.separator}ca_x509_cert.pem"
}
]
</code>
Author Name: James (James) Original Redmine Issue: 95867, https://vlab.noaa.gov/redmine/issues/95867 Original Date: 2021-09-03 Original Assignee: James
Given an evaluation that contains N singleton features, such as scenario703 When I consider how to evaluate it performantly Then I want to consider feature-batched retrieval for N/M features at once
Related issue(s): #286 Redmine related issue(s): 98818, 99120, 99338, 99680, 99719, 99827, 99932, 99964, 99980, 109030