NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a developer, I don't want NPEs to happen when the tasker is unable to create a channel #168

Open epag opened 2 months ago

epag commented 2 months ago

Author Name: Hank (Hank) Original Redmine Issue: 128980, https://vlab.noaa.gov/redmine/issues/128980 Original Date: 2024-04-16 Original Assignee: Evan


See #128962-21. The exceptions that are occurring look like this:

2024-04-14T16:11:21.319+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobStatusWatcher.run(JobStatusWatcher.java:104)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

Per that comment, every queue watcher can encounter this exception, so all of them must be modified to avoid the NPE and to react reasonably when channels can't be created. Specifically, if all 5 channels cannot be created, we should assume something has gone wrong: delete any channels that were created and return some sort of error to the user, perhaps a 500. We should also not leave the job's status as @CREATED@, which is how the example I found was marked.

I'll post the complete tasker logging related to the event in the first comment. I'm estimating this to be low-hanging fruit, but I could be wrong.

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-04-16T11:51:04Z


More logging is posted below.

I think this inability to create channels was what caused the COWRES to become seemingly unresponsive to posted evaluations, with requests eventually timing out. So we need to ensure posts don't time out if channels fail to be created.

Hank

================================================

2024-04-14T16:11:21.190+0000 INFO WresJob ==========> REQUEST POSTED: verb = 'execute'; postInput = false; additional arguments = '[]'.
2024-04-14T16:11:21.194+0000 INFO DeclarationValidator Encountered the following declaration string:
---
label: CBPR Precipitation
observed:
  label: OBS Precipitation
  sources: /home/Haksu/CBPR/CBRFC/cbpr.vs.mefp/Data/ALEC2HLF_MAP.xml
  variable: MAP
  feature_authority: nws lid
  type: observations
... SNIP ...
output_formats:
  - csv2
  - format: png
    orientation: lead threshold
2024-04-14T16:11:21.217+0000 INFO JobResults Watching the queue job.935063770884753560.STDOUT for STDOUT logging.
2024-04-14T16:11:21.218+0000 INFO JobResults Watching the queue job.935063770884753560.STDERR for STDERR logging.
2024-04-14T16:11:21.319+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobStatusWatcher.run(JobStatusWatcher.java:104)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:11:21.319+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobResults$JobResultWatcher.call(JobResults.java:395)
    at wres.tasker.JobResults$JobResultWatcher.call(JobResults.java:313)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:11:21.319+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobOutputWatcher.run(JobOutputWatcher.java:104)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:11:22.769+0000 INFO RequestLog 10.3.13.38 - - "GET /job/5479189893529939891/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
2024-04-14T16:11:22.820+0000 INFO RequestLog 10.3.13.26 - - "GET /job/5386476468870069453/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
2024-04-14T16:11:24.793+0000 INFO RequestLog 10.3.13.38 - - "GET /job/5479189893529939891/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
2024-04-14T16:11:24.842+0000 INFO RequestLog 10.3.13.26 - - "GET /job/5386476468870069453/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
2024-04-14T16:11:26.815+0000 INFO RequestLog 10.3.13.38 - - "GET /job/5479189893529939891/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
2024-04-14T16:11:26.863+0000 INFO RequestLog 10.3.13.26 - - "GET /job/5386476468870069453/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
2024-04-14T16:11:28.402+0000 INFO WresJob ==========> REQUEST POSTED: verb = 'execute'; postInput = false; additional arguments = '[]'.
2024-04-14T16:11:28.404+0000 INFO DeclarationValidator Encountered the following declaration string:
---
label: CBPR Precipitation
observed:
  label: OBS Precipitation
  sources: /home/Haksu/CBPR/CBRFC/cbpr.vs.gefsv12/Data/ALEC2HLF_MAP.xml
  variable: MAP
  feature_authority: nws lid
  type: observations
... SNIP ...
output_formats:
  - csv2
  - format: png
    orientation: lead threshold
2024-04-14T16:11:28.418+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:731)
    at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:646)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:11:28.420+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:731)
    at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:646)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:11:28.422+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobOutputWatcher.run(JobOutputWatcher.java:104)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:11:28.424+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobStatusWatcher.run(JobStatusWatcher.java:104)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:11:28.426+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobResults$JobResultWatcher.call(JobResults.java:395)
    at wres.tasker.JobResults$JobResultWatcher.call(JobResults.java:313)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:11:28.841+0000 INFO RequestLog 10.3.13.38 - - "GET /job/5479189893529939891/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
2024-04-14T16:11:28.885+0000 INFO RequestLog 10.3.13.26 - - "GET /job/5386476468870069453/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
... LONG SNIP ...
2024-04-14T16:45:10.148+0000 INFO WresJob ==========> REQUEST POSTED: verb = 'execute'; postInput = false; additional arguments = '[]'.
2024-04-14T16:45:10.150+0000 INFO DeclarationValidator Encountered the following declaration string:
---
label: CBPR Precipitation
observed:
  label: OBS Precipitation
  sources: /home/Haksu/CBPR/CBRFC/cbpr.vs.mefp/Data/ALEC2HLF_MAP.xml
  variable: MAP
  feature_authority: nws lid
  type: observations
... SNIP ...
output_formats:
  - csv2
  - format: png
    orientation: lead threshold
2024-04-14T16:45:10.168+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobStatusWatcher.run(JobStatusWatcher.java:104)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:45:10.168+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobOutputWatcher.run(JobOutputWatcher.java:104)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:45:10.168+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:731)
    at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:646)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:45:10.168+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobResults$JobResultWatcher.call(JobResults.java:395)
    at wres.tasker.JobResults$JobResultWatcher.call(JobResults.java:313)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:45:10.168+0000 WARN JobResults A throwable was received while watching for messages in a queue.
java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
    at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:731)
    at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:646)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-14T16:45:11.401+0000 INFO RequestLog 10.3.13.26 - - "GET /job/5386476468870069453/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
2024-04-14T16:45:11.501+0000 INFO RequestLog 10.3.13.38 - - "GET /job/5479189893529939891/status HTTP/1.1" 200 31 "-" "python-requests/2.25.1"
epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-04-16T11:55:18Z


During the time frame of the logging I showed above, the broker logs show nothing of interest. These are all standard messages:

2024-04-14 16:00:26.929384+00:00 [info] <0.30396.18> accepting AMQP connection <0.30396.18> (172.19.254.135:48694 -> 172.19.254.130:5671)
2024-04-14 16:00:26.932274+00:00 [info] <0.30396.18> connection <0.30396.18> (172.19.254.135:48694 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:00:26.934309+00:00 [info] <0.30396.18> closing AMQP connection <0.30396.18> (172.19.254.135:48694 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:05:26.888751+00:00 [info] <0.30486.18> accepting AMQP connection <0.30486.18> (172.19.254.135:48862 -> 172.19.254.130:5671)
2024-04-14 16:05:26.892238+00:00 [info] <0.30486.18> connection <0.30486.18> (172.19.254.135:48862 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:05:26.894155+00:00 [info] <0.30486.18> closing AMQP connection <0.30486.18> (172.19.254.135:48862 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:10:26.901316+00:00 [info] <0.30582.18> accepting AMQP connection <0.30582.18> (172.19.254.135:49026 -> 172.19.254.130:5671)
2024-04-14 16:10:26.904216+00:00 [info] <0.30582.18> connection <0.30582.18> (172.19.254.135:49026 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:10:26.905856+00:00 [info] <0.30582.18> closing AMQP connection <0.30582.18> (172.19.254.135:49026 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:15:26.912543+00:00 [info] <0.30723.18> accepting AMQP connection <0.30723.18> (172.19.254.135:49200 -> 172.19.254.130:5671)
2024-04-14 16:15:26.915005+00:00 [info] <0.30723.18> connection <0.30723.18> (172.19.254.135:49200 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:15:26.917791+00:00 [info] <0.30723.18> closing AMQP connection <0.30723.18> (172.19.254.135:49200 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:20:26.930405+00:00 [info] <0.30818.18> accepting AMQP connection <0.30818.18> (172.19.254.135:49368 -> 172.19.254.130:5671)
2024-04-14 16:20:26.933447+00:00 [info] <0.30818.18> connection <0.30818.18> (172.19.254.135:49368 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:20:26.935968+00:00 [info] <0.30818.18> closing AMQP connection <0.30818.18> (172.19.254.135:49368 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:25:26.785579+00:00 [info] <0.30909.18> accepting AMQP connection <0.30909.18> (172.19.254.135:49530 -> 172.19.254.130:5671)
2024-04-14 16:25:26.787885+00:00 [info] <0.30909.18> connection <0.30909.18> (172.19.254.135:49530 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:25:26.790311+00:00 [info] <0.30909.18> closing AMQP connection <0.30909.18> (172.19.254.135:49530 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:30:26.943489+00:00 [info] <0.31004.18> accepting AMQP connection <0.31004.18> (172.19.254.135:49700 -> 172.19.254.130:5671)
2024-04-14 16:30:26.946104+00:00 [info] <0.31004.18> connection <0.31004.18> (172.19.254.135:49700 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:30:26.948486+00:00 [info] <0.31004.18> closing AMQP connection <0.31004.18> (172.19.254.135:49700 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:35:26.838284+00:00 [info] <0.31094.18> accepting AMQP connection <0.31094.18> (172.19.254.135:49868 -> 172.19.254.130:5671)
2024-04-14 16:35:26.841378+00:00 [info] <0.31094.18> connection <0.31094.18> (172.19.254.135:49868 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:35:26.843721+00:00 [info] <0.31094.18> closing AMQP connection <0.31094.18> (172.19.254.135:49868 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:40:26.721837+00:00 [info] <0.31185.18> accepting AMQP connection <0.31185.18> (172.19.254.135:50024 -> 172.19.254.130:5671)
2024-04-14 16:40:26.724232+00:00 [info] <0.31185.18> connection <0.31185.18> (172.19.254.135:50024 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:40:26.726340+00:00 [info] <0.31185.18> closing AMQP connection <0.31185.18> (172.19.254.135:50024 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:45:26.894686+00:00 [info] <0.31299.18> accepting AMQP connection <0.31299.18> (172.19.254.135:50186 -> 172.19.254.130:5671)
2024-04-14 16:45:26.897358+00:00 [info] <0.31299.18> connection <0.31299.18> (172.19.254.135:50186 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:45:26.899736+00:00 [info] <0.31299.18> closing AMQP connection <0.31299.18> (172.19.254.135:50186 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:50:26.812973+00:00 [info] <0.31394.18> accepting AMQP connection <0.31394.18> (172.19.254.135:50354 -> 172.19.254.130:5671)
2024-04-14 16:50:26.815553+00:00 [info] <0.31394.18> connection <0.31394.18> (172.19.254.135:50354 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:50:26.818718+00:00 [info] <0.31394.18> closing AMQP connection <0.31394.18> (172.19.254.135:50354 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 16:55:26.730358+00:00 [info] <0.31485.18> accepting AMQP connection <0.31485.18> (172.19.254.135:50512 -> 172.19.254.130:5671)
2024-04-14 16:55:26.733423+00:00 [info] <0.31485.18> connection <0.31485.18> (172.19.254.135:50512 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 16:55:26.735789+00:00 [info] <0.31485.18> closing AMQP connection <0.31485.18> (172.19.254.135:50512 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')
2024-04-14 17:00:26.935844+00:00 [info] <0.31575.18> accepting AMQP connection <0.31575.18> (172.19.254.135:50670 -> 172.19.254.130:5671)
2024-04-14 17:00:26.938582+00:00 [info] <0.31575.18> connection <0.31575.18> (172.19.254.135:50670 -> 172.19.254.130:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
2024-04-14 17:00:26.941956+00:00 [info] <0.31575.18> closing AMQP connection <0.31575.18> (172.19.254.135:50670 -> 172.19.254.130:5671, vhost: 'wres', user: 'wres-tasker')

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-04-16T12:01:12Z


Here is the code in the @JobStatusWatcher@:

    @Override
    public void run()
    {
        Consumer<GeneratedMessageV3> sharer = new JobStatusSharer( this.getJobMetadata() );

        BlockingQueue<JobStatus.job_status> jobStatusQueue
                = new ArrayBlockingQueue<>( LOCAL_Q_SIZE );

        String exchangeName = this.getJobStatusExchangeName();
        String exchangeType = "topic";
        String bindingKey = "job." + this.getJobId() + "." + TOPIC;
        String queueName = null;
        Channel channel = null;

        try 
        {
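            // Note: createChannel() can return null without throwing (see the Javadoc
            // quoted below), which is what produces the NPE on the next line.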
            channel = this.getConnection().createChannel();
            channel.exchangeDeclare( exchangeName, exchangeType, true );

            // As the consumer, I want an exclusive queue for me.
            queueName = channel.queueDeclare(bindingKey, true, false, false, null).getQueue();
            channel.queueBind( queueName, exchangeName, bindingKey );

            LOGGER.info("Watching the queue {} for status information on the evaluation.", queueName);

            JobStatusConsumer jobStatusConsumer =
                    new JobStatusConsumer( channel, jobStatusQueue );

            channel.basicConsume( queueName,
                                  true,
                                  jobStatusConsumer );
            this.getCountDownLatch().countDown();
            LOGGER.info("Now waiting for messages in the queue {}.", queueName);
            JobMessageHelper.waitForAllMessages( queueName,
                                                 this.getJobId(),
                                                 jobStatusQueue,
                                                 sharer,
                                                 TOPIC );
            LOGGER.info("Done waiting for messages in the queue {}.", queueName);
        }
        catch ( InterruptedException ie )
        {
            LOGGER.warn( "Interrupted while getting status for job {}",
                         this.getJobId(), ie );
            Thread.currentThread().interrupt();
        }
        catch ( IOException e )
        {
            // Since we may or may not actually consume result, log exception here
            LOGGER.warn( "When attempting to get job status message using {}:",
                         this, e );
        }
        finally
        {
            if ( (queueName != null) && (channel != null) )
            {
                try
                {
                    LOGGER.info( "Deleting the queue {}", queueName );
                    AMQP.Queue.DeleteOk deleteOk = channel.queueDelete(queueName);
                    if (deleteOk == null)
                    {
                        LOGGER.warn( "Delete queue with name {} failed. There might be a zombie queue.", queueName );
                    }
                }
                catch ( IOException e )
                {
                    LOGGER.warn( "Delete queue with name {} failed due to an exception. There might be a zombie queue.", queueName, e );
                }
            }
        }
    }

Note that SonarQube, or whatever it's called, is flagging that the channel is never closed. It recommends using a try-with-resources or closing the channel in the finally block.

For this ticket, it appears that @channel = this.getConnection().createChannel();@ can return a null @channel@ without throwing an exception. The "Javadoc":https://javadoc.io/doc/com.rabbitmq/amqp-client/latest/index.html confirms it:

Channel createChannel()
               throws IOException
Create a new channel, using an internally allocated channel number. If automatic connection recovery is enabled, the channel returned by this method will be Recoverable.
Use openChannel() if you want to use an Optional to deal with a value.

Returns:
a new channel descriptor, or null if none is available
Throws:
IOException - if an I/O problem is encountered

So a null return is possible, and we should have checked for it before trying to use the created channel.
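For illustration only (not the committed fix), the @openChannel()@ method mentioned in that Javadoc wraps the possibly-null result in an @Optional@, so a missing channel would surface in the watcher's existing @IOException@ handling rather than as a later NPE:

            // Sketch, assuming the same getConnection() accessor as in the watcher code above:
            // openChannel() returns Optional.empty() where createChannel() would return null.
            Channel channel = this.getConnection()
                                  .openChannel()
                                  .orElseThrow( () -> new IOException( "No channel available from the broker." ) );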

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-04-16T12:20:52Z


Could it have reached some sort of upper limit on channels because we aren't closing them properly?

Anyway, I need to UAT the WRES GUI, so I'm changing my focus,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-04-16T12:24:31Z


There is a maximum number of channels per connection per "this":https://www.rabbitmq.com/docs/channels#:~:text=Maximum%20Number%20of%20Channels%20per,both%20RabbitMQ%20and%20client%20libraries:

Maximum Number of Channels per Connection​
The maximum number of channels that can be open on a connection simultaneously is negotiated by client and server at connection time. The value is configurable for both RabbitMQ and client libraries.

I have no idea if this is related, however. Whoever tackles this (I might be able to later this week) should check how the connections and channels are being handled and whether that maximum is being hit. The @createChannel@ call is returning null. Why is it doing so?
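For reference, the limit is negotiated through the client's @ConnectionFactory@. A hedged sketch of how it could be raised and inspected (this is not how the tasker currently configures it):

    // Sketch only: request a higher per-connection channel limit (the broker may
    // cap the negotiated value) and log what was actually agreed.
    ConnectionFactory factory = new ConnectionFactory();
    factory.setRequestedChannelMax( 4096 );
    Connection connection = factory.newConnection();
    LOGGER.info( "Negotiated channel_max: {}", connection.getChannelMax() );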

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T14:20:50Z


Hank wrote:

The @createChannel@ call is returning null. Why is it doing so?

This @createChannel@ creates a channel with the broker to facilitate communication. My guess is that, since the broker was UNHEALTHY while all of the McAfee activity was ongoing, we were unable to create a new channel.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T14:31:12Z


Hank wrote:

There is a maximum number of channels per connection per "this":https://www.rabbitmq.com/docs/channels#:~:text=Maximum%20Number%20of%20Channels%20per,both%20RabbitMQ%20and%20client%20libraries:

That link also shows a message we should expect to see if we hit the max channels, and I don't see it:

2019-02-11 16:04:06.296 [error] <0.887.0> Error on AMQP connection <0.887.0> (127.0.0.1:49956 -> 127.0.0.1:5672, vhost: '/', user: 'guest', state: running), channel 23:
 operation none caused a connection exception not_allowed: "number of channels opened (22) has reached the negotiated channel_max (22)
epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-04-17T14:39:07Z


Got it. Thanks.

So the theory is that the broker was unhealth, and McAfee (OAS Manager, according to @top@) was interfering with it. Seems plausible. Just not sure how to prove it.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T14:57:55Z


We definitely have a leak, though. I'm actually surprised that we don't see the channel-full message more often. It looks like we are almost at half our max channels after 2 days.

[Screenshot attachment: Screenshot 2024-04-17 105611.png]

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T16:12:03Z


Hank wrote:

More logging is posted below.

I think this inability to create channels was what caused the COWRES to become seemingly unresponsive to posted evaluations, with requests eventually timing out. So we need to ensure posts don't time out if channels fail to be created.

Hank

================================================

[...]

OK, it actually does seem related to the max channels being reached because of the leak. Before getting to the section of code responsible for this, I just looped through creating 3000 channels without closing them (a sketch of the loop follows the log below). I was left with 2 jobs that were hanging during submission and the following stack trace:

 WARNING: A class jakarta.activation.DataSource for a default provider MessageBodyWriter<jakarta.activation.DataSource> was not found. The provider is not available.
wres-tasker-1        | Apr 17, 2024 4:07:38 PM org.glassfish.jersey.server.wadl.WadlFeature configure
wres-tasker-1        | WARNING: JAX-B API not found . WADL feature is disabled.
wres-tasker-1        | 2024-04-17T16:07:38.549+0000 INFO WresJob ==========> REQUEST POSTED: verb = 'execute'; postInput = false; additional arguments = '[]'.
wres-tasker-1        | 2024-04-17T16:07:38.558+0000 INFO DeclarationValidator Encountered the following declaration string:
wres-tasker-1        | ---
wres-tasker-1        | label: web demo
wres-tasker-1        | observed: /mnt/wres_share/systests/data/DRRC2QINE.xml
wres-tasker-1        | predicted:
wres-tasker-1        |   label: HEFS
wres-tasker-1        |   sources: /mnt/wres_share/systests/data/drrc2ForecastsOneMonth/
wres-tasker-1        | unit: m3/s
wres-tasker-1        | lead_times:
wres-tasker-1        |   minimum: 0
wres-tasker-1        |   maximum: 48
wres-tasker-1        |   unit: hours
wres-tasker-1        | probability_thresholds:
wres-tasker-1        |   values: [0.002, 0.01, 0.1, 0.9, 0.99, 0.998]
wres-tasker-1        |   operator: greater equal
wres-tasker-1        | metrics:
wres-tasker-1        |   - quantile quantile diagram
wres-tasker-1        |   - rank histogram
wres-tasker-1        |   - relative operating characteristic diagram
wres-tasker-1        |   - box plot of errors by forecast value
wres-tasker-1        |   - root mean square error
wres-tasker-1        |   - sample size
wres-tasker-1        |   - mean absolute error
wres-tasker-1        |   - box plot of errors by observed value
wres-tasker-1        |   - reliability diagram
wres-broker-1        | 2024-04-17 16:07:38.923789+00:00 [info] <0.4628.0> accepting AMQP connection <0.4628.0> (172.19.254.72:53060 -> 172.19.254.68:5671)
wres-broker-1        | 2024-04-17 16:07:38.927443+00:00 [info] <0.4628.0> connection <0.4628.0> (172.19.254.72:53060 -> 172.19.254.68:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
wres-broker-1        | 2024-04-17 16:07:38.956352+00:00 [info] <0.4651.0> accepting AMQP connection <0.4651.0> (172.19.254.72:53064 -> 172.19.254.68:5671)
wres-broker-1        | 2024-04-17 16:07:38.959538+00:00 [info] <0.4651.0> connection <0.4651.0> (172.19.254.72:53064 -> 172.19.254.68:5671): user 'wres-tasker' authenticated and granted access to vhost 'wres'
wres-tasker-1        | 2024-04-17T16:07:38.977+0000 INFO JobResults Watching queue job.8417949618631924206.EXITCODE for the exit code result of the evaluation.
wres-tasker-1        | 2024-04-17T16:07:38.977+0000 INFO JobResults Watching the queue job.8417949618631924206.STDOUT for STDOUT logging.
wres-tasker-1        | 2024-04-17T16:07:38.978+0000 INFO JobResults Watching the queue job.8417949618631924206.STDERR for STDERR logging.
wres-tasker-1        | 2024-04-17T16:07:38.981+0000 INFO JobResults Looking for exit code on topic job.8417949618631924206.EXITCODE
wres-tasker-1        | 2024-04-17T16:07:38.981+0000 INFO JobOutputWatcher Watching the queue job.8417949618631924206.OUTPUT for job output.
wres-tasker-1        | 2024-04-17T16:07:39.868+0000 WARN JobResults A throwable was received while watching for messages in a queue.
wres-tasker-1        | java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
wres-tasker-1        |  at wres.tasker.JobStatusWatcher.run(JobStatusWatcher.java:116)
wres-tasker-1        |  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
wres-tasker-1        |  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
wres-tasker-1        |  at java.base/java.lang.Thread.run(Thread.java:840)
wres-graphics-1      | 2024-04-17T16:08:28.420+0000 INFO MessagingClient EvaluationMessager subscriber IeTe59HcgK9IARY9T_3yEPlXxvQ is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
wres-writing-1       | 2024-04-17T16:08:28.428+0000 INFO MessagingClient EvaluationMessager subscriber b-ibaT_LycK5It2MNtsWMNZWvI8 is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
wres-tasker-1        | 2024-04-17T16:08:38.991+0000 INFO JobMessageHelper Still waiting for job 8417949618631924206 STDOUT
wres-tasker-1        | 2024-04-17T16:08:38.992+0000 INFO JobMessageHelper Still waiting for job 8417949618631924206 STDERR
wres-tasker-1        | 2024-04-17T16:08:38.992+0000 INFO JobMessageHelper Still waiting for job 8417949618631924206 OUTPUT
wres-tasker-1        | 2024-04-17T16:08:39.487+0000 INFO RequestLog 172.19.254.65 - - "GET / HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
wres-tasker-1        | 2024-04-17T16:08:39.490+0000 INFO RequestLog 172.19.254.65 - - "GET /index.html HTTP/1.1" 200 2317 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
wres-tasker-1        | 2024-04-17T16:08:44.024+0000 INFO WresJob ==========> REQUEST POSTED: verb = 'execute'; postInput = false; additional arguments = '[]'.
wres-tasker-1        | 2024-04-17T16:08:44.025+0000 INFO DeclarationValidator Encountered the following declaration string:
wres-tasker-1        | ---
wres-tasker-1        | label: web demo
wres-tasker-1        | observed: /mnt/wres_share/systests/data/DRRC2QINE.xml
wres-tasker-1        | predicted:
wres-tasker-1        |   label: HEFS
wres-tasker-1        |   sources: /mnt/wres_share/systests/data/drrc2ForecastsOneMonth/
wres-tasker-1        | unit: m3/s
wres-tasker-1        | lead_times:
wres-tasker-1        |   minimum: 0
wres-tasker-1        |   maximum: 48
wres-tasker-1        |   unit: hours
wres-tasker-1        | probability_thresholds:
wres-tasker-1        |   values: [0.002, 0.01, 0.1, 0.9, 0.99, 0.998]
wres-tasker-1        |   operator: greater equal
wres-tasker-1        | metrics:
wres-tasker-1        |   - quantile quantile diagram
wres-tasker-1        |   - rank histogram
wres-tasker-1        |   - relative operating characteristic diagram
wres-tasker-1        |   - box plot of errors by forecast value
wres-tasker-1        |   - root mean square error
wres-tasker-1        |   - sample size
wres-tasker-1        |   - mean absolute error
wres-tasker-1        |   - box plot of errors by observed value
wres-tasker-1        |   - reliability diagram
wres-tasker-1        | 2024-04-17T16:08:44.052+0000 WARN JobResults A throwable was received while watching for messages in a queue.
wres-tasker-1        | java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
wres-tasker-1        |  at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:731)
wres-tasker-1        |  at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:646)
wres-tasker-1        |  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
wres-tasker-1        |  at java.base/java.lang.Thread.run(Thread.java:840)
wres-tasker-1        | 2024-04-17T16:08:44.052+0000 WARN JobResults A throwable was received while watching for messages in a queue.
wres-tasker-1        | java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
wres-tasker-1        |  at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:731)
wres-tasker-1        |  at wres.tasker.JobResults$StandardStreamWatcher.call(JobResults.java:646)
wres-tasker-1        |  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
wres-tasker-1        |  at java.base/java.lang.Thread.run(Thread.java:840)
wres-tasker-1        | 2024-04-17T16:08:44.052+0000 WARN JobResults A throwable was received while watching for messages in a queue.
wres-tasker-1        | java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
wres-tasker-1        |  at wres.tasker.JobOutputWatcher.run(JobOutputWatcher.java:104)
wres-tasker-1        |  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
wres-tasker-1        |  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
wres-tasker-1        |  at java.base/java.lang.Thread.run(Thread.java:840)
wres-tasker-1        | 2024-04-17T16:08:44.053+0000 WARN JobResults A throwable was received while watching for messages in a queue.
wres-tasker-1        | java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
wres-tasker-1        |  at wres.tasker.JobResults$JobResultWatcher.call(JobResults.java:395)
wres-tasker-1        |  at wres.tasker.JobResults$JobResultWatcher.call(JobResults.java:313)
wres-tasker-1        |  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
wres-tasker-1        |  at java.base/java.lang.Thread.run(Thread.java:840)
wres-tasker-1        | 2024-04-17T16:08:44.053+0000 WARN JobResults A throwable was received while watching for messages in a queue.
wres-tasker-1        | java.lang.NullPointerException: Cannot invoke "com.rabbitmq.client.Channel.exchangeDeclare(String, String, boolean)" because "channel" is null
wres-tasker-1        |  at wres.tasker.JobStatusWatcher.run(JobStatusWatcher.java:116)
wres-tasker-1        |  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
wres-tasker-1        |  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
wres-tasker-1        |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
wres-tasker-1        |  at java.base/java.lang.Thread.run(Thread.java:840)
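The reproduction loop described above might look roughly like this (a hypothetical sketch; @connection@ stands in for the tasker's broker connection):

    // Hypothetical sketch: open channels without ever closing them until the broker
    // refuses more, at which point createChannel() starts returning null.
    List<Channel> leaked = new ArrayList<>();
    for ( int i = 0; i < 3000; i++ )
    {
        leaked.add( connection.createChannel() );
    }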
epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T16:12:51Z


These two jobs are then just stuck.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T16:49:48Z


I've added code to close the channel in the finally block to clean up the leak:

        finally
        {
            try
            {
                if ( channel != null )
                {
                    channel.close();
                }

I also added this code to the try block to deal with a null response if we get one:

            if ( Objects.isNull( channel ) )
            {
                LOGGER.warn( "Channel was unable to be created. There might be a leak" );
                return;
            }
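Putting those two snippets together, the guarded watcher body might look roughly like this (a sketch only; the committed change may differ in its details):

        Channel channel = null;
        try
        {
            channel = this.getConnection().createChannel();

            if ( Objects.isNull( channel ) )
            {
                LOGGER.warn( "Channel was unable to be created. There might be a leak" );
                return;
            }

            // ... declare the exchange, bind the queue and consume as before ...
        }
        finally
        {
            // Delete the job queue first (as in the existing finally block), then
            // close the channel so it is returned to the broker.
            try
            {
                if ( channel != null )
                {
                    channel.close();
                }
            }
            catch ( IOException | TimeoutException e )
            {
                LOGGER.warn( "Failed to close the channel cleanly.", e );
            }
        }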
epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-04-17T16:54:43Z


Can't you use a try-with-resources, which is clearer/cleaner? Anything that is @AutoCloseable@ should be put in a try-with-resources at the moment of creation, by default (or documented as to why it isn't appropriate, such as a delayed pull/close).

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T17:15:42Z


James wrote:

Can't you use a try-with-resources, which is clearer/cleaner? Anything that is @AutoCloseable@ should be put in a try-with-resources at the moment of creation, by default (or documented as to why it isn't appropriate, such as a delayed pull/close).

In the finally block we are deleting queues, and we need access to the channel to do that; a try-with-resources removes that capability. I am not sure whether closing the actual connection will remove the need for this going forward, though:

    finally
            {
                if ( ( queueName != null ) && ( channel != null ) )
                {
                    try
                    {
                        LOGGER.info( "Deleting the queue {}", queueName );
                        AMQP.Queue.DeleteOk deleteOk = channel.queueDelete( queueName );

I don't have the cert set up on my local box to see if the queues get removed properly
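One possible way to reconcile the two, sketched here for illustration only (not what was committed), is to delete the queue in an inner finally while the channel is still open, and let the try-with-resources handle the close:

        // Sketch: the outer try-with-resources closes the channel; the inner finally
        // deletes the queue before that close happens. A null return from
        // createChannel() would still need the guard shown earlier.
        try ( Channel channel = this.getConnection().createChannel() )
        {
            String queueName = channel.queueDeclare( bindingKey, true, false, false, null ).getQueue();
            try
            {
                channel.queueBind( queueName, exchangeName, bindingKey );
                // ... consume and wait for messages as before ...
            }
            finally
            {
                LOGGER.info( "Deleting the queue {}", queueName );
                channel.queueDelete( queueName );
            }
        }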

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T17:16:49Z


I suppose I can test this in staging if we are done with 3.0 testing for the time being.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T18:05:37Z


I did some research, and if we set the queue to be auto-delete, then closing the channel will remove the queue:

auto-delete

If set, the queue is deleted when all consumers have finished using it. The last consumer can be cancelled either explicitly or because its channel is closed. 
If there was no consumer ever on the queue, it won't be deleted. Applications can explicitly delete auto-delete queues using the Delete method as normal.
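In the Java client, that would just be the fourth argument to @queueDeclare@; for example, using the same arguments as the watcher code quoted earlier:

            // durable = true, exclusive = false, autoDelete = true (currently false)
            queueName = channel.queueDeclare( bindingKey, true, false, true, null ).getQueue();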
epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T18:10:09Z


We actually don't have any of the old channel creations being closed. I am going to clean them all up now.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T18:21:50Z


I just looked at the commit history, and it looks like we actually removed this from being a try-with-resources in the past:

https://vlab.noaa.gov/redmine/projects/wres/repository/revisions/1ffecf6aeb653267809322c42de36847197a1c31/diff/wres-tasker/src/wres/tasker/JobOutputWatcher.java

Might be a shot in the dark, but Hank, do you recall ever trying to make the queue auto-delete?

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T18:22:26Z


I am just going to go ahead with the auto-delete queue implementation, test in staging whether that also deletes the queues (it should), and then we can move back to the try-with-resources.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-04-17T18:44:19Z


Auto-delete? No. However, it looks like I removed the try-with-resources, so not ensuring things were closed is on me. Does the commit message say anything about why I would have removed it?

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-04-17T18:48:25Z


Looks like I was trying to fix the zombies and did something that would ensure queues were deleted, but never closed the channel. Oops.

I'm not sure if Jesse pursued the auto-delete option. However, if we put that change in place, we'll need to do some thorough testing under various conditions of persister/tasker/broker/worker breakdowns in communication. For example: making sure no zombies are left over when taking down just the tasker, taking down just the broker, or taking down just the worker; stopping and restarting all 3 nearly at once while varying the order in which they are started and stopped; and so forth. It could take a few hours just to look at every scenario we can think of.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T18:51:28Z


Hank wrote:

Looks like I was trying to fix the zombies and did something that would ensure queues were deleted, but never closed the channel. Oops.

I'm not sure if Jesse pursued the auto-delete option. However, if we put that change in place, we'll need to do some thorough testing under various conditions of persister/tasker/broker/worker breakdowns in communication. For example: making sure no zombies are left over when taking down just the tasker, taking down just the broker, or taking down just the worker; stopping and restarting all 3 nearly at once while varying the order in which they are started and stopped; and so forth. It could take a few hours just to look at every scenario we can think of.

Hank

With that in mind, do we just want to stick with our current setup, add the channel close to the finally block, and not explore auto-delete for the time being?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-04-17T18:53:35Z


Agree that it would need to be tested and also justified by the use case. Auto-delete queues are intended for a use case that requires a queue as long as the connection is alive, i.e., temporarily, and will need to be re-established when connections are dropped, for example. If that is consistent with the use case, fine.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T18:55:15Z


For the time being, then, I will just continue with the approach above and add a comment/leave a ticket to investigate auto-delete queues so that we can create the channels in a try-with-resources.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-04-17T18:57:59Z


Sharing my thoughts even though you all typed much faster. :)

========

Evan,

My opinion depends on whether we have a means to reproduce the issue in staging. Now that we believe we know the cause, can we create some sort of evaluation shape or perhaps just raw quantity to force the crash?

If we can, then I see value in just testing the channel close, first, before testing the auto-delete. That way we can see the impact of each independently.

If we can't, then it's your call. If you can convince yourself the auto-delete works, go for it. Otherwise, just add back in the channel closing.

Hopefully, this makes sense. :)

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T19:58:59Z


Hank wrote:

My opinion depends on whether we have a means to reproduce the issue in staging. Now that we believe we know the cause, can we create some sort of evaluation shape or perhaps just raw quantity to force the crash?

We could reproduce it in staging by spamming a ton of jobs, but we can simply see this is resolved by the fact that our open channel count doesn't continually rise as more jobs come in.

If we can, then I see value in just testing the channel close, first, before testing the auto-delete. That way we can see the impact of each independently.

agreed

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T20:13:23Z


OK, I think I have stuff set up decently; however, I am trying to figure out a good way to present the null case to the user. Currently I have wrapped the NPE, but from a user perspective it just hangs on the job submission screen (which is not good behavior). I'm trying to find the proper way to navigate the user to a screen that lets them know why their job failed.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T20:14:06Z


Alternatively, we can just leave the current behavior as-is (since cleaning up the leak should remove this scenario altogether).

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-17T20:23:35Z


I will go ahead and push up the leak-closing code, but I will revisit the client behavior tomorrow.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-18T15:01:01Z


Hank deployed to staging earlier, and it seems like the channels are not increasing over time. We can talk in the sync today about how we want the client-facing behavior to be, but at least the leak is cleaned up.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-19T12:37:42Z


Confirmed today that the leak has been closed.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-22T15:10:56Z


It makes sense to me that we would fail out of submitting a job if the broker queue for the JobResults fails, but if any of the other broker queues fails, do we want to cancel the job submission?

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-22T18:11:21Z


I have it going to an error page now; I'm trying to find the best way, though, to kill channels that might have been created. I.e., if we are able to spin up the status queue and the exit code queue, but none of the other ones, then we will fail their job out to an error page, but those other two queues still need to be closed out.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-22T19:51:21Z


Alright, unless someone else has input, here is what I have in place:

Failure to create a broker for the status, results, or output of a job causes the job to fail out.

Failure of the stderr or stdout brokers does not stop a job from continuing.

I added the following code, which allows 30 seconds to establish all of the connections; if they fail to do so, then we tear down the threads, which in turn kills the queues.

            boolean await = latch.await( 30, TimeUnit.SECONDS );

            if ( !await )
            {
                JOB_RESULTS.shutdownNow();
                LOGGER.warn( "Some Channel was not able to be created. There might be a leak" );
                throw new IOException( "Unable to connect to broker" );
            }
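The counting side of this is already in the watchers: each one calls @countDown()@ on the shared latch once its consumer is in place (see the @this.getCountDownLatch().countDown();@ line in the @JobStatusWatcher@ code quoted earlier). A rough sketch of the latch setup, with hypothetical names:

            // Sketch with hypothetical names: one count per watcher. Each watcher calls
            // countDown() after basicConsume() succeeds, so the await above returns as
            // soon as all five channels and queues are in place.
            CountDownLatch latch = new CountDownLatch( NUMBER_OF_WATCHERS );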
epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-22T20:07:17Z


Change here https://vlab.noaa.gov/redmine/projects/wres/repository/revisions/5290fb7a4004867c8cdd105914181c1de6b62ec7

This should be good now

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-22T20:08:07Z


UAT:

Run a handful of jobs in staging and check in the RabbitMQ monitor that the channels in the tasker are not increasing over time.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-04-22T20:32:42Z


I assume you mean queues, right (not brokers)? There are separate queues for various activities and the failure to create one or more queues results in an eventual exception.

My only advice is that you do not wait 30 seconds unconditionally. It's unclear whether that is the case, but we don't want an artificial latency of 30 seconds when spinning up the service. Once all queues are confirmed created, then there should be no artificial delay.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-04-22T21:32:54Z


James wrote:

I assume you mean queues, right (not brokers)? There are separate queues for various activities and the failure to create one or more queues results in an eventual exception.

I am referring to the image in #128980-10; these channels maxing out is what causes the original NPE. That is the leak that I cleaned up, and it is what the person who UATs this should check is not increasing. (Not sure if that was what you were pointing out.)

My only advice is that you do not wait 30 seconds unconditionally. It's unclear whether that is the case, but we don't want an artificial latency of 30 seconds when spinning up the service. Once all queues are confirmed created, then there should be no artificial delay.

It does not wait 30 seconds unconditionally; the await passes once all the latches are counted down (which takes well under 30 seconds). That timer is a maximum timeout for waiting on the latches. So no latency is added here, just a way for us to catch when one of the connections was not created.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-04-22T21:43:39Z


Sounds good. I was referring to #128980-35, but I think it's just a couple of typos with brokers vs. queues.