NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a developer, if an evaluation is picked up by a worker when the tasker's watcher queues are not present, then don't bother to run the evaluation #173

Open epag opened 2 months ago

epag commented 2 months ago

Author Name: Hank (Hank) Original Redmine Issue: 131491, https://vlab.noaa.gov/redmine/issues/131491 Original Date: 2024-06-13


Repurposed; see the comments below. If a worker picks up an evaluation, but nothing is watching for its results and output, then there is no reason for the worker to perform that evaluation.

The original description is below. Thanks,

Hank

=========================================================== ORIGINAL DESCRIPTION:

This could be particularly bothersome if the extra evaluations each take hours to complete.

While we have a mechanism to force the tasker to stop "watching" for an evaluation to complete (post a blank message to the @exitCode@ queue and the tasker will treat it as a failed evaluation), we have no means to remove those evaluations from the broker's @wres.job@ queue. That means the evaluations will still be grabbed by workers and processed. Evidence is in production now: when the GUI scheduler was turned back on, a batch of scheduled evaluations that had queued up were all posted at once. I used the @exitCode@ method to remove the jobs from the tasker side, but did not remove them from @wres.job@, so they were still processed by workers later. I'll include evidence showing what happened in comment 1.
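For reference, the "blank message" workaround amounts to publishing an empty body to the job's exit-code queue. The sketch below uses the RabbitMQ Java client; the client choice and the queue name are assumptions for illustration, not a description of the tasker's actual implementation.

```java
import java.io.IOException;

import com.rabbitmq.client.Channel;

public class ExitCodeWorkaround
{
    /**
     * Publishes an empty body to the job's exit-code queue via the default
     * exchange, so that the tasker records the evaluation as failed. The
     * queue name below is an assumed naming convention for illustration only.
     */
    public static void postBlankExitCode( Channel channel, String jobId ) throws IOException
    {
        String exitCodeQueue = "wres.job." + jobId + ".exitCode"; // Assumed name

        // With the default exchange, the routing key is the queue name.
        channel.basicPublish( "", exitCodeQueue, null, new byte[ 0 ] );
    }
}
```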

This ticket can be resolved once we have identified a means to clear IN_QUEUE evaluations from the broker's @wres.job@ queue or otherwise make it so that the workers don't bother to process evaluations for which watcher queues are no longer present.

Hank
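For illustration, one way a worker could implement the watcher-queue check described above is to passively declare the tasker's watcher queues before starting the evaluation and skip the job when any declaration fails. This is a minimal sketch assuming a RabbitMQ-style broker; the queue names are placeholders, not the actual WRES names.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;

public class WatcherQueueCheck
{
    /**
     * Returns true only if every watcher queue for the job still exists.
     * The queue names below are illustrative assumptions, not the actual
     * names the WRES tasker declares.
     */
    public static boolean watchersPresent( Connection connection, String jobId )
            throws IOException
    {
        String[] watcherQueues = {
            "wres.job." + jobId + ".exitCode",
            "wres.job." + jobId + ".output",
            "wres.job." + jobId + ".stdout",
            "wres.job." + jobId + ".stderr"
        };

        for ( String queue : watcherQueues )
        {
            // Use a fresh channel per check, because a failed passive declare
            // closes the channel it was issued on.
            Channel channel = connection.createChannel();

            try
            {
                // Succeeds only if the queue already exists; otherwise the
                // broker responds with a not-found error.
                channel.queueDeclarePassive( queue );
            }
            catch ( IOException missingQueue )
            {
                return false;
            }
            finally
            {
                if ( channel.isOpen() )
                {
                    try
                    {
                        channel.close();
                    }
                    catch ( IOException | TimeoutException ignored )
                    {
                        // Best-effort close only.
                    }
                }
            }
        }

        return true;
    }
}
```

A worker could call something like this right after consuming a message from the @wres.job@ queue and, if it returns false, acknowledge the message and move on instead of launching the evaluation; whether that fits cleanly into the actual WRES worker code is an assumption here.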

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:23:20Z


The WRES GUI posted a few dozen evaluations immediately when the scheduler turned back on. I removed the watcher queues (established by the tasker) for all but 4 of those evaluations, so that those evaluations are reported as failures by the tasker. The Postgres database reports this:

 RFC Training Example NWM Single-Valued | 229010997229da684ed9b8e006141636 | EnSo9sorJ83jpDIRFetY7TUI93s | execute lab | f      | 2024-06-07 13:00:21.849076 | 2024-06-07 13:58:59.230247 |     58.62301951646805
 RFC Training Example NWM Single-Valued | 229010997229da684ed9b8e006141636 | 862BI75ucJAESeDKNdNagvaWHZU | execute lab | f      | 2024-06-07 13:00:21.911094 | 2024-06-07 13:59:30.375609 |     59.14107524951299
 RFC Training Example NWM Single-Valued | 229010997229da684ed9b8e006141636 | 7qznfP17vF74SpFvtfMqZxzjRbk | execute lab | f      | 2024-06-07 13:00:22.096308 | 2024-06-07 13:52:12.387063 |     51.83817925055822
 RFC Training Example NWM Single-Valued | 229010997229da684ed9b8e006141636 | DooHC4bonN1UHTDmbffPAtKjWa4 | execute lab | f      | 2024-06-07 13:00:22.112235 | 2024-06-07 13:59:31.330638 |    59.153640047709146
 RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | oyU873g-WBqVWW8VL41JKBppaQc | execute lab | f      | 2024-06-07 13:27:02.217832 | 2024-06-07 14:00:36.300868 |     33.56805059909821
 RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | WzcFnXhIVW1jXxNXaMBEwkgc7PI | execute lab | f      | 2024-06-07 13:52:14.616349 | 2024-06-07 14:23:27.802306 |     31.21976594924927
 RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | AJth-H4lSPUsm-L4xCW--_-fupc | execute lab | f      | 2024-06-07 13:59:02.230789 | 2024-06-07 14:54:19.108584 |     55.28129658301671
 RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | BnGG8w9SBNNxkoMMiX9g7uNWEkA | execute lab | f      | 2024-06-07 13:59:33.292536 | 2024-06-07 14:56:09.854053 |     56.60935861666997
 RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | gIGI3itDloLEsG20oBv1fkQikb0 | execute lab | f      | 2024-06-07 13:59:36.105116 | 2024-06-07 14:56:15.739332 |    56.660570267836256
 RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | UxGx2xE8bHJcDtdTjWvC43_eOy0 | execute lab | f      | 2024-06-07 14:00:38.481486 | 2024-06-07 14:50:15.724607 |    49.620718681812285
 RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | Qg2osXCcMMEUnxn0gxXAnUgBrAU | execute lab | f      | 2024-06-07 14:23:30.167419 | 2024-06-07 14:57:00.157531 |     33.49983520110448
 RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | 0F71BTwyCHTtKl8J8ALO9uJ5Zhg | execute lab | f      | 2024-06-07 14:50:17.796478 | 2024-06-07 15:32:09.567772 |     41.86285489797592
 RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | YXSJNHpVbCw3ma1anNkjUBK2HLk | execute lab | f      | 2024-06-07 14:54:22.34396  | 2024-06-07 15:47:16.587197 |     52.90405395030975
 RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | zWB9a8uzcTR3YgNweNRPyWABwcg | execute lab | f      | 2024-06-07 14:56:13.767155 | 2024-06-07 15:52:36.046408 |     56.37132088343302
 RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | Tt3A3bhg39GJl4ROcY7_8RMcyw0 | execute lab | f      | 2024-06-07 14:56:17.771013 | 2024-06-07 15:52:38.047572 |     56.33794264793396
 RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | MYij626DELnbCh-g78_za4JCmos | execute lab | f      | 2024-06-07 14:57:01.364324 | 2024-06-07 15:47:09.439076 |    50.134579197565714
 RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | CoO0yFuqM3QY5nXWxNWQ83-srdI | execute lab | f      | 2024-06-07 15:32:11.481639 | 2024-06-07 16:00:09.364552 |    27.964715218544008
 RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | zxTcFAXwRjUXjHQWOd8led-ye0A | execute lab | f      | 2024-06-07 15:47:11.29719  | 2024-06-07 16:29:03.192014 |     41.86491373380025
 RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | E2yA1U-nWAzB2i5HTjP5qY6C-OA | execute lab | f      | 2024-06-07 15:47:20.975943 | 2024-06-07 16:32:58.541721 |     45.62609630028407
 RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | bs4iBTwuDmmEPH6H6q6cr60AicE | execute lab | f      | 2024-06-07 15:52:38.444504 | 2024-06-07 16:40:42.244788 |      48.0633380651474
 RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | gWsRbeCjFVVUXOuEDQmalZIVYwQ | execute lab | f      | 2024-06-07 15:52:40.19256  | 2024-06-07 16:41:05.606526 |     48.42356609900792
 RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | FrqDlAt-VV-Umh2WVODfaN1l0Ls | execute lab | f      | 2024-06-07 16:00:11.257448 | 2024-06-07 16:38:15.130065 |      38.0645436167717
 RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | gQKPUiTO0DqeNjFmAVzAdBMDB4k | execute lab | f      | 2024-06-07 16:29:06.44508  | 2024-06-07 16:49:51.664579 |    20.753658314545948
 RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | _XlM4LCsOqtpJbbxQu_4Zlrag5A | execute lab | f      | 2024-06-07 16:33:02.140836 | 2024-06-07 17:10:13.156477 |     37.18359401623408
 RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | zx3zSAmzd8tlX78PkNzR9bC_E7g | execute lab | f      | 2024-06-07 16:38:17.459477 | 2024-06-07 17:14:57.896594 |      36.6739519516627
 RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | QJr1Jl_SzSsAQIAeqk8MjlAHBPQ | execute lab | f      | 2024-06-07 16:40:44.967995 | 2024-06-07 17:18:43.166174 |     37.96996965010961
 RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | F26lwuWND3XqEHTfQidA7wvstXE | execute lab | f      | 2024-06-07 16:41:08.058609 | 2024-06-07 17:18:46.917754 |     37.64765241543452

Every one of these evaluations was still processed, even though the tasker stopped watching for them when I removed the watcher queues.

Thanks,

Hank

========

P.S. An aside... The declaration hashes appear to repeat a few times because of how the WRES GUI handles scheduled evaluations: it appears to fill in the date information when the scheduler hands off the evaluation for execution. Hence, the first 4 evaluations above all used this date range:

```yaml
reference_dates:
  minimum: '2024-05-28T00:00:00Z'
  maximum: '2024-06-07T12:58:21Z'
valid_dates:
  minimum: '2024-05-28T00:00:00Z'
  maximum: '2024-06-07T12:58:21Z'
```

The next batch used a range ending a second later:

```yaml
reference_dates:
  minimum: '2024-05-28T00:00:00Z'
  maximum: '2024-06-07T12:58:22Z'
valid_dates:
  minimum: '2024-05-28T00:00:00Z'
  maximum: '2024-06-07T12:58:22Z'
```

And so on.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-13T18:41:32Z


This is probably a duplicate of another ticket that extends our cancel functionality from the low-level worker API, where it is working fine, to the top-level service API. I think the way we would do that is to broadcast the cancellation to all workers with a specific job id; the worker running that evaluation cancels it, and the others ignore it. But I think this is a duplicate ticket.
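For context, the broadcast approach described above could look roughly like the sketch below, assuming a RabbitMQ-style broker. The exchange name, message format, and the Worker interface are placeholders for illustration, not the actual WRES cancel API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DeliverCallback;

public class CancelBroadcast
{
    private static final String CANCEL_EXCHANGE = "wres.cancel"; // Assumed name

    /** Tasker side: broadcast the job id of the evaluation to cancel. */
    public static void broadcastCancel( Channel channel, String jobId ) throws IOException
    {
        channel.exchangeDeclare( CANCEL_EXCHANGE, "fanout", true );
        channel.basicPublish( CANCEL_EXCHANGE, "", null,
                              jobId.getBytes( StandardCharsets.UTF_8 ) );
    }

    /** Worker side: listen for cancellations and stop the matching evaluation. */
    public static void listenForCancel( Channel channel, Worker worker ) throws IOException
    {
        channel.exchangeDeclare( CANCEL_EXCHANGE, "fanout", true );

        // Each worker binds its own exclusive, server-named queue to the fanout
        // exchange, so every worker sees every cancellation message.
        String queue = channel.queueDeclare().getQueue();
        channel.queueBind( queue, CANCEL_EXCHANGE, "" );

        DeliverCallback onCancel = ( tag, delivery ) -> {
            String jobId = new String( delivery.getBody(), StandardCharsets.UTF_8 );

            // Only the worker running this job acts; the others ignore it.
            if ( jobId.equals( worker.currentJobId() ) )
            {
                worker.cancelCurrentEvaluation();
            }
        };

        channel.basicConsume( queue, true, onCancel, tag -> { } );
    }

    /** Minimal worker abstraction assumed for this sketch. */
    public interface Worker
    {
        String currentJobId();
        void cancelCurrentEvaluation();
    }
}
```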

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:46:19Z


This may be covered by that other ticket, but this is a little bit different in that the evaluation to cancel is IN_QUEUE; there is no worker processing it. However, let me find that ticket and, if we want to track the cancellation of both in-queue and ongoing evaluations in a single ticket, I can reject this one.

Hank

EDIT: Fixed typo.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-13T18:48:58Z


OK, but cancel means cancel, so the other ticket needs to encapsulate that requirement, i.e., cancel regardless of evaluation stage.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-06-13T18:49:32Z


I think we can have separate tickets because I think this is different enough, in the sense that "whether or not we cancel an evaluation, we shouldn't process an entire evaluation if nothing is waiting for its output, exit code, or streams".

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-06-13T18:50:45Z


Which I think extends beyond cancellation. If for some reason we start processing a job and notice there are no queues, then there is no point in tying up the worker.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:51:12Z


I'm open to repurposing this ticket and making sure the other one includes evaluations in-queue. Just let me know and I'll make the changes.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:51:46Z


Actually, it sounds like you all are letting me know now. :)

Let me make the changes,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:58:55Z


Repurposed,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-13T19:02:10Z


Evan wrote:

I think we can have separate tickets because I think this is different enough, in the sense that "whether or not we cancel an evaluation, we shouldn't process an entire evaluation if nothing is waiting for its output, exit code, or streams".

I think cancel means cancel promptly, too. There should be one route to stopping an evaluation promptly, and that should be the cancel API. There may be several strands to its implementation, depending on the state/stage of the evaluation, but it is a single requirement. At the same time, I think the software has bugs if there is an evaluation ongoing and nothing is anticipating its completion. But, insofar as a service admin needs to intervene and remove a job, regardless of state, that requirement is job cancellation.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-13T19:03:55Z


In short, I have no problem with a new ticket (this ticket) that proposes to fix a bug, but anything that involves the intervention of a service admin to remove a job promptly is part of the cancel ticket.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-06-13T19:06:02Z


I agree. This revealed a bug where Hank deleted the queues that were waiting for the jobs' responses, but the evaluations still went ahead without first checking whether anything was waiting for their output.