epag opened this issue 2 months ago
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:23:20Z
The WRES GUI posted a few dozen evaluations immediately when the scheduler turned back on. I removed the watcher queues (established by the tasker) for all but 4 of those evaluations, so that all of them are reported as failures by the tasker. The Postgres database reports the following; the last column is the run time in minutes:
RFC Training Example NWM Single-Valued | 229010997229da684ed9b8e006141636 | EnSo9sorJ83jpDIRFetY7TUI93s | execute lab | f | 2024-06-07 13:00:21.849076 | 2024-06-07 13:58:59.230247 | 58.62301951646805
RFC Training Example NWM Single-Valued | 229010997229da684ed9b8e006141636 | 862BI75ucJAESeDKNdNagvaWHZU | execute lab | f | 2024-06-07 13:00:21.911094 | 2024-06-07 13:59:30.375609 | 59.14107524951299
RFC Training Example NWM Single-Valued | 229010997229da684ed9b8e006141636 | 7qznfP17vF74SpFvtfMqZxzjRbk | execute lab | f | 2024-06-07 13:00:22.096308 | 2024-06-07 13:52:12.387063 | 51.83817925055822
RFC Training Example NWM Single-Valued | 229010997229da684ed9b8e006141636 | DooHC4bonN1UHTDmbffPAtKjWa4 | execute lab | f | 2024-06-07 13:00:22.112235 | 2024-06-07 13:59:31.330638 | 59.153640047709146
RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | oyU873g-WBqVWW8VL41JKBppaQc | execute lab | f | 2024-06-07 13:27:02.217832 | 2024-06-07 14:00:36.300868 | 33.56805059909821
RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | WzcFnXhIVW1jXxNXaMBEwkgc7PI | execute lab | f | 2024-06-07 13:52:14.616349 | 2024-06-07 14:23:27.802306 | 31.21976594924927
RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | AJth-H4lSPUsm-L4xCW--_-fupc | execute lab | f | 2024-06-07 13:59:02.230789 | 2024-06-07 14:54:19.108584 | 55.28129658301671
RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | BnGG8w9SBNNxkoMMiX9g7uNWEkA | execute lab | f | 2024-06-07 13:59:33.292536 | 2024-06-07 14:56:09.854053 | 56.60935861666997
RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | gIGI3itDloLEsG20oBv1fkQikb0 | execute lab | f | 2024-06-07 13:59:36.105116 | 2024-06-07 14:56:15.739332 | 56.660570267836256
RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | UxGx2xE8bHJcDtdTjWvC43_eOy0 | execute lab | f | 2024-06-07 14:00:38.481486 | 2024-06-07 14:50:15.724607 | 49.620718681812285
RFC Training Example NWM Single-Valued | 2ad134de244aac0b868159dae3655c64 | Qg2osXCcMMEUnxn0gxXAnUgBrAU | execute lab | f | 2024-06-07 14:23:30.167419 | 2024-06-07 14:57:00.157531 | 33.49983520110448
RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | 0F71BTwyCHTtKl8J8ALO9uJ5Zhg | execute lab | f | 2024-06-07 14:50:17.796478 | 2024-06-07 15:32:09.567772 | 41.86285489797592
RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | YXSJNHpVbCw3ma1anNkjUBK2HLk | execute lab | f | 2024-06-07 14:54:22.34396 | 2024-06-07 15:47:16.587197 | 52.90405395030975
RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | zWB9a8uzcTR3YgNweNRPyWABwcg | execute lab | f | 2024-06-07 14:56:13.767155 | 2024-06-07 15:52:36.046408 | 56.37132088343302
RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | Tt3A3bhg39GJl4ROcY7_8RMcyw0 | execute lab | f | 2024-06-07 14:56:17.771013 | 2024-06-07 15:52:38.047572 | 56.33794264793396
RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | MYij626DELnbCh-g78_za4JCmos | execute lab | f | 2024-06-07 14:57:01.364324 | 2024-06-07 15:47:09.439076 | 50.134579197565714
RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | CoO0yFuqM3QY5nXWxNWQ83-srdI | execute lab | f | 2024-06-07 15:32:11.481639 | 2024-06-07 16:00:09.364552 | 27.964715218544008
RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | zxTcFAXwRjUXjHQWOd8led-ye0A | execute lab | f | 2024-06-07 15:47:11.29719 | 2024-06-07 16:29:03.192014 | 41.86491373380025
RFC Training Example NWM Single-Valued | dd86816744d8a8a383e09e293c4d9192 | E2yA1U-nWAzB2i5HTjP5qY6C-OA | execute lab | f | 2024-06-07 15:47:20.975943 | 2024-06-07 16:32:58.541721 | 45.62609630028407
RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | bs4iBTwuDmmEPH6H6q6cr60AicE | execute lab | f | 2024-06-07 15:52:38.444504 | 2024-06-07 16:40:42.244788 | 48.0633380651474
RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | gWsRbeCjFVVUXOuEDQmalZIVYwQ | execute lab | f | 2024-06-07 15:52:40.19256 | 2024-06-07 16:41:05.606526 | 48.42356609900792
RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | FrqDlAt-VV-Umh2WVODfaN1l0Ls | execute lab | f | 2024-06-07 16:00:11.257448 | 2024-06-07 16:38:15.130065 | 38.0645436167717
RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | gQKPUiTO0DqeNjFmAVzAdBMDB4k | execute lab | f | 2024-06-07 16:29:06.44508 | 2024-06-07 16:49:51.664579 | 20.753658314545948
RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | _XlM4LCsOqtpJbbxQu_4Zlrag5A | execute lab | f | 2024-06-07 16:33:02.140836 | 2024-06-07 17:10:13.156477 | 37.18359401623408
RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | zx3zSAmzd8tlX78PkNzR9bC_E7g | execute lab | f | 2024-06-07 16:38:17.459477 | 2024-06-07 17:14:57.896594 | 36.6739519516627
RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | QJr1Jl_SzSsAQIAeqk8MjlAHBPQ | execute lab | f | 2024-06-07 16:40:44.967995 | 2024-06-07 17:18:43.166174 | 37.96996965010961
RFC Training Example NWM Single-Valued | 1498a6247b0fd2738f6744fe2fd125da | F26lwuWND3XqEHTfQidA7wvstXE | execute lab | f | 2024-06-07 16:41:08.058609 | 2024-06-07 17:18:46.917754 | 37.64765241543452
Every evaluation was still processed, even though the tasker had stopped watching them once I removed the watcher queues.
Thanks,
Hank
========
P.S. An aside... The declaration hashes appear to repeat a few times because of how the WRES GUI handles scheduled evaluations: it appears to fill in the date information when the scheduler hands off an evaluation for execution. Hence, the first 4 evaluations above all used this date range:
```yaml
reference_dates:
  minimum: '2024-05-28T00:00:00Z'
  maximum: '2024-06-07T12:58:21Z'
valid_dates:
  minimum: '2024-05-28T00:00:00Z'
  maximum: '2024-06-07T12:58:21Z'
```
The next batch used a range ending a second later:
```yaml
reference_dates:
  minimum: '2024-05-28T00:00:00Z'
  maximum: '2024-06-07T12:58:22Z'
valid_dates:
  minimum: '2024-05-28T00:00:00Z'
  maximum: '2024-06-07T12:58:22Z'
```
And so on.
Original Redmine Comment Author Name: James (James) Original Date: 2024-06-13T18:41:32Z
This is probably a duplicate of another ticket that extends our cancel functionality from the low-level worker API, where it is working fine, to the top-level service API. I think the way we do that is to broadcast the cancellation to all workers with a specific job ID; the worker running that evaluation cancels it and the others ignore it. But I think this is a duplicate ticket.
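For concreteness, a minimal sketch of the broadcast pattern described above, assuming a RabbitMQ broker and the com.rabbitmq.client Java API; the exchange name wres.cancel and the plain job-ID message body are hypothetical, not the actual WRES wire format:

```java
import com.rabbitmq.client.BuiltinExchangeType;
import com.rabbitmq.client.Channel;

import java.nio.charset.StandardCharsets;

public class CancelBroadcast
{
    // Hypothetical fanout exchange for cancellation broadcasts.
    private static final String CANCEL_EXCHANGE = "wres.cancel";

    /** Tasker side: broadcast a cancellation of one job to all workers. */
    public static void broadcastCancel( Channel channel, String jobId ) throws Exception
    {
        channel.exchangeDeclare( CANCEL_EXCHANGE, BuiltinExchangeType.FANOUT, true );
        channel.basicPublish( CANCEL_EXCHANGE, "", null,
                              jobId.getBytes( StandardCharsets.UTF_8 ) );
    }

    /** Worker side: every worker hears every broadcast, but only the worker
        running the named job acts on it. */
    public static void listenForCancel( Channel channel, String myJobId,
                                        Runnable cancelAction ) throws Exception
    {
        channel.exchangeDeclare( CANCEL_EXCHANGE, BuiltinExchangeType.FANOUT, true );
        // Exclusive, auto-delete queue gives this worker its own copy of each broadcast.
        String queue = channel.queueDeclare( "", false, true, true, null ).getQueue();
        channel.queueBind( queue, CANCEL_EXCHANGE, "" );
        channel.basicConsume( queue, true, ( tag, delivery ) -> {
            String jobId = new String( delivery.getBody(), StandardCharsets.UTF_8 );
            if ( jobId.equals( myJobId ) )
            {
                cancelAction.run(); // This worker owns the job: cancel it.
            }
            // Otherwise ignore: the broadcast targets a different worker.
        }, tag -> { } );
    }
}
```

A fanout exchange keeps the tasker ignorant of which worker holds the job, which matches the "broadcast and let the owner act" description.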
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:46:19Z
This may be covered by that other ticket, but this is a little different, in that the evaluation to cancel is IN_QUEUE: there is no worker processing it. However, let me find that ticket and, if we want to track the cancellation of both in-queue and ongoing evaluations in a single ticket, I can reject this one.
Hank
EDIT: Fixed typo.
Original Redmine Comment Author Name: James (James) Original Date: 2024-06-13T18:48:58Z
OK, but cancel means cancel, so the other ticket needs to encapsulate that requirement, i.e., cancel regardless of evaluation stage.
Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-06-13T18:49:32Z
I think we should have separate tickets, because I think this is different enough, in the sense that "Whether or not we cancel an evaluation, we shouldn't process an entire evaluation if nothing is waiting for its output, exit code, or streams".
Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-06-13T18:50:45Z
Which I think extends beyond cancellation. If for some reason we start processing a job and notice there are no queues, then there is no point in tying up the worker.
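A minimal sketch of that pre-flight check, assuming RabbitMQ and the com.rabbitmq.client Java API; the queue-naming convention wres.job.status.<jobId> is a hypothetical stand-in for whatever watcher queue the tasker actually declares. Note that a failed passive declare closes the channel, so the probe uses a throwaway channel:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;

import java.io.IOException;
import java.util.concurrent.TimeoutException;

public class WatcherQueueCheck
{
    /**
     * Returns true if a watcher queue still exists for the given job.
     * queueDeclarePassive() throws when the queue is absent, and the broker
     * closes the channel with it, hence the short-lived probe channel.
     */
    public static boolean watcherQueueExists( Connection connection, String jobId )
            throws IOException, TimeoutException
    {
        Channel probe = connection.createChannel();
        try
        {
            probe.queueDeclarePassive( "wres.job.status." + jobId );
            probe.close();
            return true;
        }
        catch ( IOException e )
        {
            // Channel-level 404: the queue is gone, nobody is watching this job.
            return false;
        }
    }
}
```

A worker could run this check right after taking a message from wres.job and, when it returns false, acknowledge the message without evaluating it, freeing the worker immediately.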
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:51:12Z
I'm open to repurposing this ticket and making sure the other one includes evaluations in-queue. Just let me know and I'll make the changes.
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:51:46Z
Actually, it sounds like you all are letting me know now. :)
Let me make the changes,
Hank
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-13T18:58:55Z
Repurposed,
Hank
Original Redmine Comment Author Name: James (James) Original Date: 2024-06-13T19:02:10Z
Evan wrote:
I think we should have separate tickets, because I think this is different enough, in the sense that "Whether or not we cancel an evaluation, we shouldn't process an entire evaluation if nothing is waiting for its output, exit code, or streams"
I think cancel means cancel promptly, too. There should be one route to stopping an evaluation, promptly, and that should be the cancel API. There may be several strands to its implementation, depending on the stage/state of the evaluation, but the requirement is one requirement. At the same time, I think the software has bugs if there is an evaluation ongoing and nothing is anticipating its completion. But, insofar as a service admin needs to intervene and remove a job, regardless of state, that requirement is job cancellation.
Original Redmine Comment Author Name: James (James) Original Date: 2024-06-13T19:03:55Z
In short, I have no problem with a new ticket (this ticket) that proposes to fix a bug, but anything that involves the intervention of a service admin to remove a job promptly is part of the cancel ticket.
Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-06-13T19:06:02Z
I agree. This revealed a bug: Hank deleted the queues that were waiting for the job responses, but the evaluations still ran without first checking whether anything was waiting for their responses.
Author Name: Hank (Hank) Original Redmine Issue: 131491, https://vlab.noaa.gov/redmine/issues/131491 Original Date: 2024-06-13
Repurposed. See the comments below. If a worker picks up an evaluation but nothing is watching for its results and output, then there is no reason for the worker to perform that evaluation.
The original description is below. Thanks,
Hank
=========================================================== ORIGINAL DESCRIPTION:
This could be particularly bothersome if the extra evaluations each take hours to complete.
While we have an identified mechanism to force the tasker to stop watching for an evaluation to complete (post a blank message to its `exitCode` queue and the tasker will treat it as a failed evaluation), we have no means to remove those evaluations from the broker's `wres.job` queue. That means the evaluations will still be grabbed by workers and processed. Evidence for that is in production now: upon turning on the GUI scheduler, a bunch of queued, scheduled evaluations were all posted at once. I used the `exitCode` method to remove the jobs from the tasker side, but did not remove them from `wres.job`, so they were still processed by workers later. I'll include evidence showing what happened in comment 1.
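For illustration, a minimal sketch of that workaround, assuming RabbitMQ and the com.rabbitmq.client Java API; the queue name wres.job.exitCode.<jobId> is a hypothetical stand-in for the actual per-job exit-code queue:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class PostBlankExitCode
{
    public static void main( String[] args ) throws Exception
    {
        String jobId = args[0];
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost( "localhost" ); // Assumed broker host.

        try ( Connection connection = factory.newConnection();
              Channel channel = connection.createChannel() )
        {
            // Publish an empty body via the default exchange to the job's
            // exit-code queue. The tasker cannot parse it as an exit code, so
            // it marks the job failed and stops watching, but the job message
            // itself remains in wres.job for a worker to pick up.
            channel.basicPublish( "", "wres.job.exitCode." + jobId, null, new byte[ 0 ] );
        }
    }
}
```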
This ticket can be resolved once we have identified a means to clear IN_QUEUE evaluations from the broker's `wres.job` queue, or otherwise make it so that workers don't bother to process evaluations for which the watcher queues are no longer present.
Hank