As a developer, if the core server completes an evaluation, I don't want the user to receive a failure because of a timeout during relaying outputs to the shim.

HankHerr-NOAA commented 2 weeks ago

See VLab #120579-238. The evaluation was pretty small and fast with little stdout and little output. In the stdout log, I saw this:

2024-10-04T17:15:08.530+0000  [pool-2-thread-1] INFO EvaluationService - Evaluation with internal ID 1901617939395097451 and evaluation ID of P6YTi1bggENyDcwc19rpTFNYIwY has returned
2024-10-04T17:15:08.876+0000  [Thread-9] WARN EvaluationService - There was not an expected status transition; timeout thread closing project in status COMPLETED
2024-10-04T17:15:08.876+0000  [Thread-9] INFO EvaluationService - Closing Evaluation

Note the warning. The status endpoint returns a failure (better messaging, above, about that failure would be nice, since it was just a warning).

In the shim, I saw this exception (in context):

2024-10-04T17:15:08.880+0000  [pool-4-thread-1] INFO JobStandardStreamMessenger - Finished sending STDOUT for job 6948032875964178351
2024-10-04T17:15:09.325+0000  [main] WARN JobStatusMessenger - Failed to create channel, declare exchange, or get status.
java.io.IOException: Unable to get status with given ID
    at wres.worker.JobStatusMessenger.getEvaluationStatus(JobStatusMessenger.java:175)
    at wres.worker.JobStatusMessenger.run(JobStatusMessenger.java:136)
    at wres.worker.WresEvaluationProcessor.executeEvaluation(WresEvaluationProcessor.java:257)
    at wres.worker.WresEvaluationProcessor.call(WresEvaluationProcessor.java:184)
    at wres.worker.Worker.main(Worker.java:141)
2024-10-04T17:15:09.327+0000  [main] INFO JobStatusMessenger - Finished sending STATUS for job 6948032875964178351

Evan postulated that something slowed the ability the shim to get the final status update and outputs from the core server within a set time out, and that led to an exception which lead to the status failure being returned to the user.

The problem is not easily reproducible. The exact same evaluation completed successfully minutes later and I have not seen that exception, or any related exception like that (i.e., representing a shim-server breakdown), in months.

I'll add a comment with notes from Evan and James in the ticket after posting,

Hank

HankHerr-NOAA commented 2 weeks ago

From VLab...

Evan noted this:

What the issue looks like to me based on your description is the following:
- I think the communication error is a red harring, because if it was the root it would happen before the second wrning you sent.
- We have a timer in place between when a project is completed before it is closed to ensure unclosed projects don't perpetually hang a worker
- If that timer is triggered we close the project

The project being closed mid read of the results SHOULDNT cause any issue, but that is looking like what happened. I can look more into this next week, but we may need to extend the errant job timeout to allow for more time before forecably closing an evaluation

James responded:

If that is the cause, then I don't think the solution can be additional time - because there will always be a bigger evaluation than the arbitrary wait, and the wait cannot be indefinite - it must be to wait until all outputs have been served when the serving is in progress, in other words an event-based trigger, not a time trigger.

Thanks,

Hank

epag commented 2 weeks ago

To your comment James I guess we could do the following:

Start a timer when an evaluation completes (lets say 3 minutes)
Stop that timer when the call to get the results of an evaluation is started
Start a new timer when the evaluation results call returns (3 minutes again)
stop that timer when the close call is made

This way the time is truly just monitoring the state transitions and allow unlimited time for the output location to be retrieved.

I am going to test though if this is indeed what is causing it by reducing the time amount locally and seeing if I can reproduce

NOAA-OWP / wres

As a developer, if the core server completes an evaluation, I don't want the user to receive a failure because of a timeout during relaying outputs to the shim. #330