Open HankHerr-NOAA opened 2 weeks ago
From VLab...
Evan noted this:
What the issue looks like to me based on your description is the following:
- I think the communication error is a red harring, because if it was the root it would happen before the second wrning you sent.
- We have a timer in place between when a project is completed before it is closed to ensure unclosed projects don't perpetually hang a worker
- If that timer is triggered we close the project
The project being closed mid read of the results SHOULDNT cause any issue, but that is looking like what happened. I can look more into this next week, but we may need to extend the errant job timeout to allow for more time before forecably closing an evaluation
James responded:
If that is the cause, then I don't think the solution can be additional time - because there will always be a bigger evaluation than the arbitrary wait, and the wait cannot be indefinite - it must be to wait until all outputs have been served when the serving is in progress, in other words an event-based trigger, not a time trigger.
Thanks,
Hank
To your comment James I guess we could do the following:
This way the time is truly just monitoring the state transitions and allow unlimited time for the output location to be retrieved.
I am going to test though if this is indeed what is causing it by reducing the time amount locally and seeing if I can reproduce
See VLab #120579-238. The evaluation was pretty small and fast with little stdout and little output. In the stdout log, I saw this:
Note the warning. The status endpoint returns a failure (better messaging, above, about that failure would be nice, since it was just a warning).
In the shim, I saw this exception (in context):
Evan postulated that something slowed the ability the shim to get the final status update and outputs from the core server within a set time out, and that led to an exception which lead to the status failure being returned to the user.
The problem is not easily reproducible. The exact same evaluation completed successfully minutes later and I have not seen that exception, or any related exception like that (i.e., representing a shim-server breakdown), in months.
I'll add a comment with notes from Evan and James in the ticket after posting,
Hank