james-d-brown opened 1 month ago
See #336 and #24, for example.
Having the server respond with a correct (4xx) error code for an invalid declaration is one thing, and that is largely an exercise in whack-a-mole because the underlying snakeyaml parser does not advertise the (many) unchecked exceptions it throws. That isn't the focus of this ticket.
A second issue, and the focus here, is having the validation go through the worker front door so that we get a simple/consistent pipeline without a short-circuit. Currently, there is a short-circuit in the `WresJob` class of the Tasker, so the `Worker` never sees this traffic.
Why was this done? Possibly to ensure responsive validation from callers that are sensitive to response times, like the GUI. On the other hand, I think we need to demonstrate a lack of responsiveness as material to the user experience before we continue with this approach because there are downsides both for developers and users. The downside for users is they don't get a clear trace of a job that fails declaration validation because no job is created at that stage (or ever, on the core app side). The downside for developers is spiderweb code with short-circuits.
The goal would be to further justify the current approach and, unless it is clearly justified, replace it with a simpler approach whereby all traffic goes via a worker/core app. We would add this behavior to the `WresEvaluationProcessor` (currently misnamed because it processes more than evaluations) and the `EvaluationService` on the core app side, which in turn calls the relevant `Functions::validate`.
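To make the front-door idea concrete, here is a minimal sketch of a unified worker entry point. All names here (`JobType`, `Job`, `dispatch`, and the stubbed `validate`) are hypothetical stand-ins, not the actual WRES classes; the point is only that both job flavours pass through one pipeline and both receive a job id:

```java
// Hypothetical sketch: one worker entry point handles both job types,
// so every request (validation or evaluation) gets a job id and a trace.
import java.util.UUID;

public class WorkerFrontDoorSketch
{
    enum JobType { VALIDATION, EVALUATION }

    record Job( String id, JobType type, String declarationYaml ) {}

    // Stub standing in for the real Functions::validate on the core app side
    static String validate( String yaml )
    {
        return yaml.isBlank() ? "INVALID: empty declaration" : "VALID";
    }

    static String dispatch( Job job )
    {
        // No short-circuit in the tasker: both paths flow through here
        return switch ( job.type() )
        {
            case VALIDATION -> "job " + job.id() + ": " + validate( job.declarationYaml() );
            case EVALUATION -> "job " + job.id() + ": evaluation enqueued";
        };
    }

    public static void main( String[] args )
    {
        Job validation = new Job( UUID.randomUUID().toString(), JobType.VALIDATION, "" );
        System.out.println( dispatch( validation ) );
    }
}
```

Because a `Job` is created before dispatch, a validation failure leaves the same audit trail as any evaluation, which addresses the traceability downside described above.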
This may require that we address other underlying problems of provisioning, such as all workers being busy and unable to process (quick) validation jobs at any given time, but I question whether this is any different from waiting for an evaluation more generally. After all, users that validate projects are no different from users that conduct evaluations. If users need to validate projects many times during preparation, fair enough. In that case, we may need additional workers or even dedicated workers for particular activities, but that is probably a step too far.
Simple solution: don't add workers until a problem becomes concrete; then add workers; then, finally, add special snowflake validation workers if the problem persists. But this may be a situation where fear got ahead of reality, TBD. Either way, there are real downsides to the short-circuit in the current approach, both for developers and users.
As noted in https://github.com/NOAA-OWP/wres/issues/336#issuecomment-2416572601, there is a concern that validation tasks need to wait for busy workers processing long evaluations. The main issue I see with this argument is that the validated evaluation needs to wait for those workers too. Is it really worse to wait for validation than evaluation? Questionable.
If the flow of a caller, like the GUI, makes this particularly cumbersome, then it probably needs to be addressed there (e.g., no indefinite spinning of wheels awaiting a call response). If users are waiting long periods, we need more workers, full stop, and I am not really convinced that the particular flavour of waiting (evaluation vs. validation) is important, unless there is something wrong with the caller, such as spinning wheels on regular validation calls.
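The "more workers, full stop" point can be illustrated with a toy single-queue simulation (all names here are illustrative, not WRES code): with a shared FIFO queue, a quick validation job waits exactly as long as any evaluation would, and adding workers shrinks the wait for every flavour of job at once.

```java
import java.util.PriorityQueue;

public class SharedQueueSketch
{
    // Simulate FIFO jobs over n workers; returns the mean wait before a job
    // starts. durations[i] is the service time of job i; all arrive at t=0.
    static double meanWait( int workers, double[] durations )
    {
        PriorityQueue<Double> freeAt = new PriorityQueue<>();
        for ( int i = 0; i < workers; i++ ) freeAt.add( 0.0 );
        double totalWait = 0.0;
        for ( double d : durations )
        {
            double start = freeAt.poll();   // earliest-free worker takes the job
            totalWait += start;             // job waited from t=0 until 'start'
            freeAt.add( start + d );
        }
        return totalWait / durations.length;
    }

    public static void main( String[] args )
    {
        // Nine long evaluations (60 time units) followed by one quick validation
        double[] jobs = { 60, 60, 60, 60, 60, 60, 60, 60, 60, 1 };
        System.out.println( "2 workers, mean wait: " + meanWait( 2, jobs ) );  // 120.0
        System.out.println( "5 workers, mean wait: " + meanWait( 5, jobs ) );  // 30.0
    }
}
```

In this toy run, going from two to five workers cuts the mean wait fourfold for validations and evaluations alike, which is the sense in which provisioning, not a special endpoint, is the lever.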
I see you acknowledged, above, the issues I reported in my comment on the other ticket. Thanks.
First, the validation endpoint was added to the tasker to support the WRES GUI, in particular. Specifically, to allow the WRES GUI to validate a declaration either with a user click or before posting an evaluation to the COWRES, without having to potentially wait for a worker to become free to respond.
Later, we added the functionality of validating before sending the evaluation to a worker. Again, we didn't want a worker to be hogged by an evaluation that fails validation, and we didn't want an evaluation that fails validation to wait for a free worker before letting the user know the YAML is bad.
There may be better paths to make this happen, preferably one that still allows a job id to be assigned, but whatever solution we put in place should not run the risk of waiting minutes or hours for the COWRES to respond just to say that there is a typo in the YAML. This was a real problem when I established the validation endpoint in the first place, so it will almost certainly become a real problem again if we just blindly move it back to the workers.
My two cents,
Hank
Again, I don't buy that argument. If a user is waiting for long periods, then there is a different problem: not a problem with the responsiveness of the validation endpoint, but a problem with provisioning. The current approach isn't working well for users who expect some traceability when evaluations fail validation. It also isn't working well for developers.
Setting up special workers for non-evaluation tasks, whereby a worker doesn't ack a non-evaluation job unless it is a non-evaluation worker, would not be ideal, but even that would be better than the current approach.
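The selective-ack pattern described here can be sketched, purely illustratively (this is not WRES's actual broker handling), as a consumer that requeues jobs whose type it does not serve:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SelectiveAckSketch
{
    record Msg( String jobId, String type ) {}

    // A worker that only accepts ("acks") jobs of its own type and
    // requeues ("nacks") everything else. Even in this toy version,
    // the extra dispatch complexity of typed workers is visible.
    static Msg takeNext( Deque<Msg> queue, String workerType )
    {
        int size = queue.size();
        for ( int i = 0; i < size; i++ )
        {
            Msg m = queue.poll();
            if ( m.type().equals( workerType ) )
            {
                return m;        // ack: this worker serves it
            }
            queue.add( m );      // nack + requeue for another worker
        }
        return null;             // nothing of our type is enqueued
    }

    public static void main( String[] args )
    {
        Deque<Msg> queue = new ArrayDeque<>();
        queue.add( new Msg( "1", "evaluation" ) );
        queue.add( new Msg( "2", "validation" ) );
        Msg next = takeNext( queue, "validation" );
        System.out.println( next.jobId() );  // prints 2
    }
}
```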
So my suggestion would be:
Being a user, I can absolutely tell you that waiting for a validation failure is annoying. However, you're right about it being "a different problem" at its core. That is, we wouldn't need the special validation code in the tasker if there was always a worker available to handle that validation. So, if we can make that happen, great. And that gets me to your suggestion...
(3) makes sense to me. It would essentially result in a worker that can handle the validation currently handled by the tasker, but would allow a job id to be created for that work and also avoid having the tasker make calls to worker code. I suppose it would mean that a clean of the "other" database would no longer have to wait in line, which would also be good.
Sounds good to me,
Hank
Right, but waiting hours for anything is extremely annoying, so we should address the root cause.
To be clear, the sequence above was a sequence, rather than a set of alternatives. In other words, we start with (1), proceed to (2), and then get to (3) when we are convinced we really need it. We're getting better at monitoring now, so we should be able to track and plot the distribution of enqueued jobs by wait time. I am not against making things more complex for a better user experience if needed, but emphasis on the "if needed", which needs evidence. I accept your earlier anecdotes of this being a pain, but I think (1) will largely address that problem and it makes sense to do it before (2).
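The monitoring idea could be as simple as computing percentiles over observed enqueue-to-dequeue waits; a minimal sketch (nearest-rank percentile, hypothetical data, not the actual WRES monitoring code):

```java
import java.util.Arrays;

public class QueueWaitMonitorSketch
{
    // Nearest-rank percentile over enqueue-to-dequeue waits, in seconds.
    static long percentile( long[] waitsSeconds, double p )
    {
        long[] sorted = waitsSeconds.clone();
        Arrays.sort( sorted );
        int rank = (int) Math.ceil( p / 100.0 * sorted.length );
        return sorted[Math.max( 0, rank - 1 )];
    }

    public static void main( String[] args )
    {
        // Hypothetical observed waits for enqueued jobs, in seconds
        long[] waits = { 2, 3, 5, 8, 13, 45, 120, 3600 };
        System.out.println( "p50 = " + percentile( waits, 50 ) + "s" );   // 8s
        System.out.println( "p95 = " + percentile( waits, 95 ) + "s" );   // 3600s
    }
}
```

A p95 measured in hours would be the kind of evidence that justifies moving from (1) to (2); a p95 measured in seconds would suggest the fear got ahead of reality.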
Given a request to the web service validation endpoint in cluster mode (i.e., via the tasker)
When that request is received
Then it should go via the front door of the worker (core app), meaning `Functions::validate`