conveyal / r5

Developed to power Conveyal's web-based interface for scenario planning and land-use/transport accessibility analysis, R5 is our routing engine for multimodal (transit/bike/walk/car) networks with a particular focus on public transit
https://conveyal.com/learn
MIT License
272 stars 71 forks source link

Reduce errors sent to broker #919

Closed ansoncfit closed 6 months ago

ansoncfit commented 6 months ago

When workers encounter an exception on a regional task, they send a String error message back to the broker: https://github.com/conveyal/r5/blob/v7.0/src/main/java/com/conveyal/r5/analyst/cluster/AnalysisWorker.java#L294-L300

Many workers sending many error messages at once can overwhelm the broker. If there is an exception, it will likely affect or invalidate all origins, so most of these messages are duplicative. Some ideas for revising worker behavior:

abyrd commented 6 months ago

It would be possible to limit error reporting on the worker side. But the workers are intended to report at most one batch (4*n_cores) of errored tasks before the backend stops delivering tasks to them. The fact that worker error reporting became a problem revealed an underlying problem in job cancellation (see #887 and #921).

Workers do not currently associate tasks with the batch they were received in, so the first suggestion would add a bit of complexity. The second suggestion would only involve keeping a set of errored jobs, which could be global to the worker or just one set local to each polling operation (I'm now realizing maybe this is what you meant by your first suggestion). This would be relatively simple, but it does add one more step to the reporting process (thus one more thing that can go wrong).

I'm inclined to just get the job cancellation working properly, in which case we should only ever get one batch of errors before an errored job stops. We could then reinforce it with the second suggestion (perhaps limited to the scope of a single polling operation) if we feel there's still risk to be managed.

abyrd commented 6 months ago

Closing this because we merged #918 to the default branch. In the future I'll try to more regularly set the associated issues on pull requests, which should auto-close them upon merge.