Closed by AetherUnbound 9 months ago
To address this, we should add a check for indexer worker availability to the health check endpoint (the `/` route of the "server"'s API).
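A rough sketch of what such an availability check might look like (the healthcheck URL shape, function name, and worker URLs here are assumptions for illustration, not the actual Openverse API):

```python
from urllib.request import urlopen
from urllib.error import URLError


def workers_available(worker_urls, timeout=2):
    """Return True only if every indexer worker responds to its
    healthcheck endpoint (the `/healthcheck` path is an assumption).

    The server's own health check could call this and report degraded
    status when any worker is unreachable.
    """
    for url in worker_urls:
        try:
            with urlopen(f"{url}/healthcheck", timeout=timeout) as resp:
                if resp.status != 200:
                    return False
        except URLError:
            # DNS may resolve while the worker process is down, so a
            # connection error here still means "unavailable".
            return False
    return True
```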
Though, I'm not sure how this would work. IIRC the workers are stopped when they aren't being used, right? If that's the case, then we'd need to add some kind of startup check for worker availability? :thinking:
They are stopped when not running an actual index job. I think the important piece here is adding some mechanism on the ingestion server which alerts when this particular error case occurs (rather than remaining silent).
I see. Are you saying, rather than a healthcheck, the ingestion server should properly report connection errors to the workers when trying to run an index? As in, we need to raise an error here: https://github.com/WordPress/openverse/blob/697f62f01a32cb7fcf2f4a7627650a113cba40da/ingestion_server/ingestion_server/distributed_reindex_scheduler.py#L46-L50
It's quite odd indeed that it just returns false, which is never handled!
Yes! We can certainly reuse that wait-for-healthcheck logic, but if that step fails we need to raise an error so it can be surfaced appropriately.
Description
We recently encountered an issue in production wherein the indexer workers never initialized but the ingestion server was able to initialize fine. This meant that when a data refresh was initiated, the ingestion server attempted to send jobs to the indexer workers but was unable to do so since the workers (while available via DNS) were not responding to healthchecks. Here are the logs for that time period.
Reproduction
I was unable to reproduce this locally by stopping the indexer worker while an initialization was running (e.g. `just api/init`), because that resulted in an actual exception regarding a failed name resolution.