Ingestion server does not appropriately report when indexer workers cannot be reached

AetherUnbound commented 1 year ago

Description

We recently encountered an issue in production wherein the indexer workers never initialized but the ingestion server was able to initialize fine. This meant that when a data refresh was initiated, the ingestion server attempted to send jobs to the indexer workers but was unable to do so since the workers (while available via DNS) were not responding to healthchecks. Here are the logs for that time period.

2023-07-23T00:02:30.319Z    [2023-07-23 00:02:30,319 - root - 307][INFO] Running distributed index using indexer workers.
2023-07-23T00:02:31.562Z    [2023-07-23 00:02:31,562 - root - 89][INFO] Selected worker instances ['172.31.74.186', '172.31.71.220', '172.31.71.105', '172.31.74.208', '172.31.72.237', '172.31.74.125']
2023-07-23T00:02:32.097Z    [2023-07-23 00:02:32,096 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:02:37.099Z    [2023-07-23 00:02:37,098 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:02:42.100Z    [2023-07-23 00:02:42,100 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:02:47.102Z    [2023-07-23 00:02:47,102 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:02:52.104Z    [2023-07-23 00:02:52,104 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:02:57.106Z    [2023-07-23 00:02:57,106 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:02.108Z    [2023-07-23 00:03:02,108 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:07.109Z    [2023-07-23 00:03:07,109 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:12.111Z    [2023-07-23 00:03:12,111 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:17.113Z    [2023-07-23 00:03:17,113 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:22.115Z    [2023-07-23 00:03:22,114 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:27.117Z    [2023-07-23 00:03:27,116 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:32.118Z    [2023-07-23 00:03:32,118 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:37.120Z    [2023-07-23 00:03:37,120 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:42.122Z    [2023-07-23 00:03:42,122 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:47.124Z    [2023-07-23 00:03:47,124 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:52.125Z    [2023-07-23 00:03:52,125 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:03:57.128Z    [2023-07-23 00:03:57,127 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:02.129Z    [2023-07-23 00:04:02,129 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:07.131Z    [2023-07-23 00:04:07,131 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:12.133Z    [2023-07-23 00:04:12,133 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:17.135Z    [2023-07-23 00:04:17,134 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:22.136Z    [2023-07-23 00:04:22,136 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:27.138Z    [2023-07-23 00:04:27,138 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:32.140Z    [2023-07-23 00:04:32,140 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:37.142Z    [2023-07-23 00:04:37,142 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:42.144Z    [2023-07-23 00:04:42,144 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:47.145Z    [2023-07-23 00:04:47,145 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:52.147Z    [2023-07-23 00:04:52,147 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:04:57.149Z    [2023-07-23 00:04:57,149 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:02.151Z    [2023-07-23 00:05:02,151 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:07.153Z    [2023-07-23 00:05:07,153 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:12.155Z    [2023-07-23 00:05:12,155 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:17.156Z    [2023-07-23 00:05:17,156 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:22.158Z    [2023-07-23 00:05:22,158 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:27.161Z    [2023-07-23 00:05:27,161 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:32.163Z    [2023-07-23 00:05:32,162 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:37.164Z    [2023-07-23 00:05:37,164 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:42.166Z    [2023-07-23 00:05:42,166 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:47.168Z    [2023-07-23 00:05:47,168 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:52.169Z    [2023-07-23 00:05:52,169 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:05:57.172Z    [2023-07-23 00:05:57,171 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:02.173Z    [2023-07-23 00:06:02,173 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:07.175Z    [2023-07-23 00:06:07,175 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:12.177Z    [2023-07-23 00:06:12,177 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:17.179Z    [2023-07-23 00:06:17,178 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:22.180Z    [2023-07-23 00:06:22,180 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:27.183Z    [2023-07-23 00:06:27,183 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:32.185Z    [2023-07-23 00:06:32,184 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:37.186Z    [2023-07-23 00:06:37,186 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:42.188Z    [2023-07-23 00:06:42,188 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:47.190Z    [2023-07-23 00:06:47,190 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:52.192Z    [2023-07-23 00:06:52,191 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:06:57.194Z    [2023-07-23 00:06:57,193 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:07:02.195Z    [2023-07-23 00:07:02,195 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:07:07.197Z    [2023-07-23 00:07:07,197 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:07:12.199Z    [2023-07-23 00:07:12,198 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:07:17.200Z    [2023-07-23 00:07:17,200 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:07:22.202Z    [2023-07-23 00:07:22,202 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:07:27.204Z    [2023-07-23 00:07:27,204 - root - 107][INFO] Checking http://172.31.74.186:8002/healthcheck. . .
2023-07-23T00:07:32.206Z    [2023-07-23 00:07:32,206 - root - 117][ERROR] Timed out waiting for http://172.31.74.186:8002/healthcheck.
2023-07-23T00:07:32.281Z    [2023-07-23 00:07:32,281 - root - 268][INFO] Task 42294a1d73a44b4c87edce7e76cbccb0 completed.
2023-07-23T00:07:32.286Z    [2023-07-23 00:07:32,286 - urllib3.connectionpool - 823][WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2576)'))': /api/4504934126059520/envelope/

Reproduction

I was unable to reproduce this locally by stopping the indexer worker while an initialization was running (e.g. just api/init) because that resulted in an actual exception regarding a failed name resolution.

sarayourfriend commented 1 year ago

To address this, we should add a check for indexer worker availability in the health check endpoint (/ of the "server"'s API).

Though, I'm not sure how this would work. IIRC the workers are stopped when they aren't being used, right? If that's the case, then we'd need to add some kind of startup check for worker availability? :thinking:

AetherUnbound commented 1 year ago

They are stopped when not running an actual index job. I think the important piece here is adding some mechanism on the ingestion server which alerts when this particular error case occurs (rather than remain silent).

sarayourfriend commented 1 year ago

I see. Are you saying, rather than a healthcheck, the ingestion server should properly report connection errors to the workers when trying to run an index? As in, we need to raise an error here: https://github.com/WordPress/openverse/blob/697f62f01a32cb7fcf2f4a7627650a113cba40da/ingestion_server/ingestion_server/distributed_reindex_scheduler.py#L46-L50

It's quite odd indeed that it just returns false, which is never handled!

AetherUnbound commented 1 year ago

Yes! We can use that wait for healthcheck logic for sure, but then if that step fails we do need to raise an error in that case so it can be surfaced appropriately.

WordPress / openverse

Ingestion server does not appropriately report when indexer workers cannot be reached #2708

Description

Reproduction