seanglynn-thrive closed this issue 1 year ago
FYI @olivermeyer @pcorbel @davinchia @marcosmarxm if you have anything to add :)
x-post: https://github.com/airbytehq/airbyte-platform/pull/205#issuecomment-1496712268
In short: you probably need to run multiple workers. By default each worker can accommodate 10 concurrent jobs, and the default replica count for workers is 1. If your jobs take a long time, it's possible that ports will not be reclaimed fast enough for new syncs. If you have 15 concurrent connections, you may want to increase your replica count to 2.
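(For illustration only, a rough capacity check based on the defaults mentioned above; the 10-jobs-per-worker figure is the assumption from that comment, not something I have verified.)

```java
// Rough capacity check, assuming ~10 concurrent jobs per worker replica.
public class WorkerCapacity {

    public static int requiredReplicas(final int concurrentConnections, final int jobsPerWorker) {
        return (int) Math.ceil((double) concurrentConnections / jobsPerWorker);
    }

    public static void main(final String[] args) {
        // 15 concurrent connections at 10 jobs per worker -> 2 replicas.
        System.out.println(requiredReplicas(15, 10));
    }
}
```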
Unfortunately this doesn't help in our case. We already have 8 worker replicas, so if you're right we should be able to handle up to 80 concurrent connections. In practice Airbyte struggles to handle just 30. Also, none of our jobs take over an hour to run, and they are definitely all finished by the time we trigger them again roughly 12 hours later.
To me the symptoms still point to ports not being reclaimed after a sync ends: everything works just fine for some time after restarting the workers (two days in our case, but I suspect this depends on the number of connections and how often they run), and after that all syncs fail consistently. I don't know how to troubleshoot this further though. Hopefully @seanglynn-thrive's additional logging will shed some light.
To add to this: we also have a similar setup with 5 worker replicas and 15 connections, staggered at different times every hour to minimize sync concurrency (connection A runs at 0 minutes past the hour, connection B at 10 minutes past the hour, etc.). Each job takes 1-6 minutes to complete. We have even performed some stress tests where we executed all jobs at once, which caused no issues and returned 0 failures.
Initially, we had a single bulky worker (high memory/CPU allocations) doing all of the heavy lifting, but we started to notice this issue occurring every 24 hours or so, causing an outage. We then scaled out to 3 worker replicas, which made the issue occur less frequently (every 48-72 hours).
We later scaled to 5 replicas, which delayed the issue further, to every 4-5 days.
In our experience, scaling out the workers delays this exception but does not resolve the underlying issue.
So if we can all agree that the issue lies within the KubePortManager's port allocation, I think we can work together to narrow it down.
QS 1: Is it possible that the KubePortManager holds on to ports that were allocated to a job at some point in the past?
There are some stale job pods in the k8s namespace that never completed or reached a healthy state (e.g. Error / Init:Error). Could the KubePortManager be retaining the old ports of these failed/incomplete jobs, which then accumulate over time? (A sketch of this failure mode follows QS 2 below.)
Example:
NAME READY STATUS RESTARTS AGE
source-postgres-read-13100-0-alsyl 0/4 Init:Error 0 29h
source-postgres-read-13264-0-xncxi 0/4 Init:Error 0 15h47m
source-postgres-read-13283-0-lgfkw 0/4 Init:Error 0 14h22m
QS 2: Is there a connection between the KubePortManager class (within the Worker) and the PodSweeper that keeps both in sync with each other? For example, if the pod sweeper deletes old pods at the kubernetes level, is this change reflected in the KubePortManager?
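To make the hypothesis in QS 1 concrete, here is a minimal sketch (assuming a shared port pool backed by a blocking queue) of how ports can drain away if failed pods never return them. All class and method names here are made up for illustration and are not the actual Airbyte implementation:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

// Hypothetical port pool, illustrative only.
public class PortPool {

    private final BlockingQueue<Integer> available = new LinkedBlockingDeque<>();

    public PortPool(final int firstPort, final int count) {
        for (int p = firstPort; p < firstPort + count; p++) {
            available.offer(p);
        }
    }

    // Blocks for up to the timeout; returns null once every port is considered
    // "in use", which would produce exactly the kind of failure reported here.
    public Integer take() throws InterruptedException {
        return available.poll(10, TimeUnit.MINUTES);
    }

    // If a job pod dies (e.g. Init:Error) before anything calls release(),
    // the port is leaked and the pool drains a little further each time.
    public void release(final int port) {
        available.offer(port);
    }
}
```

If nothing ties pod deletion (e.g. by the pod sweeper) back to release(), the pool only ever shrinks, which would match the behaviour of everything working for a few days and then all syncs failing at once.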
Another approach we tried to avoid this issue was to increase the number of TEMPORAL_WORKER_PORTS from 40 to 80. This did not give us the results we expected :(
Is it possible/recommended to significantly increase the number of ports available under this configuration?
I put together https://github.com/airbytehq/airbyte-platform/pull/217 in an attempt to fix this issue.
I believe the problem actually occurs during Pod creation. If the init container fails, the ports are never reclaimed because this all happens in the constructor. This may lead to port exhaustion like we are experiencing here.
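For reference, a simplified sketch of the pattern described above, returning any already-taken ports when pod creation or the init container fails part-way through. This is not the actual PR, and all names are illustrative:

```java
import java.util.concurrent.BlockingQueue;

// Simplified sketch only; the real pod-process constructor is far more involved.
public class PodLauncher {

    private final BlockingQueue<Integer> portPool;

    public PodLauncher(final BlockingQueue<Integer> portPool) {
        this.portPool = portPool;
    }

    public void launch() throws Exception {
        final Integer stdoutPort = portPool.take();
        final Integer stderrPort = portPool.take();
        try {
            createPodAndWaitForInit(stdoutPort, stderrPort);
        } catch (final Exception e) {
            // Without this cleanup, ports taken before the failure are never
            // returned, and the pool slowly drains until take() starts failing.
            portPool.offer(stdoutPort);
            portPool.offer(stderrPort);
            throw e;
        }
    }

    private void createPodAndWaitForInit(final int stdoutPort, final int stderrPort) throws Exception {
        // Placeholder for Kubernetes pod creation and the init-container wait.
    }
}
```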
We upgraded after the PR above was merged + released. At first it seemed to have fixed the issue as we went almost three full days with no issues, but we just started getting the same errors in our syncs:
2023-05-18 09:44:57 ERROR i.a.w.g.DefaultReplicationWorker(replicate):279 - Sync worker failed.
io.airbyte.workers.exception.WorkerException: Cannot invoke "java.lang.Integer.intValue()" because the return value of "io.airbyte.workers.process.KubePortManagerSingleton.take()" is null
So it looks like the PR helped but didn't fix the issue entirely.
We also encountered this issue a couple times now (on v0.40.22). Restarting the workers helps, but it's only a temporary solution. We would like to upgrade and have been waiting for a version where a fix for this issue has been included
@olivermeyer are you still running into the issue or did you find a way to fix it?
Upgrading the chart to v0.45.35 fixed the issue for us.
@benmoriceau I think there was a PR to fix it right? Can you link the work and close the issue?
On chart version v0.45.0 the issue has also not been reproducing for a while, thank you.
Issue resolved since Airbyte: v0.45.0
🚀
YES 👍 I have already opened a PR here to add better logging to the KubePortManagerSingleton class, as there seems to be very little logging, which makes things very difficult to triage.
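A minimal sketch of the kind of occupancy logging that would help here (illustrative only, not the actual PR or the real KubePortManagerSingleton code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative port pool that reports its occupancy on every take/offer.
public class LoggingPortPool {

    private static final Logger LOGGER = LoggerFactory.getLogger(LoggingPortPool.class);

    private final BlockingQueue<Integer> available = new LinkedBlockingDeque<>();

    public Integer take() throws InterruptedException {
        final Integer port = available.poll(10, TimeUnit.MINUTES);
        LOGGER.info("Took port {}; {} ports still available.", port, available.size());
        return port;
    }

    public void offer(final int port) {
        if (!available.contains(port)) {
            available.add(port);
        }
        LOGGER.info("Returned port {}; {} ports now available.", port, available.size());
    }
}
```

With that in the logs, it should be easy to see whether the available count trends toward zero over the days leading up to the failures.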