airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Concurrent CDK: POC to integrate asyncio #32873

Closed clnoll closed 11 months ago

clnoll commented 11 months ago

Background

We currently use threads in the concurrent CDK. When a thread picks up a task from the queue, it is blocked from doing any other work until that task finishes. Since the tasks involve I/O, this can lead to a situation where all workers are waiting on I/O while the queue grows. To mitigate this, we added some complexity to the concurrent CDK that prevents workers from picking up only partition generation tasks when they first start processing. This isn't an ideal solution, because workers still cannot process other work while waiting on I/O.

The proposed solution is to use asyncio to process tasks, which lets the task executor schedule other work while a task is waiting on I/O. We didn't originally use asyncio because of concerns about the level of effort (LOE) of the code change, but given the drawbacks listed above we're revisiting it.
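
To illustrate the point about overlapping I/O waits, here is a minimal sketch (not CDK code). `fake_io_task`, `run_all`, and `timed_run` are hypothetical names, and `asyncio.sleep` stands in for a real network call. A single event loop runs all tasks, and their waits overlap rather than adding up.

```python
import asyncio
import time


async def fake_io_task(task_id: int, delay: float) -> int:
    # asyncio.sleep is a stand-in for an I/O-bound call (assumption for illustration).
    # While this coroutine is suspended, the event loop is free to run other tasks.
    await asyncio.sleep(delay)
    return task_id


async def run_all(n_tasks: int, delay: float) -> list:
    # Schedule every task at once; gather returns results in submission order.
    return await asyncio.gather(*(fake_io_task(i, delay) for i in range(n_tasks)))


def timed_run(n_tasks: int = 20, delay: float = 0.05):
    # Returns the results and the wall-clock time taken to complete all tasks.
    start = time.perf_counter()
    results = asyncio.run(run_all(n_tasks, delay))
    elapsed = time.perf_counter() - start
    return results, elapsed
```

With 20 tasks of 0.05 s each, running them one after another would take about 1 s in total; here the elapsed time should be close to a single task's delay, since no thread is held idle per task.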

Suggested Implementation

A toy POC was created that shows how we might integrate asyncio into the existing concurrent CDK infrastructure. This ticket is for actually moving this logic into the concurrent CDK and getting it working end-to-end for a connector. The toy POC largely follows the patterns put in place for the existing concurrent CDK: using a task queue to which partition generation and partition read tasks are submitted.
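
The queue pattern described above could be sketched roughly as follows. This is not the toy POC itself: the function names, the `("generate", ...)` / `("read", ...)` task tuples, and the simulated I/O are all assumptions made for illustration. Partition generation tasks enqueue partition read tasks on the same `asyncio.Queue`, and a small pool of worker coroutines drains it.

```python
import asyncio


async def generate_partitions(stream: str, n: int) -> list:
    # Hypothetical stand-in for partition generation (e.g. listing slices via an API).
    await asyncio.sleep(0.01)
    return [f"{stream}:{i}" for i in range(n)]


async def read_partition(partition: str) -> list:
    # Hypothetical stand-in for reading the records of one partition.
    await asyncio.sleep(0.01)
    return [{"partition": partition, "record": j} for j in range(2)]


async def worker(queue: asyncio.Queue, records: list) -> None:
    # Each worker pulls tasks until it is cancelled; awaiting I/O yields to other workers.
    while True:
        kind, payload = await queue.get()
        try:
            if kind == "generate":
                stream, n = payload
                for partition in await generate_partitions(stream, n):
                    # Read tasks are enqueued before task_done() is called for the
                    # generate task, so queue.join() cannot return early.
                    queue.put_nowait(("read", partition))
            else:
                records.extend(await read_partition(payload))
        finally:
            queue.task_done()


async def sync_streams(streams: dict, n_workers: int = 4) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    records: list = []
    for stream, n in streams.items():
        queue.put_nowait(("generate", (stream, n)))
    workers = [asyncio.create_task(worker(queue, records)) for _ in range(n_workers)]
    await queue.join()  # wait until every generate and read task is done
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return records
```

For example, `asyncio.run(sync_streams({"users": 3, "orders": 2}))` processes two generation tasks and five read tasks. Because worker coroutines are cheap compared to threads, the pool can be sized well above the number of streams, which is one way to avoid the "all workers stuck on partition generation" situation described in the background.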

One of the main challenges will be limiting/isolating the surface area of the change to the CDK and connectors to only what's required for the performance improvements.

Acceptance Criteria

clnoll commented 11 months ago

grooming notes: