airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
16.19k stars 4.14k forks source link

connector-ci: Use Dagger in CAT #27858

Closed alafanechere closed 1 year ago

alafanechere commented 1 year ago

Context

Our Connector Acceptance Tests (CAT) project is a python program running tests on our connector docker images. These tests execute commands on the container and perform assertions on the output. We containerize CAT and ran it so far with airbyte-ci using a docker-in-docker pattern:

  1. airbyte-ci builds the connector container using Dagger.
  2. The image is exported as a tarball file and loaded to a global dockerd service.
  3. The CAT container is bound to the global dockerd service so it has access to the build container image.
  4. We run CAT from airbyte-ci

Problems

We'd love to not rely on the docker-in-docker pattern for the following reasons:

Solution

To get rid of the docker-in-docker pattern, we'd love to make CAT use Dagger. Docker interactions with containers in CAT are located in a single class: ConnectorRunner.

Implementation

In CAT I refactored ConnectorRunner class to use Dagger instead of docker commands. In airbyte-ci I refactored our with_connector_acceptance_test function to build the CAT container and mount the connector image tarball to it.

How airbyte-ci shares the connector under test image / container with CAT:

We share the build connector image from airbyte-ci by mounting the tarball file to the CAT container and use dagger_client.container().import_(<tarball_file>) to create the connnector container.

I originally tried to share the connector container by sharing the ContainerID to CAT and instantiate the container in CAT with dagger_client.container(dagger.ContainerId(<container_id>)) but it didn't work I occassionally face error like failed to load cache key: unable to get info about digest: NotFound: rpc error: code = NotFound desc = content sha256:7e9d2045e9891e8eef3c917af3a5246a292c2b2c8c7f8c0d0b76a357d5439ea8: .

Current blockers

It's not working for heavy images

The current implementation works for lightweight images (~50mb). But when I try to run CAT on a heavy images (~500mb) the test execution hangs... My hypothesis is that the import_ is a costly operation, and as we instantiate a new Dagger connection for each exec we want on to run on a container, we pay the cost of the import for each exec. We instantiate a new Dagger connection for each exec because the public methods of ConnectorRunner function are synchronous, so we run anyio.run inside them. I'd prefer to not change the public methods of ConnectorRunner to be async as it would require to change the whole CAT codebase to be async.

bnchrch commented 1 year ago

Well laid out @alafanechere 👏

Some great benefits here are

  1. Naturally pushes us to update CAT to 3.10
  2. No interface changes so everything should continue to work smoothly
  3. Dagger is well suited to coordinate docker based systems

So far im a tentative thumbs up!

One question, Is there any other known challenges to "dagger in dagger"?

evantahler commented 1 year ago

+1 from me too! The benefits are real, and the more airbyte code we can standardize, the better.

I'd suggest time boxing this. If there are dagger-in-dagger problems, let's only spend a few days on this, as it is nice-to-have, not required

alafanechere commented 1 year ago

Thank you for your inputs @bnchrch and @evantahler! I asked a couple of month ago to the dagger team if dagger in dagger is tricky, they told it was not because they follow this paradigm for their tests. I think it's indeed worth timeboxing / spiking this: not more than 1.5 day.

alafanechere commented 1 year ago

Updates

The hypothesis we came up with the Dagger team was the following: The import_ operation is not cached throughout sessions, as a new Dagger session is instantiated we pay the cost of loading the tarball to a container on each ConnectorRunner method call.

I refactored the CAT code to use a single Dagger sessions. But the "hanging test" error remain and finish with timeout issue:

Running locally, the source-openweather test run correctly. ( The TypeError: object ConfiguredAirbyteCatalog can't be used in 'await' expression was a easily fixed)

alafanechere commented 1 year ago

I provided the Dagger team with access to source-openweather and source-postgres secrets for repro.

evantahler commented 1 year ago

Currently testing with nightly builds... PR review will be big... so... thanks :D

alafanechere commented 1 year ago

Moving this back to the sprint backlog as I want to focus on other task force issue and https://github.com/airbytehq/airbyte/pull/28656 fixed a lot of transient CAT failure.