airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

connectors-ci: `runc run failed` errors #26862

Open kpenfound opened 1 year ago

kpenfound commented 1 year ago

The following category of errors has been observed in Publish pipelines:

```
runc run failed: unable to start container process: error during container init: error mounting "/var/lib/dagger/runc-overlayfs/snapshots/snapshots/347127/fs" to rootfs at "/var/lib/docker": stat /var/lib/dagger/runc-overlayfs/snapshots/snapshots/347127/fs: no such file or directory
```

Originally reported in https://github.com/airbytehq/airbyte/issues/25877

Relevant comment from Erik: https://github.com/airbytehq/airbyte/issues/25877#issuecomment-1548300551

> Yeah, this one I was not able to repro on my own in the past, but I will look through your logs to find any clues about the one you most recently reported in Discord.
>
> In general, the error means buildkit is trying to mount a cache dir into a container that no longer exists on disk, which should never happen. So the most likely explanation is a "use-after-free" type of bug somewhere, but just looking through the relevant code in buildkit I haven't spotted anything yet. I may end up just upstreaming more logs to buildkit that will help debug this further, but we'll see what I find.
>
> It's very hard to say what you could do to mitigate it since the root cause is so unclear right now. The only thing that is at least worth a shot would be to use a single dockerd service for all connectors in a test suite, rather than giving each connector its own. That should greatly reduce the number of times these containers need to be spun up and down, which theoretically could reduce how often this bug has a chance of occurring. But again, that's just a best guess for now. Re-using dockerd more is probably a performance improvement anyway, so it doesn't hurt to try.

Example failure: https://github.com/airbytehq/airbyte/actions/runs/4919988172/jobs/8788287616#step:5:6107

kpenfound commented 1 year ago

Currently in a 'monitoring' state on this one to see if it comes up again. We'll also need an upstream change that includes the stack in the trace logs to debug this further; @sipsma has more context on that.

sipsma commented 1 year ago

Yeah for this one we'll need to catch it happening in one of the engines in the dev cluster w/ trace logs enabled.

I just last week added even more trace logs to buildkit that should help debug this (and related bugs) further, on top of the previous ones. We are planning on a release tomorrow, so once that's out, Airbyte should be able to upgrade and pick up those extra logs too.


For some more context on the logs that will be of interest:

  1. Either of these logs will tell us the mapping of dagger cache volume key -> buildkit cache record id
  2. These logs will let us track usage of shared cache mounts
  3. Logs w/ the message `removed snapshot` will tell us when a snapshot dir has been deleted
    • The `key` field in these messages should be in the form `buildkit/<some int>/<cache record id>`, so we can also find when the cache record for the cache mount has been removed from disk
  4. Finally, every cache ref (which is a pointer to a cache record) can be tracked through these logs:

With all of the above, we should be able to first figure out what cache record ID corresponded to the cache mount that went missing. Then we can look through the rest of the logs to confirm that the snapshot for the cache ref was removed prematurely. And from there use the stack traces on the acquire/release/remove logs to see what went wrong and which callers may be responsible for removing it early.

I realize this is probably kind of confusing still if you haven't had your head buried in buildkit internals previously, but if anyone else on the Dagger team ends up looking into this, hopefully it's a helpful start and I'm happy to clarify anything further.
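
To make the correlation concrete, here is a minimal sketch (an editorial illustration, not part of the actual tooling) of how the engine's trace logs could be filtered for a given cache record id once it is known. It assumes JSON-lines log output with `msg` and `key` fields; those field names and the log format are assumptions, so adjust them to whatever the engine actually emits.

```python
#!/usr/bin/env python3
"""Sketch: follow one buildkit cache record id through dagger engine trace logs.

Assumes JSON-lines logs with "msg" and "key" fields; adapt the field names to
the real engine output.
"""
import json
import sys


def trace_record(log_path: str, record_id: str) -> None:
    with open(log_path) as logs:
        for lineno, raw in enumerate(logs, start=1):
            try:
                entry = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            msg = str(entry.get("msg", ""))
            key = str(entry.get("key", ""))
            # "removed snapshot" entries carry keys shaped like
            # buildkit/<some int>/<cache record id>, so matching on the
            # record id surfaces the acquire/release/remove events for it.
            if record_id in key or record_id in msg:
                print(f"{lineno}: msg={msg!r} key={key!r}")


if __name__ == "__main__":
    # usage: python trace_record.py engine.log <cache record id>
    trace_record(sys.argv[1], sys.argv[2])
```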

alafanechere commented 1 year ago

Hey @sipsma, I would love to better understand the root cause of these problems (you can find impacted workflows in the related issues mentioned above). I believe it can be related to the sheer number of volumes we create for docker libs in the dockerd service. You suggested that we use a single volume for docker libs instead of creating one per connector. I'll try to implement this, but I would also love to understand why it might cause these problems if it is indeed the root cause.

alafanechere commented 1 year ago

I confirm that disabling concurrency solves this problem. By disabling concurrency I mean running one connector test pipeline at a time.
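
For reference, "one connector test pipeline at a time" boils down to capping pipeline concurrency at one. A minimal sketch with asyncio is below; `run_connector_pipeline` and the connector names are hypothetical placeholders, not the actual airbyte-ci code.

```python
import asyncio


async def run_connector_pipeline(connector: str) -> None:
    # Hypothetical placeholder for a single connector's test pipeline.
    print(f"testing {connector}...")
    await asyncio.sleep(0)


async def main(connectors: list[str]) -> None:
    # A semaphore of 1 disables concurrency: pipelines run strictly one at a time.
    semaphore = asyncio.Semaphore(1)

    async def run_with_limit(connector: str) -> None:
        async with semaphore:
            await run_connector_pipeline(connector)

    await asyncio.gather(*(run_with_limit(c) for c in connectors))


if __name__ == "__main__":
    asyncio.run(main(["source-faker", "destination-dev-null"]))
```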

alafanechere commented 1 year ago

@kpenfound / @sipsma I made a logical change to our pipeline: the dockerd service is now a singleton. I'm trying out a nightly build on https://github.com/airbytehq/airbyte/pull/27021 to check if it has positive effects.
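
For context, a rough sketch of what a singleton dockerd service can look like with the Dagger Python SDK is shown below. The image tags, port, cache volume name, and connector names are placeholders, and the exact service-binding call varies by SDK version (newer releases expect `dockerd.as_service()`), so treat this as an illustration rather than the actual airbyte-ci implementation.

```python
import sys

import anyio
import dagger


async def main() -> None:
    async with dagger.Connection(dagger.Config(log_output=sys.stderr)) as client:
        # Build ONE dockerd (docker-in-docker) container backed by a single
        # shared cache volume for /var/lib/docker, instead of one per connector.
        dockerd = (
            client.container()
            .from_("docker:24-dind")  # image tag is illustrative
            .with_mounted_cache("/var/lib/docker", client.cache_volume("shared-docker-lib"))
            .with_exposed_port(2375)
            .with_exec(
                ["dockerd", "--host=tcp://0.0.0.0:2375", "--tls=false"],
                insecure_root_capabilities=True,
            )
        )

        # Every connector test container binds to the same service, so dockerd
        # is only spun up once per pipeline run.
        for connector in ["source-faker", "destination-dev-null"]:  # placeholder names
            test = (
                client.container()
                .from_("docker:24-cli")
                .with_service_binding("docker", dockerd)  # newer SDKs: dockerd.as_service()
                .with_env_variable("DOCKER_HOST", "tcp://docker:2375")
                .with_exec(["docker", "info"])  # placeholder for the real test command
            )
            print(await test.stdout())


anyio.run(main)
```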

kpenfound commented 1 year ago

hey @alafanechere, thanks for the update! On our end we're hoping to see this again with the new engine to debug further.

> I would love to better understand the root cause of these problems

Essentially, buildkit seems to be removing cache when it shouldn't be. There's an internal system that tracks when a part of the cache is being referenced, and it will only remove that part if it's no longer referenced. Based on the behavior, it seems like buildkit is removing cache that is still being referenced, meaning a reference is getting lost somewhere. The updated logs will help determine when and where that is happening.
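
To illustrate the bug class (this is a toy model, not buildkit's actual code), here is a tiny reference-counting sketch: if any caller releases a reference it never acquired, the snapshot directory gets deleted while a legitimate holder is still using it, which produces exactly the kind of "no such file or directory" mount error seen above.

```python
class CacheRecord:
    """Toy model of a refcounted cache snapshot directory (not buildkit's code)."""

    def __init__(self, record_id: str) -> None:
        self.record_id = record_id
        self.refcount = 0
        self.on_disk = True

    def acquire(self) -> None:
        self.refcount += 1

    def release(self) -> None:
        self.refcount -= 1
        if self.refcount <= 0:
            # With correct accounting this only runs once nobody needs the
            # snapshot; a spurious release triggers it while it is still in use.
            self.on_disk = False  # the "removed snapshot" event

    def mount(self) -> str:
        if not self.on_disk:
            raise FileNotFoundError(
                f"stat .../snapshots/{self.record_id}/fs: no such file or directory"
            )
        return f"/var/lib/dagger/runc-overlayfs/snapshots/{self.record_id}/fs"


record = CacheRecord("347127")
record.acquire()      # a legitimate user of the cache mount
record.release()      # buggy release from a caller that never acquired the ref
try:
    record.mount()    # the legitimate holder now hits the runc-style error
except FileNotFoundError as err:
    print(err)
```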

alafanechere commented 1 year ago

@kpenfound we continued to face these problems on the new engine (example). The good news is that using a single dockerd service as @sipsma suggested seems to solve the problem (example)! I made the change in this PR and will merge it soon.

alafanechere commented 1 year ago

Closing, as https://github.com/airbytehq/airbyte/pull/27021 fixed the problem: using a single dockerd service instead of one per connector mitigated these errors, and we were able to run a full nightly build without them.

kpenfound commented 1 year ago

Thanks, I missed that one in the truncated logs, @alafanechere!

alafanechere commented 1 year ago

Reopening because it has occurred again: https://github.com/airbytehq/airbyte/actions/runs/5234829960/jobs/9451260504

```
2023-06-11T12:46:27.3767228Z #61 0.100 runc run failed: unable to start container process: error during container init: error mounting "/var/lib/dagger/runc-overlayfs/snapshots/snapshots/114670/fs" to rootfs at "/tmp": stat /var/lib/dagger/runc-overlayfs/snapshots/snapshots/114670/fs: no such file or directory
```

octavia-squidington-iii commented 5 days ago

At Airbyte, we seek to be clear about the project priorities and roadmap. This issue has not had any activity for 180 days, suggesting that it's not as critical as others. It's possible it has already been fixed. It is being marked as stale and will be closed in 20 days if there is no activity. To keep it open, please comment to let us know why it is important to you and if it is still reproducible on recent versions of Airbyte.