Open kpenfound opened 1 year ago
Currently in a 'monitoring' state on this one to see if it comes up again. To debug this further we'll also need an upstream change that includes the stack in the trace logs; @sipsma has more context on that.
Yeah for this one we'll need to catch it happening in one of the engines in the dev cluster w/ trace logs enabled.
Just last week I added even more trace logs to buildkit, on top of the previous ones, that should help debug this (and related bugs) further. We are planning a release tomorrow, so once that's out, Airbyte should be able to upgrade and pick up those extra logs too.
For some more context on the logs that will be of interest:
- `removed snapshot` will tell us when a snapshot dir has been deleted
- The `key` field in these messages should be in the form `buildkit/<some int>/<cache record id>`, so we can also find when the cache record for the cache mount has been removed from disk

With all of the above, we should be able to first figure out what cache record ID corresponded to the cache mount that went missing. Then we can look through the rest of the logs to confirm that the snapshot for the cache ref was removed prematurely. And from there use the stack traces on the acquire/release/remove logs to see what went wrong and which callers may be responsible for removing it early.
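As a rough illustration of that log search, here is a minimal Python sketch. It assumes only the key shape described above (buildkit/<some int>/<cache record id>); the sample log lines, the field layout around the key, and the helper names are all made up for illustration and are not actual buildkit output.

```python
import re

# Assumed shape of the key field, per the description above:
# buildkit/<some int>/<cache record id>
KEY_PATTERN = re.compile(r"buildkit/(\d+)/([^\s\"]+)")


def cache_record_ids(line):
    """Return every cache record ID found in key-like fields on one log line."""
    return [m.group(2) for m in KEY_PATTERN.finditer(line)]


def lines_for_record(lines, record_id):
    """Keep only the log lines that mention the given cache record ID."""
    return [line for line in lines if record_id in cache_record_ids(line)]


# Hypothetical log lines, invented for this example only.
sample = [
    'level=trace msg="removed snapshot" key=buildkit/12/abc123',
    'level=trace msg="removed snapshot" key=buildkit/13/def456',
]
matches = lines_for_record(sample, "abc123")
```

Once the cache record ID of the missing cache mount is known, a filter like this narrows the trace logs down to the events for that one record.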
I realize this is probably kind of confusing still if you haven't had your head buried in buildkit internals previously, but if anyone else on the Dagger team ends up looking into this, hopefully it's a helpful start and I'm happy to clarify anything further.
Hey @sipsma, I would love to better understand the root cause of these problems (you can find impacted workflows in the related issues mentioned above). I believe it may be related to the sheer number of volumes we create for docker libs in the dockerd service. You suggested that we use a single volume for docker libs instead of creating one per connector. I'll try to implement this, but I would also love to understand why it might cause these problems if it's indeed the root cause.
I confirm that disabling concurrency solves this problem. By disabling concurrency I mean running one connector test pipeline at a time.
@kpenfound / @sipsma I made a logical change to our pipeline: the dockerd service is now a singleton. I'm trying out a nightly build on https://github.com/airbytehq/airbyte/pull/27021 to check if it has positive effects.
Hey @alafanechere, thanks for the update! On our end we're hoping to see this again with the new engine to debug further.
> I would love to better understand the root cause of these problems
Essentially buildkit seems to be removing cache when it shouldn't be. There's an internal system that tracks when a part of the cache is being referenced, and will only remove it if it's not being referenced. Based on the behavior, it seems like buildkit is removing cache that is still being referenced, meaning a reference is being lost track of somewhere. The updated logs will help determine when/where that is happening.
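To make the idea of "losing track of a reference" concrete, here is a toy reference-counting sketch in Python. It is not buildkit's implementation, and the double release shown is only one hypothetical way a count could go wrong, not a diagnosis of this bug; the class and method names are invented for illustration.

```python
class CacheRecord:
    """Toy cache record that may only be removed once nothing references it."""

    def __init__(self, record_id):
        self.record_id = record_id
        self.refs = 0
        self.removed = False

    def acquire(self):
        """Register one more user of this record."""
        self.refs += 1

    def release(self):
        """Unregister one user of this record."""
        self.refs -= 1

    def try_remove(self):
        """Remove the record only if no references remain; return whether it is removed."""
        if self.refs <= 0:
            self.removed = True
        return self.removed


record = CacheRecord("abc123")
record.acquire()  # first user mounts the cache
record.acquire()  # second, concurrent user mounts the same cache
removed_while_in_use = record.try_remove()  # refused: two references are held

record.release()  # first user finishes
record.release()  # hypothetical bug: the same release is counted twice
removed_after_bug = record.try_remove()  # allowed, although the second user is still running
```

In this toy model the second user would now find its cache gone, which is the same kind of symptom as the missing snapshot directory. Stack traces on the acquire/release/remove logs are what would show where a real miscount happens.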
Closing as https://github.com/airbytehq/airbyte/pull/27021 fixed the problem: using a single dockerd service instead of one per connector mitigated the problem and we were able to run a full nightly build without these errors.
Thanks, I missed that one from the truncated logs, @alafanechere!
Reopening because it has occurred again: https://github.com/airbytehq/airbyte/actions/runs/5234829960/jobs/9451260504
2023-06-11T12:46:27.3767228Z #61 0.100 runc run failed: unable to start container process: error during container init: error mounting "/var/lib/dagger/runc-overlayfs/snapshots/snapshots/114670/fs" to rootfs at "/tmp": stat /var/lib/dagger/runc-overlayfs/snapshots/snapshots/114670/fs: no such file or directory
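For anyone scanning many job logs for this failure, a small Python sketch like the following pulls the missing snapshot path and its numeric directory name out of the error line above. The regular expression is fitted to this one message only, and the helper name is made up; how that number relates to a cache record ID is not assumed here.

```python
import re

# Matches the trailing "stat <path>: no such file or directory" part of the error.
SNAPSHOT_PATH = re.compile(
    r"stat (\S+/snapshots/(\d+)/fs): no such file or directory"
)


def missing_snapshot(line):
    """Return (path, snapshot directory name) for a matching error line, else None."""
    m = SNAPSHOT_PATH.search(line)
    return (m.group(1), m.group(2)) if m else None


# The error line quoted above, copied as-is.
error = (
    '2023-06-11T12:46:27.3767228Z #61 0.100 runc run failed: unable to start '
    'container process: error during container init: error mounting '
    '"/var/lib/dagger/runc-overlayfs/snapshots/snapshots/114670/fs" to rootfs '
    'at "/tmp": stat /var/lib/dagger/runc-overlayfs/snapshots/snapshots/114670/fs: '
    'no such file or directory'
)
result = missing_snapshot(error)
```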
At Airbyte, we seek to be clear about the project priorities and roadmap. This issue has not had any activity for 180 days, suggesting that it's not as critical as others. It's possible it has already been fixed. It is being marked as stale and will be closed in 20 days if there is no activity. To keep it open, please comment to let us know why it is important to you and if it is still reproducible on recent versions of Airbyte.
The following category of errors has been observed in Publish pipelines:
Originally reported in https://github.com/airbytehq/airbyte/issues/25877
Relevant comment from Erik: https://github.com/airbytehq/airbyte/issues/25877#issuecomment-1548300551
Example failure: https://github.com/airbytehq/airbyte/actions/runs/4919988172/jobs/8788287616#step:5:6107