cirruslabs / gitlab-tart-executor

GitLab Runner executor to run jobs in Tart VMs
MIT License

Concurrent `--cache-dir` mount failing? #73

Closed: olejnjak closed this issue 5 months ago

olejnjak commented 7 months ago

👋 guys,

I wanted to consult one thing. For the past few days we have been experiencing some weird issues on our CI; it mainly seems to have something to do with mounting the cache dir when concurrent jobs mount it at about the same time.

I think the only change that occurred on our host machine is the OS update from 14.3 to 14.4.1; the guest images are the same. I updated Gitlab Runner, Gitlab Tart Executor and Tart in an attempt to fix it, but without any success.

Our Tart tooling should currently be up to date:

 ~ % brew list --versions
gitlab-runner 16.10.0
gitlab-tart-executor 1.13.0
softnet 0.10.1
tart 2.8.1

Our runner config:

concurrent = 2
check_interval = 0
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "<...>"
  url = "https://gitlab.<...>"
  id = 834
  token = "<...>"
  token_obtained_at = <...>
  token_expires_at = <...>
  executor = "custom"
  environment = ["MINT_PATH=/Volumes/My Shared Files/mint", "MINT_LINK_PATH=/Volumes/My Shared Files/mint/bin", "TUIST_CACHE_PATH=/Volumes/My Shared Files/TuistCache"]
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.feature_flags]
    FF_RESOLVE_FULL_TLS_CHAIN = false
  [runners.custom]
    config_exec = "gitlab-tart-executor"
    config_args = ["config", "--cache-dir", "/Users/user/Library/Caches/GitlabCI/cache"]
    prepare_exec = "gitlab-tart-executor"
    prepare_args = ["prepare", "--concurrency", "2", "--cpu", "auto", "--memory", "auto", "--dir", "mint:/Users/user/.mint", "--dir", "TuistCache:/Users/user/.tuist/Cache"]
    run_exec = "gitlab-tart-executor"
    run_args = ["run"]
    cleanup_exec = "gitlab-tart-executor"
    cleanup_args = ["cleanup"]

The executor creates separate VMs correctly:

 ~ % tart list              
Source Name           Disk Size State
local  gitlab-3331894 80   62   running
local  gitlab-3331895 80   62   running

There is nothing Tart-related in our environment.

The errors we are getting look like this:

Running with gitlab-runner 16.10.0 (81ab07f6)
  on <...>, system ID: <...>

Preparing the "custom" executor
Using Custom executor...
2024/04/12 01:48:57 Pulling the latest version of ghcr.io/ackeecz/ackee-xcode:latest...
2024/04/12 01:48:58 Cloning and configuring a new VM...
2024/04/12 01:48:58 Waiting for the VM to boot and be SSH-able...
2024/04/12 01:49:12 Was able to SSH!
2024/04/12 01:49:12 Mounting cachedir...
2024/04/12 01:49:12 ssh: unexpected packet in response to channel open: <nil>
ERROR: Job failed: exit status 1
Running with gitlab-runner 16.10.0 (81ab07f6)
  on <...>, system ID: <...>

Preparing the "custom" executor
Using Custom executor...
2024/04/12 02:01:52 Pulling the latest version of ghcr.io/ackeecz/ackee-xcode:latest...
2024/04/12 02:01:53 Cloning and configuring a new VM...
2024/04/12 02:01:53 Waiting for the VM to boot and be SSH-able...
2024/04/12 02:02:07 Was able to SSH!
2024/04/12 02:02:07 Mounting cachedir...
mount_virtiofs: failed to mount /Users/admin/cachedir: Resource busy
2024/04/12 02:02:08 Process exited with status 75
ERROR: Job failed: exit status 1

I thought the SSH config might be relevant, but there is nothing interesting in it:

~ % cat .ssh/config 
Host *
  UseKeychain yes
  ServerAliveInterval 120
  TCPKeepAlive no

Do you have any idea what the cause could be? We are not struggling with disk space; the host machine has over 350 GB left, so that should not be an issue. I don't think that using the (CUSTOM_ENV_)CI_CONCURRENT_ID variable in the cache dir path is a solution, as I do want concurrent jobs to share the cache.

Thanks in advance :-)

EDIT: Maybe good to know: when I was running CI on my personal MacBook with the same setup (just with more boilerplate stuff in the system, as it is not used only for CI, unlike our Mac mini that has this issue), everything ran just fine. When I disable concurrent jobs, everything also seems to run fine. A machine restart did not help.

fkorotkov commented 7 months ago

Thank you for the detailed report! This part is the most concerning:

mount_virtiofs: failed to mount /Users/admin/cachedir: Resource busy

VirtioFS works via the built-in VirtioFS server/client in macOS. The host runs a server instance and the guests connect to it, so an OS update on either the host or the guest can affect this behaviour.
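For context, the directory-sharing flow looks roughly like this (the VM name and paths below are just illustrative):

# host: expose a directory to the VM when starting it
tart run <vm-name> --dir cache:/Users/<user>/Library/Caches/GitlabCI/cache

# guest: with the default automount tag the directory shows up under the shared-files volume
ls "/Volumes/My Shared Files/cache"

# guest: a custom tag instead has to be mounted explicitly
mkdir -p ~/cachedir
mount_virtiofs <custom-tag> ~/cachedir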

Which macOS version does your guest use? How frequently does this behaviour happen?

I tried to reproduce it with two VMs running 14.4.1 and the host on 14.4.1, but with no luck: mounting worked. PS: I didn't even think that mounting a single folder into two VMs would work 😅

olejnjak commented 7 months ago

Which macOS version does your guest use?

Guests used 14.3.1, but I am not sure it is relevant, as on my Mac with the same setup (same OS, same tooling versions) and exactly the same guest images it works just fine. I just updated our guests to 14.4.1, so I will try with that.

How frequently does this behaviour happen?

On the affected machine it happened roughly 4 out of 5 times, as our pipelines start with 2 or 3 concurrent jobs.

I tried to reproduce it with two VMs running 14.4.1 and the host on 14.4.1, but with no luck: mounting worked. PS: I didn't even think that mounting a single folder into two VMs would work 😅

I had not configured it this way intentionally, it just ended up like that and worked for months 😂 It would actually make sense if it did not work, but that would be a shame, as I think sharing the cache among VMs actually makes good sense.

I am wondering whether the executor couldn't rsync the cachedir to a unique location (I guess in the prepare stage) and rsync it back in the cleanup stage, but that would of course mean introducing some locking and synchronization, which is not that simple 🤔
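Something like this rough sketch is what I have in mind; the paths, lock directory and job-ID variable are just illustrative and it ignores all the edge cases:

# prepare stage: copy the shared cache into a per-job working copy
SHARED=/Users/user/Library/Caches/GitlabCI/cache
PRIVATE=/Users/user/Library/Caches/GitlabCI/cache-$CUSTOM_ENV_CI_JOB_ID
mkdir -p "$PRIVATE"
rsync -a "$SHARED/" "$PRIVATE/"

# cleanup stage: merge the working copy back under a simple mkdir-based lock
until mkdir "$SHARED.lock" 2>/dev/null; do sleep 1; done
rsync -a "$PRIVATE/" "$SHARED/"
rmdir "$SHARED.lock"
rm -rf "$PRIVATE"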

EDIT: What seems weird is that we mount a shared directory for Mint using --dir in the prepare stage, and that doesn't seem to be problematic.

edigaryev commented 7 months ago

You might also consider using an S3 cache as an alternative to Virtualization.Framework's directory sharing, as the latter is notorious for random errors, especially under load; see https://github.com/cirruslabs/tart/issues/567.

Some of the GitLab Tart Executor users use this approach too.
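For reference, the runner-side part of that setup would look roughly like this in config.toml (the endpoint, bucket and credentials below are placeholders):

[runners.cache]
  Type = "s3"
  Shared = true
  [runners.cache.s3]
    ServerAddress = "s3.example.com"
    AccessKey = "<access-key>"
    SecretKey = "<secret-key>"
    BucketName = "<bucket>"
    BucketLocation = "<region>"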

olejnjak commented 7 months ago

Hi @edigaryev @fkorotkov,

Okay, I just encountered the same Resource busy error on my MacBook running macOS 14.4.1 with a 14.4.1 guest image. I will append the concurrent ID to our cache dir path.

If it is not meant to be shared among concurrent executions, I think we can close this issue. But maybe it would be worth documenting? The same goes for --builds-dir, though it is more obvious in that case.

olejnjak commented 7 months ago

Okay, now it becomes interesting...

I am getting the Resource busy error even without a shared cache dir 😬

Running with gitlab-runner 16.10.0 (81ab07f6)
  on <...>, system ID: <...>

Preparing the "custom" executor
Using Custom executor...
2024/04/23 00:36:24 Pulling the latest version of ghcr.io/AckeeCZ/ackee-xcode:latest...
2024/04/23 00:36:25 Cloning and configuring a new VM...
2024/04/23 00:36:25 Waiting for the VM to boot and be SSH-able...
2024/04/23 00:36:33 Was able to SSH!
2024/04/23 00:36:33 Mounting cachedir...
mount_virtiofs: failed to mount /Users/admin/cachedir: Resource busy
2024/04/23 00:36:33 Process exited with status 75
ERROR: Job failed: exit status 1

~/.gitlab-runner/config.toml

concurrent = 2
check_interval = 0
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "<...>"
  url = "<url>"
  id = 834
  token = "<token>"
  token_obtained_at = <date>
  token_expires_at = <date>
  executor = "custom"
  environment = ["MINT_PATH=/Volumes/My Shared Files/mint", "MINT_LINK_PATH=/Volumes/My Shared Files/mint/bin", "TUIST_CACHE_PATH=/Volumes/My Shared Files/TuistCache"]
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.feature_flags]
    FF_RESOLVE_FULL_TLS_CHAIN = false
  [runners.custom]
    config_exec = "gitlab-tart-executor"
    config_args = ["config", "--cache-dir", "/Users/user/Library/Caches/GitlabCI/cache/$CUSTOM_ENV_CI_CONCURRENT_PROJECT_ID"]
    prepare_exec = "gitlab-tart-executor"
    prepare_args = ["prepare", "--concurrency", "2", "--cpu", "auto", "--memory", "auto", "--dir", "mint:/Users/user/.mint", "--dir", "TuistCache:/Users/user/.tuist/Cache"]
    run_exec = "gitlab-tart-executor"
    run_args = ["run"]
    cleanup_exec = "gitlab-tart-executor"
    cleanup_args = ["cleanup"]

Any ideas before I reinstall our runner machine?

edigaryev commented 7 months ago

I am getting the Resource busy error even without a shared cache dir 😬

But you have the --cache-dir in your configuration, in config_args.

As for the Resource busy error, it looks like it is returned when the volume is already mounted.
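A quick way to verify that from the host would be something like this (the VM name is whichever one the executor created, e.g. from your tart list output):

ssh admin@$(tart ip gitlab-3331894) 'mount | grep -i virtiofs'

If cachedir already shows up there, a second mount_virtiofs attempt will fail with Resource busy.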

  1. Which Tart version are you using?

  2. Which GitLab Tart Executor version are you using?

olejnjak commented 7 months ago
~ % brew list  --versions | grep tart
gitlab-tart-executor 1.13.0
tart 2.8.1

As for the Resource busy error, it looks like it is returned when the volume is already mounted.

You mean in the VM?

edigaryev commented 7 months ago

You mean in the VM?

Yes.

Does it work without the "--dir", "mint:/Users/user/.mint", "--dir", "TuistCache:/Users/user/.tuist/Cache" in prepare_args?

olejnjak commented 7 months ago

Well not really.

~ % brew list --versions | grep tart
gitlab-tart-executor 1.13.0
tart 2.9.0

At first I tried without --dir, pointing the things we want cached to /Users/admin/cachedir in the runner environment; still the same Resource busy error. Pretty much the first jobs that got executed failed with that error:

concurrent = 2
check_interval = 0
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "<...>"
  limit = 2
  url = "<...>"
  id = 840
  token = "<...>"
  token_obtained_at = <...>
  token_expires_at = <...>
  executor = "custom"
  environment = ["MINT_PATH=/Users/admin/cachedir/mint", "MINT_LINK_PATH=/Users/admin/cachedir/mint/bin", "TUIST_CACHE_PATH=/Users/admin/cachedir/TuistCache"]
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.feature_flags]
    FF_RESOLVE_FULL_TLS_CHAIN = true
  [runners.custom]
    config_exec = "gitlab-tart-executor"
    config_args = ["config", "--cache-dir", "/Users/<user>/Gitlab/cache/$CUSTOM_ENV_CI_CONCURRENT_PROJECT_ID"]
    prepare_exec = "gitlab-tart-executor"
    prepare_args = ["prepare", "--memory", "16384"]
    run_exec = "gitlab-tart-executor"
    run_args = ["run"]
    cleanup_exec = "gitlab-tart-executor"
    cleanup_args = ["cleanup"]

Then I tried completely without our caching stuff and the jobs seem to run. That adds about an extra half hour to our pipeline.

concurrent = 2
check_interval = 0
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "<...>"
  limit = 2
  url = "<...>"
  id = 840
  token = "<...>"
  token_obtained_at = <...>
  token_expires_at = <...>
  executor = "custom"
#  environment = ["MINT_PATH=/Users/admin/cachedir/mint", "MINT_LINK_PATH=/Users/admin/cachedir/mint/bin", "TUIST_CACHE_PATH=/Users/admin/cachedir/TuistCache"]
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.feature_flags]
    FF_RESOLVE_FULL_TLS_CHAIN = true
  [runners.custom]
    config_exec = "gitlab-tart-executor"
    config_args = ["config", "--cache-dir", "/Users/<user>/Gitlab/cache/$CUSTOM_ENV_CI_CONCURRENT_PROJECT_ID"]
    prepare_exec = "gitlab-tart-executor"
    prepare_args = ["prepare", "--memory", "16384"]
    run_exec = "gitlab-tart-executor"
    run_args = ["run"]
    cleanup_exec = "gitlab-tart-executor"
    cleanup_args = ["cleanup"]

edigaryev commented 7 months ago

Then I tried completely without our caching stuff and the jobs seem to run. That adds about an extra half hour to our pipeline.

How are these commented-out environment variables used?

Perhaps you could try minimizing them down to a single variable, to find out which one of them is causing this error?

olejnjak commented 7 months ago

Well, those variables should not have any actual effect unless Mint or Tuist are called. And the pipeline never got to executing the job, as "VM is ready." was never output.

I am not even sure the environment is a real lead; it might have been just a coincidence that it worked with it commented out... it is hard to let the executor run concurrently when that config causes such a big performance loss.

olejnjak commented 7 months ago

Okay, new update: the environment doesn't matter, I just got the same Resource busy error even with it commented out.

olejnjak commented 7 months ago

I looked deeper into the code and I am now able to reproduce the issue locally, without Gitlab CI at all.

Just a note: I use separate terminals so I can run blocking calls in parallel without backgrounding them.

  1. I created two identical machines:

    tart clone ghcr.io/ackeecz/ackee-xcode:latest test1
    tart clone ghcr.io/ackeecz/ackee-xcode:latest test2

  2. Ran both of them with the cachedir virtiofs tag that Tart uses:

    tart run test1 --dir /Users/olejnjak/Gitlab/cache/0:tag=tart.virtiofs.cachedir
    tart run test2 --dir /Users/olejnjak/Gitlab/cache/1:tag=tart.virtiofs.cachedir

  3. SSHed into the first one:

    ssh admin@`tart ip test1`

  4. Tried to mount the cachedir:

    admin@admins-Virtual-Machine ~ % mkdir -p cachedir
    admin@admins-Virtual-Machine ~ % mount_virtiofs tart.virtiofs.cachedir cachedir
    mount_virtiofs: failed to mount /Users/admin/cachedir: Resource busy

Can you think of anything I can do at this point to help debug or prevent this error?

olejnjak commented 7 months ago

Sorry for the spam, but I am adding info and ideas as I get them 😃

I managed to mount the cachedir into two running VMs when they were not using the same tag.

What do you think about adding a job ID to the virtiofs tags that the executor uses? I don't think it is worth it if we are the only users that encounter this issue (but we currently have 2 devices experiencing it); otherwise I think it would be worth it.
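For illustration, the unique-tag variant that mounted fine for me looked roughly like this (the tag suffixes are just something I picked for the test):

tart run test1 --dir /Users/olejnjak/Gitlab/cache/0:tag=tart.virtiofs.cachedir.0
tart run test2 --dir /Users/olejnjak/Gitlab/cache/1:tag=tart.virtiofs.cachedir.1

# and inside each guest, mount that guest's own tag:
mount_virtiofs tart.virtiofs.cachedir.0 cachedir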

I currently have a spare Mac that I can clean-install (system, Tart, Gitlab Tart Executor) and use to check whether this issue is present there.

edigaryev commented 7 months ago

I looked deeper into the code and I am now able to reproduce the issue locally, without Gitlab CI at all.

I've tried reproducing this on ghcr.io/cirruslabs/macos-sonoma-base:latest to no avail.

The ghcr.io/ackeecz/ackee-xcode:latest image is a bit hefty to pull; do you mind checking whether this reproduces on ghcr.io/cirruslabs/macos-sonoma-base:latest too?

What do you think about adding a job ID to the virtiofs tags that the executor uses?

I'm not yet sure if this is related, because these tags are unlikely to be system-wide; otherwise, with the default com.apple.virtio-fs.automount tag, things would go haywire for many users.

olejnjak commented 7 months ago

Well, I tried ghcr.io/cirruslabs/macos-sonoma-base:latest and it did work. Immediately after that I tried our image again and it worked as well 🤯

olejnjak commented 6 months ago

Looks like I managed to work around it by downgrading Tart to 2.5.0 and Gitlab Executor to 1.10.0-6f152b4, at least on my Mac 🤔 No particular reason why I chose those versions.

But such workarounds are pretty difficult, as Homebrew is the only installation option besides manual installation.

olejnjak commented 6 months ago

Well, I did the same downgrade (Tart 2.5.0, Gitlab Executor 1.10.0, and also Softnet 0.8.2, but we don't use it so I don't think it matters) on our main CI Mac mini and haven't experienced the issue since, while using the same host system and the same guest images 🤔

edigaryev commented 6 months ago

It's probably the ability to customize the VirtioFS tags, introduced in 2.6.0 (just after the 2.5.0 release), that is at play.

Could you check if the latest Tart version works with the GitLab Executor downgraded to 1.10.0?

olejnjak commented 6 months ago

Yes, I will try this week ;-)

olejnjak commented 6 months ago

I just did the update and will leave it updated at least for this week (maybe even next week), then report back.

olejnjak commented 5 months ago

Nice Monday, guys 🙂 So far I haven't seen this issue, so I think I can say it is not present in the versions mentioned above. Do you suggest any approach other than continuously updating Gitlab Tart Executor to see which version introduced this for us?

olejnjak commented 5 months ago

With this setup I just started to see some other weird failures...

Running with gitlab-runner 16.11.0 (91a27b2a)
  on <redacted>, system ID: <redacted>

Preparing the "custom" executor
Using Custom executor...
2024/05/28 10:50:04 Pulling the latest version of ghcr.io/AckeeCZ/ackee-xcode:latest...
2024/05/28 10:50:05 Cloning and configuring a new VM...
2024/05/28 10:50:05 Waiting for the VM to boot and be SSH-able...
2024/05/28 10:50:05 Was able to SSH!
2024/05/28 10:50:05 VM is ready.

Preparing environment
Running on admins-Virtual-Machine.local...

Getting source from Git repository
Fetching changes with git depth set to 20...
Initialized empty Git repository in /private/tmp/builds/<redacted>/.git/
Created fresh repository.
Checking out afc72da1 as detached HEAD (ref is refs/merge-requests/996/head)...
2024/05/28 10:50:24 wait: remote command exited without exit status or exit signal

Uploading artifacts for failed job
cd: no such file or directory: /private/tmp/builds/<redacted>
2024/05/28 10:50:24 Process exited with status 1

Cleaning up project directory and file based variables
ERROR: Job failed: exit status 1
fkorotkov commented 5 months ago

This weirdness might be related to https://github.com/cirruslabs/tart/issues/567

As for the initial issue, I don't think we changed anything on our side to make it work again. Overall it seems like instability in the VirtioFS integration that we don't have control over. Please consider using an S3 cache.

olejnjak commented 5 months ago

@fkorotkov Well okay, let's go with VirtioFS instabilities. It still might be worth preventing, though: what do you think about adding the job ID to the VirtioFS tag, as mentioned earlier, probably with some kind of opt-in (an argument in the config stage, an environment value, ...)? Are you open to such a PR?

fkorotkov commented 5 months ago

Oh! I think I missed that point about unique tags. Given that VirtioFS has a single daemon on the host side, that might be the reason. IMO we can just make the tags unique; it should not break current behaviour. Implementing in #76.