cirruslabs / gitlab-tart-executor

GitLab Runner executor to run jobs in Tart VMs
MIT License

VM errored: failed to connect via SSH #49

Closed Lakr233 closed 10 months ago

Lakr233 commented 11 months ago

I have no idea what is causing this issue or how to fix it.

GitLab CI Logs:

Running with gitlab-runner 16.6.1 ()
  on AppleSilicon, system ID:
Resolving secrets 00:00
Preparing the "custom" executor 00:12
Using Custom executor...
2023/12/23 01:56:42 Pulling the latest version of ghcr.io/cirruslabs/macos-sonoma-xcode:15.1...
2023/12/23 01:56:43 Cloning and configuring a new VM...
2023/12/23 01:56:43 Waiting for the VM to boot and be SSH-able...
2023/12/23 01:56:53 VM errored: failed to connect via SSH: ssh: handshake failed: read tcp 10.1.1.___:49241->192.168.64.3:22: read: connection reset by peer
ERROR: Job failed: exit status 1

I've checked the routing table; the host record for the VM is set to bridge100, which is expected, I guess.

192.168.64         link#21            UC              bridge100      !
192.168.64.1      some mac address  UHLWI                 lo0       
192.168.64.3      some mac address  UHLWIi          bridge100    755

Running ssh against that IP from Terminal.app, I was able to connect to the VM using the same credentials.

The current workaround is to clone another VM using the tart command. The IP address of the VM changes after that and, maybe because of this, GitLab CI works again.

# example:
tart clone ghcr.io/cirruslabs/macos-sonoma-xcode:15.1 demo
# tart run demo # is not needed

I was wondering if any of the information below would help.

The code stops here: https://github.com/cirruslabs/gitlab-tart-executor/blob/main/internal/tart/vm.go#L176

    if err := retry.Do(func() error {
        dialer := net.Dialer{}

        netConn, err = dialer.DialContext(ctx, "tcp", addr)

        return err
    }, retry.Context(ctx)); err != nil {
        // ------>> execution stops at this return:
        return nil, fmt.Errorf("%w: failed to connect via SSH: %v", ErrVMFailed, err)
    }
edigaryev commented 11 months ago

What do your .gitlab-ci.yml and GitLab Runner config look like?

Do I understand correctly that with the official Cirrus Labs images everything works fine?

Lakr233 commented 11 months ago

with the official Cirrus Labs images everything works fine

It works sometimes (the official image).

The problem is that, under some unknown condition, it stops working (it can no longer SSH into the Tart VM) and shows the error above.

Steps I've tried to solve the issue (none of them fixed it):

* If I clone the machine with the same tag (using the command above), it works for a while and then stops working again for no apparent reason.

My router proxies my traffic using a DHCP profile (it announces the gateway to the virtual networking device), so I checked the routing table shown above. The VM network interface (bridge100) does show up there as a destination.

Inside repo's .gitlab-ci.yml looks like:

image: ghcr.io/cirruslabs/macos-sonoma-xcode:15.1

stages:
- CompileFramework

CompileFramework:
  tags:
    - xcode
  only: ...
  stage: CompileFramework
  script: ...
  artifacts: ...

The GitLab Runner config is copied and pasted from your documentation. (I don't have access to it right now; I'll post it when I'm back at work if you need it.)

Sorry for the misleading statement in my previous reply, I stayed up toooooo late that day.

edigaryev commented 11 months ago
  • reinstall macOS

Let's go from here:

  1. Which macOS version are you running? And on which hardware platform?
  2. Are there any other customizations done to the base OS? For example, tweaking firewall or "Internet Sharing" settings.
  3. Is there any other networking/virtualization-specific software installed? For example, UTM.

Also, which Tart version are you using?

Lakr233 commented 11 months ago

Which macOS version are you running? And on which hardware platform?

macOS Sonoma 14.2.1 (or 14.2) on an Apple Silicon M1 Mac mini (1st gen), if I remember correctly.

Are there any other customizations done to the base OS? For example, tweaking firewall or "Internet Sharing" settings.

Firewall is disabled, iCloud/Apple ID is not turned on, Internet Sharing is disabled. Remote Login is enabled and assigned to the one and only user. Software update is disabled, Screen Saver is disabled. The internet is configured as DHCP(auto) v4 and v6 using wired cable.

Is there any other networking/virtualization-specific software installed? For example, UTM.

I believe not. The only virtualization tool I've installed is tart. The networking software installed is softnet on the CI device, plus Surge on another device in the local network, which acts as the DHCP server (assigning 10.1.1.x addresses to devices) and as the gateway.

Also, which Tart version are you using?

I haven't checked yet, but I installed it with brew one week ago, so I guess it is the newest one.

fkorotkov commented 11 months ago

@Lakr233, the output of tart --version will help confirm that you are running the latest version.

Also, is there any chance you are running more than 253 jobs a day? Could you please try to change the default DHCP lease time.

Also, seeing your exact GitLab Runner config would be useful: our docs contain several example configs, and it would be great to see which one you use and whether it has concurrency, softnet, etc. configured.
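
One reading of the 253 figure above is address exhaustion: a /24 shared network such as 192.168.64.0 holds 256 addresses, minus the network address, the broadcast address, and the host's own gateway address. The sketch below only does that arithmetic; the function name usableLeases and the choice of which three addresses are reserved are assumptions made for illustration, not something taken from the Tart or macOS sources.

```go
package main

import "fmt"

// usableLeases estimates how many VM addresses a shared network with the
// given IPv4 prefix length can hand out. Assumption: three addresses are
// unavailable (network, broadcast, and the host-side gateway).
func usableLeases(prefixLen int) int {
	return 1<<(32-prefixLen) - 3
}

func main() {
	// /24 corresponds to a 255.255.255.0 mask, /16 to 255.255.0.0.
	for _, p := range []int{24, 16} {
		fmt.Printf("/%d -> %d usable addresses\n", p, usableLeases(p))
	}
}
```

If leases stay valid for about a day, each job that boots a fresh VM consumes one address for that period, which is why the number of jobs per day matters here.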

Lakr233 commented 10 months ago
ci@ci ~ % tart --version
2.4.2

Also, is there any chance you are running more than 253 jobs a day?

No, it's like 1-5 jobs each day.

Also looking at the exact GitLab Runner's config

ci@ci ~ % cat .gitlab-runner/config.toml 
concurrent = 1
check_interval = 5
shutdown_timeout = 10

[session_server]
  session_timeout = 3600

[[runners]]
  name = "AppleSilicon"
  url = "https://xxx.xxx.xxx/"
  id = 3
  token = "glrt-xxxxxxxx"
  token_obtained_at = 2023-12-15T16:50:07Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "custom"
  [runners.feature_flags]
    FF_RESOLVE_FULL_TLS_CHAIN = false
  [runners.custom]
    config_exec = "gitlab-tart-executor"
    config_args = ["config"]
    prepare_exec = "gitlab-tart-executor"
    prepare_args = ["prepare", "--concurrency", "1", "--cpu", "auto", "--memory", "auto"]
    run_exec = "gitlab-tart-executor"
    run_args = ["run"]
    cleanup_exec = "gitlab-tart-executor"
    cleanup_args = ["cleanup"]
  [runners.cache]
    MaxUploadedArchiveSize = 0

Could the issue be addressed by initiating an SSH handshake to all bridge networks when a connection failure occurs? The idea is to identify available network interfaces, select those that could potentially be virtual machine bridges, and resend the data packets through these interfaces. This could potentially be a solution; however, I am not proficient in Go, which prevents me from implementing and testing this idea.

edigaryev commented 10 months ago

Could the issue be addressed by initiating an SSH handshake to all bridge networks when a connection failure occurs? The idea is to identify available network interfaces, select those that could potentially be virtual machine bridges, and resend the data packets through these interfaces.

I'm not sure how that would help, because Tart finds the VM's IP by looking for the freshest lease that matches the VM's MAC address in the /var/db/dhcpd_leases file and then attempts to SSH to that IP.

This doesn't have much to do with bridge interfaces, unless you're creating them manually.

I think the best way to diagnose this would be to set TART_EXECUTOR_HEADLESS to false and check the networking status from inside of the VM (check which IP was assigned and then try to connect via SSH from host manually) when this issue happens again.
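
The lease lookup described above can be sketched roughly as follows. This is not Tart's implementation; it is a minimal illustration that assumes the lease file uses brace-delimited records with ip_address, hw_address, and lease (hexadecimal) keys. The function freshestLease and the sample data are hypothetical, and the sketch does not normalize MAC formatting (for example, missing leading zeros).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// lease holds the fields of one record that matter for the lookup.
type lease struct {
	ip     string
	mac    string
	expiry int64
}

// freshestLease scans lease records in text and returns the IP of the
// record that matches mac and has the highest lease value.
func freshestLease(text, mac string) (string, bool) {
	var cur, best lease
	found := false
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "{":
			cur = lease{} // a new record starts
		case line == "}":
			// Record complete: keep it if it matches and is fresher.
			if strings.EqualFold(cur.mac, mac) && (!found || cur.expiry > best.expiry) {
				best, found = cur, true
			}
		default:
			key, val, ok := strings.Cut(line, "=")
			if !ok {
				continue
			}
			switch key {
			case "ip_address":
				cur.ip = val
			case "hw_address":
				// Assumed layout: "<hardware type>,<mac>".
				if _, m, ok := strings.Cut(val, ","); ok {
					cur.mac = m
				}
			case "lease":
				// Assumed layout: hexadecimal timestamp such as 0x65a0b1c2.
				cur.expiry, _ = strconv.ParseInt(strings.TrimPrefix(val, "0x"), 16, 64)
			}
		}
	}
	return best.ip, found
}

// sample is made-up data in the assumed layout: two leases for one MAC
// address and one lease for another.
const sample = `{
	name=vm
	ip_address=192.168.64.3
	hw_address=1,aa:bb:cc:dd:ee:ff
	lease=0x65a0b1c2
}
{
	name=vm
	ip_address=192.168.64.5
	hw_address=1,aa:bb:cc:dd:ee:ff
	lease=0x65a0c000
}
{
	name=other
	ip_address=192.168.64.9
	hw_address=1,11:22:33:44:55:66
	lease=0x65a0d000
}`

func main() {
	ip, ok := freshestLease(sample, "aa:bb:cc:dd:ee:ff")
	fmt.Println(ip, ok)
}
```

With a lookup like this, stale or duplicate records for the same MAC address only matter if one of them carries a newer lease value than the record for the VM that is actually running.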

Lakr233 commented 10 months ago

I've noticed a significant number of records in the /var/db/dhcpd_leases file. If I understand correctly, the presence of a record indicates that the address is currently in use, and I've counted approximately 40 such records. However, this seems unusual because I've only used one virtual machine this week. Therefore, I suspect there may be some issue or error.

I will proceed to follow the guide mentioned earlier to see if it can help resolve the issue. I will keep you updated and get back to you if I discover anything new.

https://tart.run/faq/#changing-the-default-dhcp-lease-time

Lakr233 commented 10 months ago

I have just discovered that deleting the lease file effectively resolves the issue. However, after executing another run, the same issue reoccurs.

edigaryev commented 10 months ago

I have just discovered that deleting the lease file effectively resolves the issue. However, after executing another run, the same issue reoccurs.

If this reproduces so easily, you should definitely try the last paragraph in https://github.com/cirruslabs/gitlab-tart-executor/issues/49#issuecomment-1884621454.

Lakr233 commented 10 months ago

I've connected to the VM by removing the clean-up step from the GitLab Runner configuration and interacting with it via VNC. While the IP address is correctly assigned, the Tart executor continues to present the same issue.

I've used 10.155.1.1 as Shared_Net_Address, and 255.255.0.0 as Shared_Net_Mask.

2024/01/11 01:40:21 Pulling the latest version of ghcr.io/cirruslabs/macos-sonoma-xcode:15.1...
2024/01/11 01:40:23 Cloning and configuring a new VM...
2024/01/11 01:40:23 Waiting for the VM to boot and be SSH-able...
2024/01/11 01:40:33 VM errored: failed to connect via SSH: ssh: handshake failed: read tcp 10.1.1.114:49751->10.155.0.2:22: read: connection reset by peer

edigaryev commented 10 months ago
2024/01/11 01:40:33 VM errored: failed to connect via SSH: ssh: handshake failed: read tcp 10.1.1.114:49751->10.155.0.2:22: read: connection reset by peer

It seems that the standard 10 SSH connection attempts are not enough.

I've created https://github.com/cirruslabs/gitlab-tart-executor/pull/52 to address this.

edigaryev commented 10 months ago

Please check out the new 1.5.1 release to see if it fixes the problem.

Lakr233 commented 10 months ago

I have tried to upgrade but unfortunately the problem still exists.

Running with gitlab-runner 16.7.0 (102c81ba)
Resolving secrets 00:00
Preparing the "custom" executor 00:12
Using Custom executor...
2024/01/11 14:25:45 Pulling the latest version of ghcr.io/cirruslabs/macos-sonoma-xcode:15.1...
2024/01/11 14:25:46 Cloning and configuring a new VM...
2024/01/11 14:25:46 Waiting for the VM to boot and be SSH-able...
2024/01/11 14:25:56 VM errored: failed to connect via SSH: ssh: handshake failed: read tcp 10.1.1.114:49795->10.155.0.2:22: read: connection reset by peer
ERROR: Job failed: exit status 1
Lakr233 commented 10 months ago

I have resolved the problem.

After meticulous debugging of the code logic, I discovered that the root of the issue lies in the assumption that a successful connection to target:22 automatically implies the readiness of the SSH service on the target machine. However, this is not always the case. The retry mechanism you've implemented does not function as intended; it merely waits for a connection rather than ensuring the SSH service is ready.

The following log shows the problem.

Using Custom executor...
2024/01/11 17:18:07 Pulling the latest version of ghcr.io/cirruslabs/macos-sonoma-xcode:15.1...
2024/01/11 17:18:08 Cloning and configuring a new VM...
2024/01/11 17:18:08 Waiting for the VM to boot and be SSH-able...
successfully connected to 10.155.0.2:22 // <- My debug print here.
2024/01/11 17:18:18 VM errored: failed to connect via SSH: ssh: handshake failed: read tcp 10.1.1.114:49812->10.155.0.2:22: read: connection reset by peer
ERROR: Job failed: exit status 1

To rectify this, I have integrated the initiation of a new SSH connection within the retry algorithm. As a result, the issue has now been definitively resolved.
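
The distinction described above, between a port that accepts TCP connections and an SSH service that is ready, can be sketched with the standard library alone. This is not the patch that was applied: instead of a full SSH handshake, it reads the server's identification string (RFC 4253, section 4.2) as a simpler readiness signal. The names probeSSH, retry, isSSHBanner, and errNotReady, as well as the local stand-in listener, are hypothetical and exist only for illustration.

```go
package main

import (
	"bufio"
	"context"
	"errors"
	"fmt"
	"net"
	"strings"
	"time"
)

// errNotReady marks a port that accepts TCP connections but does not yet speak SSH.
var errNotReady = errors.New("port is open but SSH is not ready")

// isSSHBanner reports whether line looks like an SSH identification string.
func isSSHBanner(line string) bool {
	return strings.HasPrefix(line, "SSH-")
}

// probeSSH dials addr and waits for the first line the server sends.
// A reset or EOF at this point corresponds to the failure in the logs.
// Simplification: servers may send other lines before the banner; this
// sketch only inspects the first one.
func probeSSH(ctx context.Context, addr string) error {
	dialer := net.Dialer{Timeout: 2 * time.Second}
	conn, err := dialer.DialContext(ctx, "tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	_ = conn.SetReadDeadline(time.Now().Add(2 * time.Second))
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return fmt.Errorf("%w: %v", errNotReady, err)
	}
	if !isSSHBanner(line) {
		return errNotReady
	}
	return nil
}

// retry calls fn up to attempts times, sleeping delay between failures.
func retry(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return err
}

func main() {
	// Stand-in for a booting VM: the port is open, but the first two
	// connections are closed before any banner is sent.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		for n := 1; ; n++ {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			if n > 2 {
				fmt.Fprint(c, "SSH-2.0-demo\r\n")
			}
			c.Close()
		}
	}()

	tries := 0
	err = retry(10, 50*time.Millisecond, func() error {
		tries++
		// The probe, not just the dial, sits inside the retried function.
		return probeSSH(context.Background(), ln.Addr().String())
	})
	fmt.Println("ready:", err == nil, "tries:", tries)
}
```

The design point is that the whole readiness check is retried, so a connection that is accepted and then reset counts as "not ready yet" rather than as a fatal error.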