Closed Lakr233 closed 10 months ago
What do your .gitlab-ci.yml and GitLab Runner's config look like?
Do I understand correctly that with the official Cirrus Labs images everything works fine?
with the official Cirrus Labs images everything works fine
It works sometimes (the official image).
The problem is that, under some unknown condition, it stops working (the executor is not able to SSH into the Tart VM) and shows the error above.
Steps I've tried to solve the issue (none of them fixed it):
- softnet and pass the environment variable
- If I clone the machine with the same tag (using the command above), it works for a period of time and then stops working again for no apparent reason.
My router proxies my traffic using a DHCP profile (it announces the gateway to the virtual networking device), so I checked the route table as above; the VM network interface (bridge100) did show up as a destination in the records.
The repo's .gitlab-ci.yml looks like:
image: ghcr.io/cirruslabs/macos-sonoma-xcode:15.1
stages:
  - CompileFramework
CompileFramework:
  tags:
    - xcode
  only: ...
  stage: CompileFramework
  script: ...
  artifacts: ...
GitLab Runner's config is copied and pasted from your documentation. (I currently do not have access to it; I will post it when I'm back at work if you need it.)
Sorry for the misleading statement in my previous reply, I stayed up too late that day.
- reinstall macOS
Let's go from here:
Also, which Tart version are you using?
Which macOS version are you running? And on which hardware platform?
macOS Sonoma 14.2.1 (or 14.2) on an Apple Silicon M1 Mac mini (1st gen), if I remember correctly.
Are there any other customizations done to the base OS? For example, tweaking firewall or "Internet Sharing" settings.
Firewall is disabled, iCloud/Apple ID is not turned on, Internet Sharing is disabled. Remote Login is enabled and assigned to the one and only user. Software update is disabled, Screen Saver is disabled. The internet is configured as DHCP(auto) v4 and v6 using wired cable.
Is there any other networking/virtualization-specific software installed? For example, UTM.
I believe not. The only virtualization tool I've installed is tart. The networking software installed consists of softnet on the CI device, and Surge on another device inside the local area network, which works as the DHCP server assigning 10.1.1.x addresses to devices and is used as the gateway.
Also, which Tart version are you using?
Not checked yet, but I installed it using brew one week ago, so I guess it is the newest one.
@Lakr233, the tart --version output will help to validate that you are running the latest version.
Also, is there any chance you are running more than 253 jobs a day? Could you please try to change the default DHCP lease time.
Also, looking at the exact GitLab Runner's config will be useful, since our docs contain several example configs and it would be great to see which one you use and whether it has concurrency, softnet, etc. configured.
ci@ci ~ % tart --version
2.4.2
Also, is there any chance you are running more than 253 jobs a day?
No, it's like 1-5 jobs each day.
Also looking at the exact GitLab Runner's config
ci@ci ~ % cat .gitlab-runner/config.toml
concurrent = 1
check_interval = 5
shutdown_timeout = 10

[session_server]
  session_timeout = 3600

[[runners]]
  name = "AppleSilicon"
  url = "https://xxx.xxx.xxx/"
  id = 3
  token = "glrt-xxxxxxxx"
  token_obtained_at = 2023-12-15T16:50:07Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "custom"
  [runners.feature_flags]
    FF_RESOLVE_FULL_TLS_CHAIN = false
  [runners.custom]
    config_exec = "gitlab-tart-executor"
    config_args = ["config"]
    prepare_exec = "gitlab-tart-executor"
    prepare_args = ["prepare", "--concurrency", "1", "--cpu", "auto", "--memory", "auto"]
    run_exec = "gitlab-tart-executor"
    run_args = ["run"]
    cleanup_exec = "gitlab-tart-executor"
    cleanup_args = ["cleanup"]
  [runners.cache]
    MaxUploadedArchiveSize = 0
Could the issue be addressed by initiating an SSH handshake to all bridge networks when a connection failure occurs? The idea is to identify available network interfaces, select those that could potentially be virtual machine bridges, and resend the data packets through these interfaces. This could potentially be a solution; however, I am not proficient in Go, which prevents me from implementing and testing this idea.
Could the issue be addressed by initiating an SSH handshake to all bridge networks when a connection failure occurs? The idea is to identify available network interfaces, select those that could potentially be virtual machine bridges, and resend the data packets through these interfaces.
I'm not sure how that would help, because Tart finds the VM's IP by looking for the freshest lease that matches the VM's MAC address in the /var/db/dhcpd_leases file and then attempts to SSH to that IP.
This doesn't have much to do with bridge interfaces, unless you're creating them manually.
I think the best way to diagnose this would be to set TART_EXECUTOR_HEADLESS to false and check the networking status from inside the VM (check which IP was assigned and then try to connect via SSH from the host manually) when this issue happens again.
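To make the lease lookup described above easier to reason about while debugging, here is a minimal Go sketch of the idea. It is not Tart's actual code. It assumes the layout macOS typically writes to /var/db/dhcpd_leases (brace-delimited records with ip_address=, hw_address=<type>,<mac> and a hexadecimal lease= field), and the freshestLease name is hypothetical. It also compares MAC addresses only case-insensitively, with no further normalization.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// lease holds the fields of one { ... } record that matter for the lookup.
type lease struct {
	ip      string
	mac     string
	expires int64 // value of the lease= field, parsed from hex
}

// freshestLease scans text in the assumed dhcpd_leases layout and returns the
// IP of the record with the highest lease= value whose hw_address matches mac.
// The boolean is false when no record matches.
func freshestLease(text, mac string) (string, bool) {
	var cur, best lease
	found := false
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "{":
			cur = lease{} // start of a new record
		case line == "}":
			// End of a record: keep it if it matches and is fresher.
			if strings.EqualFold(cur.mac, mac) && (!found || cur.expires > best.expires) {
				best, found = cur, true
			}
		default:
			key, val, ok := strings.Cut(line, "=")
			if !ok {
				continue
			}
			switch key {
			case "ip_address":
				cur.ip = val
			case "hw_address":
				// Assumed to be stored as "<type>,<mac>".
				if _, m, ok := strings.Cut(val, ","); ok {
					cur.mac = m
				}
			case "lease":
				// Base 0 lets ParseInt honour the 0x prefix.
				cur.expires, _ = strconv.ParseInt(val, 0, 64)
			}
		}
	}
	return best.ip, found
}
```

Feeding the contents of the lease file and the VM's MAC address to this function shows which IP a "freshest lease" lookup would pick, which can then be compared with the address seen inside the VM.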
I've noticed a significant number of records within the /var/db/dhcpd_leases file. If I understand correctly, the presence of a record indicates that it's currently in use, and I've counted approximately 40 such records. However, this seems unusual because I've only used one virtual machine this week. Therefore, I suspect there may be some issue or error.
I will proceed to follow the guide mentioned earlier to see if it can help resolve the issue. I will keep you updated and get back to you if I discover anything new.
I have just discovered that deleting the lease file effectively resolves the issue. However, after executing another run, the same issue reoccurs.
I have just discovered that deleting the lease file effectively resolves the issue. However, after executing another run, the same issue reoccurs.
If this reproduces so easily, you should definitely try the last paragraph in https://github.com/cirruslabs/gitlab-tart-executor/issues/49#issuecomment-1884621454.
I've connected to the VM by removing the cleanup step from the GitLab Runner configuration and interacting with it via VNC. While the IP address is correctly assigned, the Tart executor continues to show the same issue.
I've used 10.155.1.1 as Shared_Net_Address, and 255.255.0.0 as Shared_Net_Mask.
2024/01/11 01:40:21 Pulling the latest version of ghcr.io/cirruslabs/macos-sonoma-xcode:15.1...
2024/01/11 01:40:23 Cloning and configuring a new VM...
2024/01/11 01:40:23 Waiting for the VM to boot and be SSH-able...
2024/01/11 01:40:33 VM errored: failed to connect via SSH: ssh: handshake failed: read tcp 10.1.1.114:49751->10.155.0.2:22: read: connection reset by peer
It seems like the standard 10 SSH connection attempts are not enough.
I've created https://github.com/cirruslabs/gitlab-tart-executor/pull/52 to address this.
Please check out the new 1.5.1 release to see if it fixes the problem.
I have tried to upgrade but unfortunately the problem still exists.
Running with gitlab-runner 16.7.0 (102c81ba)
Resolving secrets 00:00
Preparing the "custom" executor 00:12
Using Custom executor...
2024/01/11 14:25:45 Pulling the latest version of ghcr.io/cirruslabs/macos-sonoma-xcode:15.1...
2024/01/11 14:25:46 Cloning and configuring a new VM...
2024/01/11 14:25:46 Waiting for the VM to boot and be SSH-able...
2024/01/11 14:25:56 VM errored: failed to connect via SSH: ssh: handshake failed: read tcp 10.1.1.114:49795->10.155.0.2:22: read: connection reset by peer
ERROR: Job failed: exit status 1
I have resolved the problem.
After meticulous debugging of the code logic, I discovered that the root of the issue lies in the assumption that a successful connection to target:22 automatically implies the readiness of the SSH service on the target machine. However, this is not always the case. The retry mechanism you've implemented does not function as intended; it merely waits for a connection rather than ensuring the SSH service is ready.
The following log shows the problem.
Using Custom executor...
2024/01/11 17:18:07 Pulling the latest version of ghcr.io/cirruslabs/macos-sonoma-xcode:15.1...
2024/01/11 17:18:08 Cloning and configuring a new VM...
2024/01/11 17:18:08 Waiting for the VM to boot and be SSH-able...
successfully connected to 10.155.0.2:22 // <- My debug print here.
2024/01/11 17:18:18 VM errored: failed to connect via SSH: ssh: handshake failed: read tcp 10.1.1.114:49812->10.155.0.2:22: read: connection reset by peer
ERROR: Job failed: exit status 1
To rectify this, I have integrated the initiation of a new SSH connection within the retry algorithm. As a result, the issue has now been definitively resolved.
I have no idea about this issue: neither what's happening nor how to fix it.
GitLab CI Logs:
I've checked the route table; the host record for the VM is set to bridge100, which is expected, I guess.
By running ssh to the IP from Terminal.app, I was able to connect to the VM using the same credentials.
The current workaround is to clone another VM using the tart command. The IP address of the VM changes after that and, maybe because of this, GitLab CI works again.
I was wondering if any of the stuff below would help.
The code stops here: https://github.com/cirruslabs/gitlab-tart-executor/blob/main/internal/tart/vm.go#L176