cloudbase / garm

GitHub Actions Runner Manager
Apache License 2.0

Runner instance for .. is no longer on the provider, removing from github #91

Closed SystemKeeper closed 1 year ago

SystemKeeper commented 1 year ago

We are using LXD as a provider with garm and have max runners set to 20. After a few days we noticed that running lxc list | grep garm- | wc -l counts 82 containers, even though max runners is 20.

When looking into the garm log I see the following for one of the "ghost containers":

2023/06/02 08:06:25 creating instance garm-xKyVF4h7UsVp in pool f52a63ee-7af9-4a19-b7f0-4f1f67de1579
2023/06/02 08:10:05 Runner instance for garm-xKyVF4h7UsVp is no longer on the provider, removing from github
2023/06/02 08:10:06 Removing garm-xKyVF4h7UsVp from database

But the container is still on the provider:

# lxc list | grep xKyVF4h7UsVp
| garm-xKyVF4h7UsVp | RUNNING | 172.17.0.1 (docker0) | <...> (eth0) | CONTAINER | 0         |

When looking into the syslog of the container, I see this:

Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: -----------------------------
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]:  Finish Install Dependencies
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: -----------------------------
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: --------------------------------------------------------------------------------
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: |        ____ _ _   _   _       _          _        _   _                      |
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: |       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: |      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: |      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: |       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: |                                                                              |
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: |                       Self-hosted runner registration                        |
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: |                                                                              |
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: --------------------------------------------------------------------------------
Jun  2 08:09:58 garm-xKyVF4h7UsVp cloud-init[749]: # Authentication
Jun  2 08:09:59 garm-xKyVF4h7UsVp cloud-init[749]: Using V2 flow: False
Jun  2 08:10:00 garm-xKyVF4h7UsVp cloud-init[749]: √ Connected to GitHub
Jun  2 08:10:00 garm-xKyVF4h7UsVp cloud-init[749]: # Runner Registration
Jun  2 08:10:01 garm-xKyVF4h7UsVp cloud-init[749]: √ Runner successfully added
Jun  2 08:10:02 garm-xKyVF4h7UsVp cloud-init[749]: √ Runner connection is good
Jun  2 08:10:02 garm-xKyVF4h7UsVp cloud-init[749]: # Runner settings
Jun  2 08:10:02 garm-xKyVF4h7UsVp cloud-init[749]: √ Settings Saved.
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]: Creating launch runner in /etc/systemd/system/actions.runner.<...>.garm-xKyVF4h7UsVp.service
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]: Run as user: runner
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]: Run as uid: 1001
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]: gid: 1001
Jun  2 08:10:03 garm-xKyVF4h7UsVp systemd[1]: Reloading.
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]: Created symlink /etc/systemd/system/multi-user.target.wants/actions.runner.<...>.garm-xKyVF4h7UsVp.service → /etc/systemd/system/actions.runner.<...>.garm-xKyVF4h7UsVp.service.
Jun  2 08:10:03 garm-xKyVF4h7UsVp systemd[1]: Reloading.
Jun  2 08:10:03 garm-xKyVF4h7UsVp systemd[1]: Started GitHub Actions Runner (<...>.garm-xKyVF4h7UsVp).
Jun  2 08:10:03 garm-xKyVF4h7UsVp runsvc.sh[4650]: .path=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]: /etc/systemd/system/actions.runner.<...>.garm-xKyVF4h7UsVp.service
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]: ● actions.runner.<...>.garm-xKyVF4h7UsVp.service - GitHub Actions Runner (<...>.garm-xKyVF4h7UsVp)
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]:      Loaded: loaded (/etc/systemd/system/actions.runner.<...>.garm-xKyVF4h7UsVp.service; enabled; vendor preset: enabled)
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]:      Active: active (running) since Fri 2023-06-02 08:10:03 UTC; 14ms ago
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]:    Main PID: 4650 (runsvc.sh)
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]:       Tasks: 2 (limit: 154339)
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]:      Memory: 2.1M
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]:         CPU: 10ms
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]:      CGroup: /system.slice/actions.runner.<...>.garm-xKyVF4h7UsVp.service
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]:              ├─4650 /bin/bash /home/runner/actions-runner/runsvc.sh
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]:              └─4652 ./externals/node16/bin/node ./bin/RunnerService.js
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]: Jun 02 08:10:03 garm-xKyVF4h7UsVp systemd[1]: Started GitHub Actions Runner (<...>.garm-xKyVF4h7UsVp).
Jun  2 08:10:03 garm-xKyVF4h7UsVp runsvc.sh[4652]: Starting Runner listener with startup type: service
Jun  2 08:10:03 garm-xKyVF4h7UsVp runsvc.sh[4652]: Started listener process, pid: 4665
Jun  2 08:10:03 garm-xKyVF4h7UsVp runsvc.sh[4652]: Started running service
Jun  2 08:10:03 garm-xKyVF4h7UsVp cloud-init[749]: Cloud-init v. 23.1.2-0ubuntu0~22.04.1 finished at Fri, 02 Jun 2023 08:10:03 +0000. Datasource DataSourceLXD.  Up 216.76 seconds
Jun  2 08:10:03 garm-xKyVF4h7UsVp systemd[1]: Finished Execute cloud user/final scripts.
Jun  2 08:10:03 garm-xKyVF4h7UsVp systemd[1]: Reached target Cloud-init target.
Jun  2 08:10:03 garm-xKyVF4h7UsVp systemd[1]: Startup finished in 3min 36.495s.
Jun  2 08:10:05 garm-xKyVF4h7UsVp runsvc.sh[4652]: √ Connected to GitHub
Jun  2 08:10:05 garm-xKyVF4h7UsVp runsvc.sh[4652]: Current runner version: '2.304.0'
Jun  2 08:10:05 garm-xKyVF4h7UsVp runsvc.sh[4652]: 2023-06-02 08:10:05Z: Listening for Jobs
Jun  2 08:10:07 garm-xKyVF4h7UsVp runsvc.sh[4652]: An error occurred: Access denied. System:ServiceIdentity;DDDDDDDD-DDDD-DDDD-DDDD-DDDDDDDDDDDD needs View permissions to perform the action.
Jun  2 08:10:07 garm-xKyVF4h7UsVp runsvc.sh[4652]: Runner listener exited with error code 2
Jun  2 08:10:07 garm-xKyVF4h7UsVp runsvc.sh[4652]: Runner listener exit with retryable error, re-launch runner in 5 seconds.
Jun  2 08:10:12 garm-xKyVF4h7UsVp runsvc.sh[4652]: Starting Runner listener with startup type: service
Jun  2 08:10:12 garm-xKyVF4h7UsVp runsvc.sh[4652]: Started listener process, pid: 4696
Jun  2 08:10:13 garm-xKyVF4h7UsVp runsvc.sh[4652]: √ Connected to GitHub
Jun  2 08:10:15 garm-xKyVF4h7UsVp runsvc.sh[4652]: Failed to create a session. The runner registration has been deleted from the server, please re-configure.
Jun  2 08:10:15 garm-xKyVF4h7UsVp runsvc.sh[4652]: Runner listener exited with error code 1
Jun  2 08:10:15 garm-xKyVF4h7UsVp runsvc.sh[4652]: Runner listener exit with terminated error, stop the service, no retry needed.
Jun  2 08:10:15 garm-xKyVF4h7UsVp systemd[1]: actions.runner.<...>.garm-xKyVF4h7UsVp.service: Deactivated successfully.
Jun  2 08:10:15 garm-xKyVF4h7UsVp systemd[1]: actions.runner.<...>.garm-xKyVF4h7UsVp.service: Consumed 3.187s CPU time.
Jun  2 08:11:29 garm-xKyVF4h7UsVp systemd[1]: Starting Download data for packages that failed at package install time...
Jun  2 08:11:29 garm-xKyVF4h7UsVp systemd[1]: update-notifier-download.service: Deactivated successfully.
Jun  2 08:11:29 garm-xKyVF4h7UsVp systemd[1]: Finished Download data for packages that failed at package install time.

So at first glance it looks like garm checked the runners on GitHub while the runner service inside the container was restarting? But garm should be able to detect that the container is still running on LXD. From the source code, garm checks against the provider, so I'm not sure what could fail here.

gabriel-samfira commented 1 year ago

Was garm restarted by any chance while the runners were being bootstrapped?

gabriel-samfira commented 1 year ago

Running lxc list | grep garm- | wc -l counts 82 containers

This is concerning. Garm should clean up any runners from the provider if they are no longer on github, and it should keep trying if it fails. I will investigate. Would it be possible to provide the full log (anonymized)? You could send it via email if you prefer. I would like to see what happened.

SystemKeeper commented 1 year ago

Would it be possible to provide the full log (anonymized)?

The full Garm log or the full syslog from the container or both?

SystemKeeper commented 1 year ago

Was garm restarted by any chance while the runners were being bootstrapped?

I don’t think it was, let me check if I can find something.

gabriel-samfira commented 1 year ago

The full Garm log or the full syslog from the container or both?

Full garm log. Sometimes I forget to use my outer voice :smile: . There should not be any sensitive info in the log, but if you spot anything, feel free to redact.

gabriel-samfira commented 1 year ago

2023/06/02 08:10:05 Runner instance for garm-xKyVF4h7UsVp is no longer on the provider, removing from github

This line is also interesting. The garm instance only shows up on github as a result of the runner actually starting up and running the self hosted runner app. We can see that in the console log.

The line I quoted means that garm could no longer find the actual VM/container in LXD. It saw it on github, looked for it in the provider, and the provider (lxd in this case) returned a result indicating that the VM/container didn't exist. Which it absolutely should. Will look at the code.

gabriel-samfira commented 1 year ago

If you build garm using:

make build-static

Would you mind also running:

garm -version

SystemKeeper commented 1 year ago

We currently only build via go install. The repo checkout is at this commit:

commit 702937f63602c6691e977857fa64dcde25551de0 (HEAD -> main, origin/main, origin/HEAD)
Author: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Date:   Thu Mar 30 09:00:46 2023 +0000

    Add github runner group in pool show

    Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>

I can try a static build and report the version later, if that would help.

SystemKeeper commented 1 year ago

You could send it via email if you prefer.

Sent by mail, thanks :)

SystemKeeper commented 1 year ago

The line I quoted denotes that garm could no longer find the actual VM/container in LXD. It saw it on github, looked for it in the provider and the provider (lxd in this case) returned a result that indicated the VM/containerd didn't exist

I see in the code that there's a cache for the instances used. Could that be a problem somehow? Like an outdated cache, a concurrency issue or something like that?

gabriel-samfira commented 1 year ago

The VM/container was created almost 5 minutes prior to that check. The cache is created every time cleanupOrphanedGithubRunners() is run. So that cache was created almost 5 minutes after the instance was created and should include all instances. Will check. The fact that you have a number of leftover VM/containers is also strange.

What version of LXD are you using? I want to have as much info as possible to try to reproduce this.
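
To make the failure mode easier to reason about, here is a minimal, hypothetical sketch (in Go, with invented type and function names; this is not garm's actual code) of the kind of cleanup pass described above. The key property is that the provider listing is fetched once per pass and acts as the cache for every lookup, so a listing that comes back incomplete makes healthy runners look orphaned and get deregistered from GitHub, which matches the log line quoted earlier:

```go
// Hypothetical sketch of an orphaned-runner cleanup pass, for illustration
// only; names and types are invented and do not reflect garm's implementation.
package pool

import "context"

type Instance struct {
	Name string
}

type Provider interface {
	// ListInstances returns all instances the provider currently knows about.
	ListInstances(ctx context.Context) ([]Instance, error)
}

type GithubClient interface {
	ListRunners(ctx context.Context) ([]string, error)
	RemoveRunner(ctx context.Context, name string) error
}

// cleanupOrphanedRunners removes GitHub runner registrations whose backing
// instance can no longer be found in the provider. The provider listing is
// fetched once per pass and used for all lookups in that pass, so a stale or
// incomplete listing makes healthy runners look orphaned.
func cleanupOrphanedRunners(ctx context.Context, p Provider, gh GithubClient) error {
	instances, err := p.ListInstances(ctx)
	if err != nil {
		// If the provider cannot be queried, do nothing rather than guessing:
		// treating an error as "no instances" would deregister every runner.
		return err
	}

	byName := make(map[string]struct{}, len(instances))
	for _, inst := range instances {
		byName[inst.Name] = struct{}{}
	}

	runners, err := gh.ListRunners(ctx)
	if err != nil {
		return err
	}

	for _, runner := range runners {
		if _, ok := byName[runner]; !ok {
			// Runner is registered on GitHub but has no backing instance in
			// the provider: remove the registration.
			if err := gh.RemoveRunner(ctx, runner); err != nil {
				return err
			}
		}
	}
	return nil
}
```

A symmetric pass in the opposite direction (deleting provider instances whose runner is no longer registered on GitHub) is what would keep leftover containers from piling up, per the cleanup behaviour described later in the thread.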

gabriel-samfira commented 1 year ago

I think I may have found the issue (embarrassing as it is):

https://github.com/cloudbase/garm/pull/92

Would you mind pulling the latest commit and running a:

make build-static

SystemKeeper commented 1 year ago

Running on Ubuntu 22.04 with LXD snap install (5.14) with ZFS pool and shiftfs enabled. Runners have a profile with 2 CPUs, 7GB RAM and 15 GB Disk (as with the default GitHub action runners).

``` # lxc info config: {} api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - macaroon_authentication - network_sriov - console - restrict_devlxd - migration_pre_copy - infiniband - maas_network - devlxd_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - devlxd_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - candid_authentication - backup_compression - candid_config - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - candid_config_key - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - rbac - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - container_syscall_intercept_mount_fuse - container_disk_ceph - virtual-machines - 
image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - container_syscall_intercept_sched_setscheduler - storage_lvm_thinpool_metadata_size - 
storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache - network_load_balancer - vsock_api - instance_ready_state - network_bgp_holdtime - storage_volumes_all_projects - metrics_memory_oom_total - storage_buckets - storage_buckets_create_credentials - metrics_cpu_effective_total - projects_networks_restricted_access - storage_buckets_local - loki - acme - internal_metrics - cluster_join_token_expiry - remote_token_expiry - init_preseed - storage_volumes_created_at - cpu_hotplug - projects_networks_zones - network_txqueuelen - cluster_member_state - instances_placement_scriptlet - storage_pool_source_wipe - zfs_block_mode - instance_generation_id - disk_io_cache - amd_sev - storage_pool_loop_resize - migration_vm_live - ovn_nic_nesting - oidc - network_ovn_l3only - ovn_nic_acceleration_vdpa - cluster_healing - instances_state_total api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls environment: addresses: [] architectures: - x86_64 - i686 certificate: | -----BEGIN CERTIFICATE----- ... r1auw2+ms34Ivm1+vw== -----END CERTIFICATE----- certificate_fingerprint: ... driver: lxc | qemu driver_version: 5.0.2 | 8.0.0 firewall: nftables kernel: Linux kernel_architecture: x86_64 kernel_features: idmapped_mounts: "true" netnsid_getifaddrs: "true" seccomp_listener: "true" seccomp_listener_continue: "true" shiftfs: "true" uevent_injection: "true" unpriv_fscaps: "true" kernel_version: 5.15.0-72-generic lxc_features: cgroup2: "true" core_scheduling: "true" devpts_fd: "true" idmapped_mounts_v2: "true" mount_injection_file: "true" network_gateway_device_route: "true" network_ipvlan: "true" network_l2proxy: "true" network_phys_macvlan_mtu: "true" network_veth_router: "true" pidfd: "true" seccomp_allow_deny_syntax: "true" seccomp_notify: "true" seccomp_proxy_send_notify_fd: "true" os_name: Ubuntu os_version: "22.04" project: default server: lxd server_clustered: false server_event_mode: full-mesh server_name: ... server_pid: 3051352 server_version: "5.14" storage: zfs storage_version: 2.1.5-1ubuntu6~22.04.1 storage_supported_drivers: - name: cephobject version: 17.2.5 remote: true - name: dir version: "1" remote: false - name: lvm version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0 remote: false - name: zfs version: 2.1.5-1ubuntu6~22.04.1 remote: false - name: btrfs version: 5.16.2 remote: false - name: ceph version: 17.2.5 remote: true - name: cephfs version: 17.2.5 remote: true ```

Pool config:

| ID                       | f52a63ee-7af9-4a19-b7f0-4f1f67de1579                       |
| Provider Name            | lxd_local                                                  |
| Image                    | gh-ubuntu22-20230507                                       |
| Flavor                   | runner                                                     |
| OS Type                  | linux                                                      |
| OS Architecture          | amd64                                                      |
| Max Runners              | 20                                                         |
| Min Idle Runners         |                                                            |
| Runner Bootstrap Timeout |                                                            |
| Tags                     | self-hosted, x64, Linux, ubuntu-latest, ubuntu-22.04       |
| Belongs to               | .........                                                  |
| Level                    | org                                                        |
| Enabled                  | true                                                       |
| Runner Prefix            | garm                                                       |
| Extra specs              |                                                            |
| GitHub Runner Group      |                                                            |

SystemKeeper commented 1 year ago

Would you mind pulling the latest commit and running a:

make build-static

Sure, thanks again! Out of curiosity: any reason to prefer make build-static over just using go install?

gabriel-samfira commented 1 year ago

Using make build-static will build against musl on Alpine. That binary does not depend on glibc and is fully static, so it can run on any Linux system regardless of glibc version. We can't really compile a fully static binary against glibc if gethostbyname() is involved, and that can lead to segmentation faults if we build on a newer version of glibc and try to run the binary on a system with an older, ABI-incompatible version.

Normally this is not an issue. In most cases you don't have to run your binary on an ancient version of Linux, but there are some environments out there that still run CentOS 6/7.
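
As a small illustration (not from the garm codebase), these are the two standard-library code paths that typically pull in glibc through cgo, and that the osusergo and netgo build tags in the go build command quoted below sidestep by using pure-Go implementations:

```go
// Illustrative only: with cgo enabled, these calls go through glibc's
// resolver/NSS, which is why a fully static glibc binary is fragile.
package main

import (
	"fmt"
	"net"
	"os/user"
)

func main() {
	// With cgo enabled this may call glibc's getaddrinfo/gethostbyname;
	// with -tags netgo (or CGO_ENABLED=0) Go's pure-Go resolver is used.
	addrs, err := net.LookupHost("github.com")
	fmt.Println(addrs, err)

	// With cgo enabled this goes through glibc NSS; with -tags osusergo
	// the pure-Go implementation reads /etc/passwd instead.
	u, err := user.Current()
	fmt.Println(u, err)
}
```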

SystemKeeper commented 1 year ago

I see, thanks for the explanation. Right now we do not have Docker installed on the host machine; that's why we did not use make build-static in the first place. I'll put this on my todo list.

gabriel-samfira commented 1 year ago

Podman also works (in case you don't want a running daemon). You can build it on any machine. You can of course also use go install, or:

go build -mod vendor \
    -o garm -tags osusergo,netgo,sqlite_omit_load_extension \
    -ldflags \
    " -s -w -X main.Version=$(git describe --always --dirty)" ./cmd/garm

gabriel-samfira commented 1 year ago

I removed the link mode external bit, as you don't need it, but the rest will give you a smaller binary and also add the version to the binary.
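
For reference, a minimal sketch of how the -X main.Version flag above lands in the binary and is read back by garm -version. The variable path matches the ldflags in the command quoted earlier, but the rest of this main package is invented for illustration:

```go
// Hypothetical, trimmed-down main package showing the version-injection
// mechanism used by the build command above.
package main

import (
	"flag"
	"fmt"
	"os"
)

// Version is baked in at link time via
// -ldflags "-X main.Version=$(git describe --always --dirty)".
// It stays "unknown" when built without the flag (e.g. a plain go install).
var Version = "unknown"

func main() {
	showVersion := flag.Bool("version", false, "print the version and exit")
	flag.Parse()

	if *showVersion {
		fmt.Println(Version)
		os.Exit(0)
	}
	// ... normal service startup would continue here
}
```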

SystemKeeper commented 1 year ago

So, containers deleted, garm updated, let's see what happens :)

gabriel-samfira commented 1 year ago

Ohh. It would have been nice to leave the containers and see if garm cleans them up 😄. No worries. Hopefully this bug is gone now.

SystemKeeper commented 1 year ago

Oh, I thought this would not work 😅

gabriel-samfira commented 1 year ago

Barring any silly bugs in the provider (like the one just fixed), it should do its best to clean up both in github and in the provider.

If you manually delete a runner from lxd, after about 5 minutes garm runs a cleanup function that detects orphaned runners and removes them from github. The same is true if you manually remove a runner from github.

SystemKeeper commented 1 year ago

Looking good after one day:

# lxc list | grep garm- | wc -l
20

Thanks again!

gabriel-samfira commented 1 year ago

Awesome! Feel free to open a new issue if you spot any weirdness.