abiosoft / colima

Container runtimes on macOS (and Linux) with minimal setup
MIT License

Colima instance getting unresponsive when using Virtualization.framework alongside other VMs #1074

Open · Raikerian opened 1 month ago

Raikerian commented 1 month ago

Description

Hi all. I am experiencing a weird issue where the colima instance (using Virtualization.framework) gets stuck and unresponsive while still reporting a Running state. This happens on machines where we run other Virtualization.framework VMs, two macOS virtual machines to be precise:

~ % colima ls
PROFILE    STATUS     ARCH       CPUS    MEMORY    DISK     RUNTIME    ADDRESS
default    Running    aarch64    2       2GiB      60GiB    docker
~ % colima status
FATA[0003] error retrieving current runtime: empty value
~ % colima ssh
FATA[0006] exit status 255
~ % docker ps
Cannot connect to the Docker daemon at unix:///Users/master/.colima/default/docker.sock. Is the docker daemon running?

One suspicion was memory pressure, so I limited the other VMs to leave 4 GiB (total) of host memory spare, yet it is still happening. While I try to find a way to reproduce this consistently, I would love to hear opinions and suggestions on what I can try to avoid this issue altogether.

Thanks.

Version

colima version 0.6.9
git commit: c3a31ed05f5fab8b2cdbae835198e8fb1717fd0f
limactl version 0.22.0
qemu-img version 9.0.1

Operating System

macOS on Apple Silicon (Mac Mini M2 Pro)

Output of colima status

~ % colima status
FATA[0003] error retrieving current runtime: empty value

Reproduction Steps

  1. Run 2 macOS VMs using Virtualization.framework alongside colima, which also uses Virtualization.framework (a rough sketch of this setup follows the list).
  2. Allocate all available CPUs, and all available memory minus 4 GiB, to the macOS VMs.
  3. Put the macOS VMs under load with some performance-heavy job.
  4. Colima will eventually get stuck, though not consistently. This happens at scale across different machines.
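
The thread does not spell out the exact commands; the following is a minimal sketch of the setup, assuming colima's standard --vm-type, --mount-type, --cpu and --memory flags, with placeholder commands for the macOS guest VMs:

# colima on Virtualization.framework with the default-profile sizing shown by colima ls
colima start --vm-type vz --mount-type virtiofs --cpu 2 --memory 2

# the two macOS guests take most of the remaining host resources
# (the VM manager and the exact sizes are not named in the issue; placeholders only)
# my-vm-tool start macos-guest-1 --cpus 6 --memory 14G
# my-vm-tool start macos-guest-2 --cpus 6 --memory 12G

# put the guests under heavy load, then poll colima until it hangs
colima status && docker ps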

Expected behaviour

The colima instance should not get stuck, or should at least report some resource pressure for debugging.

Additional context

No response

abiosoft commented 1 month ago

Can you try the latest development version?

# install development version
brew install --head colima

# delete existing profile
colima delete

# start afresh
colima start

If your machine is currently freezing, do a restart before running the commands above.

Raikerian commented 1 month ago

Let me try the latest development build with the same setup (2 VMs allocated all CPUs and all memory minus 4 GiB) and get back to you. Are there any other logs or some kind of memory dump I can get from the colima instance itself?

Raikerian commented 1 month ago

It happened on the latest HEAD version as well:

~ % colima version
colima version HEAD-c8c4c5a
git commit: c8c4c5a69b4e422dab76ac0a0f81515094302c2e
~ % colima ls
PROFILE    STATUS     ARCH       CPUS    MEMORY    DISK     RUNTIME    ADDRESS
default    Running    aarch64    2       2GiB      60GiB    docker
~ % colima status
FATA[0002] error retrieving current runtime: empty value
~ % colima ssh
FATA[0006] exit status 255

This time I confirmed that I only had 1 macOS virtual machine running alongside, which took 6 CPUs (out of 12) and 14 GiB of memory (out of 32 available). It seems like it's not related to resource pressure but rather something else. Any ideas?

Raikerian commented 1 month ago

Nothing in the logs so far.

ha.stderr.log:

{"level":"info","msg":"Not forwarding TCP 127.0.0.54:53","time":"2024-07-25T19:32:24+08:00"}
{"level":"info","msg":"Not forwarding TCP 127.0.0.53:53","time":"2024-07-25T19:32:24+08:00"}
{"level":"info","msg":"Not forwarding TCP [::]:22","time":"2024-07-25T19:32:24+08:00"}
{"level":"debug","msg":"stdout=\"\", stderr=\"+ timeout 30s bash -c 'until sudo diff -q /run/lima-boot-done /mnt/lima-cidata/meta-data 2\u003e/dev/null; do sleep 3; done'\\n\", err=\u003cnil\u003e","time":"2024-07-25T19:32:24+08:00"}
{"level":"info","msg":"The final requirement 1 of 1 is satisfied","time":"2024-07-25T19:32:24+08:00"}

ha.stdout.log:

{"time":"2024-07-25T19:32:14.639622+08:00","status":{"sshLocalPort":53292}}
{"time":"2024-07-25T19:32:24.930697+08:00","status":{"running":true,"sshLocalPort":53292}}

daemon.log:

time="2024-07-26T04:15:39+08:00" level=info msg="syncing inotify event for /Users/master/Library/Caches/mounts/e7082dfd-1386-40da-90f5-615a15d3de3b/repo/swiftlint " context=inotify
time="2024-07-26T04:31:07+08:00" level=error msg="error fetching docker volumes: error listing containers: error running [lima docker ps -q], output: \"\", err: \"exit status 255\"" context=inotify
abiosoft commented 1 month ago

Which MacBook are you using?

Raikerian commented 1 month ago

Mac Mini M2 Pro

abiosoft commented 1 month ago

If you do not need rosetta or faster host volume access, qemu is more stable.
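
A minimal sketch of switching an existing profile to QEMU, assuming the standard --vm-type flag and that changing the VM type requires recreating the instance:

# back up anything important first: colima delete wipes the instance
colima delete

# recreate the profile on the QEMU backend
colima start --vm-type qemu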

Raikerian commented 1 month ago

Unfortunately, I need the fast filesystem that comes with Virtualization.framework (virtiofs).

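A hedged sketch of the start command that keeps virtiofs, and therefore Virtualization.framework; the mount-type mapping in the comment is an assumption about colima 0.6.x rather than something stated in the thread:

# assumed mount types per VM type: vz -> virtiofs or sshfs, qemu -> sshfs or 9p
colima start --vm-type vz --mount-type virtiofs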

Raikerian commented 1 month ago

After updating to this commit https://github.com/abiosoft/colima/commit/172580d679a67798ffe13f49a0204d38d7b24b88, I have not experienced this issue for the last 2 days. I will continue testing in case I was just lucky (before, it was quite fast to reproduce).

abiosoft commented 1 month ago

Good to hear.

Looking forward to your further feedback.

Raikerian commented 1 month ago

After running 100+ instances for about 4 days, I have only hit the issue on one so far. So it's definitely more stable now, but it can still happen.

abiosoft commented 1 month ago

After running 100+ instances for about 4 days, I have only hit the issue on one so far. So it's definitely more stable now, but it can still happen.

If it happens this rarely, I think it is more than satisfactory.

I'll leave the issue open to gather possible feedback from others.

Thanks.