Closed cfergeau closed 2 months ago
One initial issue is https://github.com/containers/krunkit/issues/8: krunkit is currently only available on Apple Silicon machines; it's not available for Intel-based Macs.
krunkit does not accept certain arguments such as --kernel and --kernel-cmdline which are currently being used by crc to start a vfkit machine. These arguments can be removed if the boot mode is changed to UEFI (the issue: https://github.com/crc-org/crc/issues/4180). Addressing this first.
@vyasgun but has it at least been tried without those options?
@praveenkumar I appreciate the initiative to create the PR for using UEFI with vfkit. Running the VM without those options could only have been tried with those code changes. Another flag that needs to be removed for krunkit VM is --timesync.
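As a rough illustration of the flag removal discussed above, here is a minimal Python sketch that filters a vfkit-style argument list down to what krunkit is reported to accept. The function name and the sample values are hypothetical, and it assumes each of the three flags takes exactly one value; crc itself builds its command line in Go, so this is not the actual implementation.

```python
# Flags the thread reports krunkit rejecting. Assumption: each flag takes
# exactly one value, either as the next argument or as "--flag=value".
UNSUPPORTED_BY_KRUNKIT = {"--kernel", "--kernel-cmdline", "--timesync"}


def strip_unsupported(args, unsupported=UNSUPPORTED_BY_KRUNKIT):
    """Return a copy of args without the unsupported flags and their values."""
    kept = []
    skip_value = False
    for arg in args:
        if skip_value:
            # This argument is the value of a flag that was just dropped.
            skip_value = False
            continue
        name = arg.split("=", 1)[0]
        if name in unsupported:
            # "--flag value" consumes the next argument; "--flag=value" does not.
            skip_value = "=" not in arg
            continue
        kept.append(arg)
    return kept


# Hypothetical vfkit-style command line (values are placeholders).
vfkit_args = [
    "--cpus", "2",
    "--kernel", "/path/to/vmlinuz",
    "--kernel-cmdline", "console=hvc0",
    "--timesync", "vsockPort=1234",
    "--device", "virtio-rng",
]
krunkit_args = strip_unsupported(vfkit_args)
```

The values of dropped flags are skipped by position rather than by content, so a value such as `console=hvc0` is not mistaken for a flag of its own.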
I have tried using the new options with krunkit and there has been progress. The VM process is running but there is some issue with the virtio-net device.
If I am correct, according to the code, the device is only being added to vfkit when system mode networking is used:
Can you confirm this? And its relevance to the networking modes? (Please note, I have added this by changing my personal fork of the codebase here: https://github.com/vyasgun/crc/tree/spike/uefi but I have some questions/clarifications I need)
Apologies if the question is too naive but there's not much documentation to follow :)
If I am correct, according to the code, the device is only being added to vfkit when system mode networking is used:
you can remove virtio-net option because we are not allowing system-mode networking for mac and it is not even tested.
Another flag that needs to be removed for krunkit VM is --timesync.
This needs some more digging to provide a better answer, but for the time being (for a PoC), if something works without it, that should be progress (also check how podman-machine handles time sync).
With all those changes, are you able to run the VM with krunkit and provision a cluster (microshift/openshift)? If yes, does it have an advantage over vfkit (in terms of performance)?
timesync was due to a problem with the sleep/idle state of the VM. It might need some more investigation in general to determine whether this time skew still happens. In conclusion: leave this out for now; it will need a new issue.
This needs some more digging to provide a better answer, but for the time being (for a PoC), if something works without it, that should be progress (also check how podman-machine handles time sync).
podman-machine is not using --timesync with either vfkit or krunkit. A little more digging is required. However, virtio-net is being used, and a slightly more detailed explanation of its relevance to our use case would help me.
Yes, I can get the krunkit process running. The command being used:
podmanqe@dev-platform-mac4 ~ % /opt/homebrew/bin/krunkit --cpus 2 --memory 4096 --bootloader efi,variable-store=/Users/podmanqe/.crc/machines/crc/efistore.nvram,create --device virtio-fs,sharedDir=/Users/podmanqe,mountTag=dir0 --device virtio-rng --device virtio-blk,path=/Users/podmanqe/.crc/machines/crc/crc.img --device virtio-vsock,port=1024,socketURL=/Users/podmanqe/.crc/tap.sock,listen --restful-uri tcp://localhost:8080
podmanqe@dev-platform-mac4 ~ % curl 127.0.0.1:8080 --output -
{"state": "VirtualMachineStateRunning"}%
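The state check above can be scripted. The following is a minimal Python sketch, assuming the response body has the exact shape shown in the curl output; the function name is hypothetical. The trailing `%` in the transcript is zsh marking output that does not end in a newline, so it is not part of the JSON.

```python
import json


def vm_is_running(body: str) -> bool:
    """Return True if the response body reports a running VM.

    Assumes the body looks like {"state": "VirtualMachineStateRunning"},
    as in the curl output above. Anything else counts as not running.
    """
    try:
        state = json.loads(body).get("state")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or valid JSON that is not an object.
        return False
    return state == "VirtualMachineStateRunning"


# Body copied from the transcript above, without the shell's "%" marker.
running = vm_is_running('{"state": "VirtualMachineStateRunning"}')
```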
you can remove virtio-net option because we are not allowing system-mode networking for mac and it is not even tested.
krunkit goes to the api login page to manually enter the password. I just want to be sure if not using virtio-net might be affecting this.
krunkit goes to the api login page to manually enter the password. I just want to be sure if not using virtio-net might be affecting this.
This is when you are trying to run it directly using the cli command. Does it work when you change the crc code base and use the krunkit binary instead of vfkit? I think with the cli it is expected since no ssh key is passed.
This is when you are trying to run it directly using the cli command. Does it work when you change the crc code base and use the krunkit binary instead of vfkit? I think with the cli it is expected since no ssh key is passed.
No, it doesn't seamlessly run through CRC code base as of now which is why I am trying to figure out the required options. Apart from this, the machine is in running state as mentioned in my previous comment. The cause is either the ssh settings or the ignition config. podman-machine logs in directly (it is using podman-machine-default-ignition.sock) because a certain set of commands is executed during its startup.
Can you still point me to the use of virtio-net and why is it only used for system mode networking? It will be helpful for me. Thanks :)
Can you still point me to the use of virtio-net and why is it only used for system mode networking?
Before migrating to vfkit we used hyperkit ( https://github.com/moby/hyperkit ) as the driver. It used virtio-net, but that didn't give us a way to handle VPN connections effectively, so we went with https://github.com/containers/gvisor-tap-vsock (user-mode networking). We supported both for a while, but gradually made user-mode networking the default, obsoleting virtio-net, and we are not even testing it any more.
More info around virtio-net : https://www.redhat.com/en/blog/introduction-virtio-networking-and-vhost-net
No, it doesn't seamlessly run through CRC code base as of now which is why I am trying to figure out the required options.
To me, this machine is booted and the sshd service should be running. I am now more interested in what error you get if you just rename krunkit to vfkit and try crc start --log-level debug.
I was able to bring up the crc VM using the following changes: https://github.com/vyasgun/crc/commit/6eafcf67f21507c8c45395f74c94c6d00b8f7491 (Please note it's just a POC with some hardcode just for testing purposes)
Verifying it's using krunkit:
podmanqe@dev-platform-mac4 ~ % crc config view
- consent-telemetry : no
- cpus : 4
- memory : 16384
- preset : microshift
- skip-check-vfkit-installed : true
podmanqe@dev-platform-mac4 crc % crcssh
Warning: Permanently added '[127.0.0.1]:2222' (ED25519) to the list of known hosts.
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Boot Status is GREEN - Health Check SUCCESS
[core@api ~]$ ls /dev/dri
by-path card0 renderD128
I also ran an InstructLab pod on CRC with the following spec and made it run some prompts by using an interactive terminal ( kubectl exec -ti mistral-pod -- bash ). The prompts are working but the responses are very slow compared to podman-machine using krunkit.
podmanqe@dev-platform-mac4 gunjan % cat mistral-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mistral-pod
spec:
  containers:
  - image: quay.io/slopezpa/fedora-vgpu-llama
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300; done;" ]
    name: mistral-pod
    volumeMounts:
    - mountPath: /dev/dri
      name: dev-dri
    - mountPath: /models
      name: downloads
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: dev-dri
    hostPath:
      path: /dev/dri
  - name: downloads
    hostPath:
      path: /Users/podmanqe/Downloads
However, crc status doesn't show the VM as running, so the changes needed for everything to work in sync still need to be looked into, even though the VM can be ssh'd into.
podmanqe@dev-platform-mac4 ~ % crc status
CRC VM: Stopped
MicroShift: Stopped (v4.16.4)
RAM Usage: 0B of 0B
Disk Usage: 0B of 0B (Inside the CRC VM)
Persistent Volume Usage: 0B of 0B (Allocated)
Cache Usage: 67.12GB
Cache Directory: /Users/podmanqe/.crc/cache
According to the spike, CRC can use krunkit. The next steps depend on whether we want to simply replace vfkit with krunkit in our code or support it alongside vfkit. The code changes seem straightforward.
podman-machine is not using --timesync in both vfkit and krunkit. A little more digging into is required.
They are using https://chrony-project.org/doc/4.5/chrony.conf.html#makestep instead: https://github.com/containers/podman-machine-os/blob/main/podman-image-daily/50-podman-makestep.conf
I also ran an InstructLab pod on CRC with the following spec
Did you use the same yaml with podman-machine for comparison? For a start, you could ssh into the crc krunkit VM, and run an AI workload by directly using podman ...
@cfergeau Yes, it's the same yaml. I tried running the llama.cpp code in the following ways and here are the results (For reference: https://github.com/ggerganov/llama.cpp/discussions/1323#discussioncomment-5916462 has the following list which describes the parameters):
- load time: loading model file
- sample time: generating tokens from the prompt/file choosing the next likely token.
- prompt eval time: how long it took to process the prompt/file by LLaMa before generating new text.
- eval time: how long it took to generate the output (until [end of text] or the user set limit).
- total: all together
llama_print_timings: load time = 4430.66 ms
llama_print_timings: sample time = 16.25 ms / 259 runs ( 0.06 ms per token, 15937.48 tokens per second)
llama_print_timings: prompt eval time = 1631.53 ms / 5 tokens ( 326.31 ms per token, 3.06 tokens per second)
llama_print_timings: eval time = 12403.26 ms / 258 runs ( 48.07 ms per token, 20.80 tokens per second)
llama_print_timings: total time = 14076.18 ms / 263 tokens
llama_print_timings: load time = 3422.64 ms
llama_print_timings: sample time = 50.78 ms / 649 runs ( 0.08 ms per token, 12781.38 tokens per second)
llama_print_timings: prompt eval time = 1780.76 ms / 5 tokens ( 356.15 ms per token, 2.81 tokens per second)
llama_print_timings: eval time = 38451.63 ms / 648 runs ( 59.34 ms per token, 16.85 tokens per second)
llama_print_timings: total time = 40348.54 ms / 653 tokens
llama_print_timings: load time = 45553.22 ms
llama_print_timings: sample time = 43.01 ms / 563 runs ( 0.08 ms per token, 13089.37 tokens per second)
llama_print_timings: prompt eval time = 44973.51 ms / 9 tokens ( 4997.06 ms per token, 0.20 tokens per second)
llama_print_timings: eval time = 4552762.02 ms / 562 runs ( 8101.00 ms per token, 0.12 tokens per second)
llama_print_timings: total time = 4602622.83 ms / 571 tokens
Running a kubernetes pod on crc (takes much longer):
Could it be picking up an amd64 image instead of an arm64? This would explain the problems. You could try to get a shell inside the pod to try to understand what's happening, or try to compare commandlines in the VM to see if there are obvious differences
@cfergeau The image is arm64 (i checked inside the VM)
[core@api ~]$ sudo crictl inspecti quay.io/slopezpa/fedora-vgpu-llama | jq -r '.info.imageSpec.architecture'
arm64
And also inside the mistral-pod, the binary being run is built for arm64:
gvyas@Gunjans-MacBook-Pro specs % kubectl logs -f mistral-pod
Log start
main: build = 2238 (56d03d92)
main: built with cc (GCC) 13.2.1 20231205 (Red Hat 13.2.1-6) for aarch64-redhat-linux
Running the pod as privileged was required for accessing the gpu. Now it takes roughly the same amount of time.
llama_print_timings: load time = 4772.51 ms
llama_print_timings: sample time = 58.51 ms / 669 runs ( 0.09 ms per token, 11433.36 tokens per second)
llama_print_timings: prompt eval time = 1780.24 ms / 5 tokens ( 356.05 ms per token, 2.81 tokens per second)
llama_print_timings: eval time = 40126.60 ms / 668 runs ( 60.07 ms per token, 16.65 tokens per second)
llama_print_timings: total time = 42043.88 ms / 673 tokens
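To make the comparison concrete, here is a small Python sketch that parses llama_print_timings lines and recomputes the per-token figures. The regular expression is written against the lines quoted in this thread only, and the helper names are hypothetical. It assumes the slow block (eval time of 4552762.02 ms) is the pod run before it was made privileged.

```python
import re

# Matches e.g. "llama_print_timings: eval time = 12403.26 ms / 258 runs ( ... )".
# The "/ N runs|tokens" part is optional because "load time" has no count.
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timing(line):
    """Return (name, total_ms, count); count is None when the line has none."""
    m = TIMING_RE.search(line)
    if m is None:
        return None
    count = int(m.group("count")) if m.group("count") else None
    return m.group("name"), float(m.group("ms")), count


def ms_per_token(line):
    """Average milliseconds per token for a line that carries a count."""
    _, total_ms, count = parse_timing(line)
    return total_ms / count


# Eval-time lines copied from the thread.
slow = "llama_print_timings: eval time = 4552762.02 ms / 562 runs ( 8101.00 ms per token, 0.12 tokens per second)"
privileged = "llama_print_timings: eval time = 40126.60 ms / 668 runs ( 60.07 ms per token, 16.65 tokens per second)"

slowdown = ms_per_token(slow) / ms_per_token(privileged)
print(f"slow run is about {slowdown:.0f}x slower per generated token")
```

Comparing milliseconds per token rather than total time keeps the comparison fair, since the runs generated different numbers of tokens.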
The next steps will be documented in: https://github.com/crc-org/crc/issues/4341
krunkit is a drop-in replacement for vfkit from a cmdline argument point of view. podman-machine can make use of it, see https://docs.google.com/document/d/1IZCWAY5zMHqd0YlbnpGtCe7HNeWKQNHi8RuhujAJmg0/edit for some details.
Since it has additional features compared to vfkit, it would be interesting to know if crc can make use of it.
In order to test krunkit + crc, a few steps that come to mind:
- checkVfkitInstalled in preflight_checks_darwin.go needs to be skipped or adjusted as it contains a vfkit version check which likely won't work with krunkit (different version number). There must be a crc config set skip-xxxx option to avoid this code
- NewVfkitCache code and related methods in cache_darwin.go will need to be changed (but I think this code won't be run during testing).