Lilypad-Tech / lilypad

Run AI workloads easily in a decentralized GPU network. https://www.youtube.com/watch?v=zeG2F-JANjI
https://lilypad.tech
Apache License 2.0

Apple Silicon support for Lilypad RPs #60

Open · AquiGorka opened this issue 3 months ago

AquiGorka commented 3 months ago

@Zorlin can this be done within 1 week? If not we put a pin on it til later.

Zorlin commented 2 months ago

> @Zorlin can this be done within 1 week? If not we put a pin on it til later.

In short, no. I think this is a good idea and something we think is technically possible, but I would like to focus on other things for now and come back to it.

What needs to be determined is reasonably simple: can a container, run as a job under Bacalhau, reach a service running on the host?

If the answer is yes, this is reasonably possible; if the answer is no (which I suspect, given the way Bacalhau sandboxes Docker further than standard Docker does), it can't be done (yet?)

Either way, though, I think it's more than a day of searching to find answers on this, which is probably not worth it right at this moment during testnet hell.

Zorlin commented 2 months ago

I looked into this today to figure out how "possible" it is and get some better insights into it.

You can see the work here -

https://github.com/Zorlin/lilypad-comfyui-proxy

but the guts of it is really just running this shell script inside Docker:

#!/bin/sh

# Determine the host IP address from the special DNS name `host.docker.internal`
HOST_IP=$(getent hosts host.docker.internal | awk '{ print $1 }')

# Run the curl command using the resolved host IP
curl http://$HOST_IP:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

The idea of the shell script is simple: it resolves host.docker.internal, talks to the Ollama API, and tries to run an LLM through it. We'd actually want to use ComfyUI for this, but since I already had Ollama working on my machine with models downloaded, I figured I'd go with the quicker shortcut and hack it into place, at least to see whether, in theory, we can contact a program running on the host from within standard Docker.
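As a quick sanity check before kicking off a generation, something like this confirms the container can actually see the host service (a minimal sketch; it assumes Ollama's default port 11434 and its /api/tags endpoint, which lists the models available locally):

#!/bin/sh

# Resolve the host's IP from the special DNS name (resolves automatically under Docker Desktop on macOS)
HOST_IP=$(getent hosts host.docker.internal | awk '{ print $1 }')

# Bail out early if the name didn't resolve at all
if [ -z "$HOST_IP" ]; then
  echo "host.docker.internal did not resolve; the host is not reachable from this container" >&2
  exit 1
fi

# Ask Ollama which models it has; any JSON response means host connectivity works
curl -s "http://${HOST_IP}:11434/api/tags"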

I built the Dockerfile with docker build . and ran the resulting image directly by ID (docker run 6c4c629b656d12a38a5f4ffa90bb827ef1a45450e8118641690f8463ed1eb5c7), and got a stream of continuous output from llama2.
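For reference, the build-and-run steps are roughly the following (a sketch; the lilypad-comfyui-proxy tag is just an illustrative name, and the --add-host flag is only needed on a plain Linux Docker engine, where host.docker.internal doesn't resolve automatically the way it does under Docker Desktop on macOS):

# Build the image from the repo's Dockerfile and give it a readable tag
docker build -t lilypad-comfyui-proxy .

# Under Docker Desktop on macOS, host.docker.internal just works:
docker run --rm lilypad-comfyui-proxy

# On a plain Linux Docker engine (20.10+), map the name to the host gateway explicitly:
docker run --rm --add-host=host.docker.internal:host-gateway lilypad-comfyui-proxy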

Okay, so that's a good start. We can do basic control over an API from within Docker, which is mostly expected but was pleasantly easy to implement, especially given that we don't have to guess an IP that might change and instead have a nice DNS name we can use.

https://docs.bacalhau.org/setting-up/jobs/job-specification/network describes how the network spec works; I have since worked out that, to make this work reasonably securely, we will want to use something like this in the job Spec:

            "Network": {
                "Type": "HTTP",
                "Domains": ["host.docker.internal"]
            },
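For context, here's a rough sketch of how that block could sit inside a fuller Docker-engine job spec. Everything outside the Network block is reconstructed from memory of the Bacalhau docs rather than taken from a working job, and the image name and entrypoint are placeholders, so the field names and nesting should be checked against the Bacalhau version Lilypad pins:

# Sketch only: write out a Bacalhau-style job spec whose network access is
# restricted to HTTP calls to host.docker.internal and nothing else.
cat > spec.json <<'EOF'
{
    "Spec": {
        "Engine": "Docker",
        "Docker": {
            "Image": "lilypad-comfyui-proxy",
            "Entrypoint": ["/bin/sh", "/run.sh"]
        },
        "Network": {
            "Type": "HTTP",
            "Domains": ["host.docker.internal"]
        }
    }
}
EOF

The appeal of the HTTP network type is that the job keeps access to the one host endpoint it needs while everything else stays sandboxed.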

Among the problems/things to note with this, however -

Things we learnt today:

Blockers:

Chance of Mac inference in our near future: call it 70%?