Open FrantaNautilus opened 3 months ago
Hi, thank you so much as always for the detailed analysis!
You're correct that Nvidia CTK and resource sharing is set up via the Cross Service files (I also call them .x. files, haha).
In such a file, nvidia is a tag/label determined from the CLI environment; right now it's done by probing for both the nvidia-smi and nvidia-container-toolkit binaries in the PATH.
We can expand the env detection logic to include something specific to the CDI driver and add the compose.x.<service>.cdi.yml files (by cloning from the .nvidia. files with a script).
What I'd need for that is a way to probe such an environment that wouldn't conflict with the existing .nvidia. probes:
command -v nvidia-smi &>/dev/null
command -v nvidia-container-toolkit &>/dev/null
If we can figure out such a probe for CDI systems, I'll be able to implement support in one of the next releases.
Thank you for considering this enhancement and of course for continuing to develop harbor. I do not want to burden you with additional work, so I will do my best to find out how to detect CDI and differentiate it from the CTK approach. I will add the information I find to this issue.
I am also not sure if I understand the detection mechanism; currently it is executed for each call to docker compose. However, does it really need to be re-detected? Apart from cases where the device is a MUX-switch-equipped laptop (switching between hybrid and integrated graphics) or the user changes the Nvidia configuration, the detection result should not change. For this reason I thought the process could be simplified by introducing a variable in the harbor config file, which would be set during installation by automatic detection but would also allow a manual override. Another advantage would be faster execution, since several detection commands would not need to be re-run on every call.
I will do my best to find how to detect CDI and differentiate it from CTK approach
Thank you so much for the kind words and for the suggestion to scope this out, it'll make things much simpler, as I don't have a NixOS system set up and am only vaguely familiar with such setups atm. I hope there's a similar way to detect it, based on the presence of a binary or a specific parseable output from something.
need to be re-detected
I'd prefer it to be dynamic up until the point it starts causing performance issues in the CLI. This is something of a personal preference after building software for many years: resolve everything dynamically, and cache/memoise only when there are measurable issues.
MUX switch equipped laptop, user changes the Nvidia configuration
These two items match me as a user, haha
That said, I don't see a problem in being able to turn automatic capability detection off as a setting. After that, the current harbor defaults can already be abused a tiny bit by adding a custom service handle to match the CDI resource files:
# 1. Add "compose.x.service.cdi.yml" files with expected GPU resource sharing syntax as needed
# 2. Permanently add "cdi" handle to the list of services
harbor defaults add cdi
# 3. Up and other commands will match "cdi" overrides from "compose.x.service.cdi.yml" automatically
harbor up
and it'll also be very easy to add some config alias with manual capability overrides
I decided to make things simpler right away: latest main has configurable capability detection, so you can turn it off immediately and use that instead of the harbor defaults approach for adding cdi cross files.
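For example, something along these lines should do it (a sketch using the config keys that come up later in this thread; the exact value format for the capability list is an assumption):

# turn automatic capability detection off and pin the capability manually
harbor config set capabilities.autodetect false
harbor config set capabilities.default 'cdi'
# subsequent commands should then match compose.x.<service>.cdi.yml overrides
harbor up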
Thank you so much for both the extensive reply (I am sorry if my ideas are not very useful; in programming I am just an amateur, yet I am always happy to learn) and for the new feature; this will make the support much easier to implement.
I should be able to finish my research on CDI detection by next weekend and come up with a way to reliably detect it. Currently I am looking into the logic behind the contents of /etc/cdi/nvidia.yaml.
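For reference, a CDI spec is just a YAML document listing the device nodes, mounts and hooks the runtime should inject; an abbreviated sketch of the structure (illustrative only, real files generated by nvidia-ctk cdi generate are much longer):

# structure of /etc/cdi/nvidia.yaml (heavily abbreviated)
cdiVersion: "0.5.0"
kind: "nvidia.com/gpu"
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
containerEdits:
  deviceNodes:
    - path: /dev/nvidiactl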
EDIT: I am sorry for taking longer than I expected. I caught the flu and I will do my best to finish the work over this week.
According to the Nvidia documentation, the CDI driver, when configured, has a config file nvidia.yaml located at /etc/cdi/ or /var/run/cdi/. The existence of this file is a good candidate for detecting a configured CDI driver. The other detection option suggested in the Nvidia documentation is the nvidia-ctk cdi list command, which does not work on some systems, e.g. NixOS (nvidia-ctk is absent from the default configuration).
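A probe in the spirit of the existing nvidia-smi / nvidia-container-toolkit checks could therefore simply test for the spec file, e.g. (just a sketch, not a final implementation):

# signal "cdi" when a generated Nvidia CDI spec exists in either standard location
[ -e /etc/cdi/nvidia.yaml ] || [ -e /var/run/cdi/nvidia.yaml ]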
The documentation also suggests that other drivers need to take precedence, because CDI cannot be used simultaneously with the nvidia driver.
With your permission I can start working on a PR containing:
- changes to harbor.sh for detecting CDI based on the presence of the nvidia.yaml file
- precedence handling for the nvidia driver

Thank you so much for getting back to me with the details!
I actually wanted to seed the cdi files with a script instead of writing them manually, based on the existing .nvidia. files, since the content will be pretty much a copy-paste in all instances, except for a change in the file name and the service handle.
So, I just pushed:
- a cdi capability detector
- a script that seeds .cdi. cross files wherever .nvidia. was previously defined

https://github.com/av/harbor/commit/f18218e60408db60e40767c3ecb40c1a33fbd1c4
I'd like to kindly ask your help in validating whether these files actually work as expected now (with one or two services of your choice).
Thank you, I hoped to save you some work on the project, yet I was too slow. It is quite late in my timezone, but I should be able to test it tomorrow (ollama, kobold, comfyUI/stable-diffusion).
Thanks for being considerate! Absolutely no rush timing-wise, I just happened to be around the repo and already knew what I wanted to do with the script. Just ping me with the test results when ready, thanks!
Reading the commit I have just noticed a typo: on line 18 of .scripts/seed-cdi.ts there is
- driver: cdo
and there should be
- driver: cdi
Ah, I took it from your initial message, haha. I'll fix the script tomorrow and will ping you here 👍
I am sorry for my mistake - I am reading my first post and I still cannot believe I made such a typo even though I read my post several times.
Absolutely no worries, I just find it amusing how it circled back to you 😀
JFYI, I pushed fixed .cdi. files with the correct driver identifier to main.
Thank you, I have installed it and I am now testing - more precisely, I am trying to figure out why the cdi cross compose files are not composed, even when cdi is added explicitly (default_auto_capabilities set to false).
EDIT1:
This is not a real problem - I made an error while manually setting the capabilities.
I will have to investigate further, because harbor down is broken now for me, giving this error:
harbor down
17:20:47 [DEBUG] Active services: ollama webui
17:20:47 [DEBUG] Matched:
17:20:47 [DEBUG] Matched compose files: 203
service "perplexideez" refers to undefined config perplexideez_proxy_config: invalid compose project
However, even the debug messages look wrong due to the number of matched files.
EDIT2:
This is not a real problem - I made an error while manually setting the capabilities.
When the option capabilities.autodetect is set to true, harbor down works, even though the cdi files are not getting composed. And when capabilities.autodetect is set to false, harbor down leads to the error above in EDIT1.
EDIT3: After testing on Bluefin Linux, I found out that on a system with Docker and Podman installed simultaneously, both the nvidia and cdi drivers are present, causing a problem for autodetection.
EDIT4: After testing on NixOS with Docker, I found that the /var/run/cdi/nvidia.yaml config file is replaced by /var/run/cdi/nvidia-container-toolkit.json, which also sets the locations of the Nvidia libraries and is functionally equivalent to the YAML file. Suggesting a fix in #119.
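So a file-presence probe would need to cover both variants, e.g. (a sketch; the actual fix in #119 may do this differently):

# signal "cdi" when either the YAML or the JSON Nvidia CDI spec exists
[ -e /etc/cdi/nvidia.yaml ] || [ -e /var/run/cdi/nvidia.yaml ] || [ -e /var/run/cdi/nvidia-container-toolkit.json ]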
Additionally, I noticed that the wiki suggests setting the capabilities via
harbor config set capabilities.list '<capabilities>'
which is not read by harbor.sh. What worked for me instead was to set
harbor config set capabilities.default '<capabilities>'
Thanks so much for the fixes! Merged to main, will fix the docs as well 👍
Thank you, now harbor works fine with Nvidia on NixOS (with the default configuration) on Docker.
I suggest closing the issue, because Podman support is a topic which would need a separate issue, and currently I am short on time to test it properly. Nevertheless, with CDI supported, there is only one problem I am aware of, i.e. the --wait flag, which is currently unsupported on Podman.
Thanks for confirming! I did a brief search, but haven't found anything that'd differentiate a docker compose that runs with podman vs one that runs with Docker as-is. In your opinion, would checking for the podman binary be sufficient, or should we aim for something more sophisticated?
With a way to check for podman, removing the --wait in run_up would be relatively trivial.
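Roughly what I have in mind, assuming the podman-docker shim identifies itself via docker --version (an assumption I haven't verified on a real setup):

# hypothetical check: drop --wait when the docker CLI is actually Podman
if docker --version 2>/dev/null | grep -qi podman; then
    up_flags=""
else
    up_flags="--wait"
fi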
I tried getting harbor to work with Podman by installing Podman, podman-compose and the adapter (the one which makes the docker command run podman instead), and I removed the --wait flag. This was before CDI and I was not testing Nvidia compatibility.
The reason I cannot test with Podman now is that on my computer I have just finished configuring Winapps (which makes Windows programs run as Linux programs using RDP, containers and QEMU), another project heavily relying on docker-compose. While configuring Winapps I found that there are "some" differences which make docker-compose and podman-compose incompatible, at least in edge cases like Winapps. I did not investigate in more depth, however I believe there are some minor differences in networking, which makes it relevant for harbor. Perhaps it would require testing all of the network-reliant harbor services.
I would like to help with testing Podman, however changing my configuration could break Winapps and even cause the loss of a Windows license. For this reason I would like to postpone it until I get my backup computer running with Podman (it has an Nvidia GPU, so it should be possible to do some testing despite its CUDA compute capability being too old for serious use).
If I understand correctly, the current support for Nvidia GPUs relies on adding an appropriate compose.x.<service>.nvidia.yml file to the compose files loaded with docker-compose; these contain the device description (the nvidia device reservation block). This is a problem for systems relying on the CDI driver for Nvidia support in containers, such as NixOS. Allowing the user to change this section based on the system configuration would also (partially) enable use of Podman instead of Docker, since the former relies on CDI. The section for a CDI system would look like the following:
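(a sketch of the intended device block, assuming the standard Docker Compose CDI device syntax; the exact device_ids value may need adjusting per system)

deploy:
  resources:
    reservations:
      devices:
        - driver: cdi
          device_ids:
            - nvidia.com/gpu=all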
I have tested this change for ollama + webui and it does work.
Another problem is that on a CDI system such as NixOS the detection of nvidia-ctk fails even though nvidia-container-toolkit works. This could be solved by adding a "force-nvidia" variable which would allow forcing Nvidia even though the checks fail.
Regarding Podman, the only place where it is currently incompatible, apart from Nvidia, seems to be the --wait flag for docker compose; this may be fixed in future releases of Podman.