**alexschroeter** opened this issue 1 month ago
Yes, this is a bit of a tricky issue. I was hoping there was an open standard for "node selectors"/"node affinity" (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/), but I couldn't really find one. Maybe some preliminary selectors would be a great idea, similar to how requirements for services are now implemented. Here is a draft for this: https://github.com/jhnnsrs/arkitekt_next/blob/main/arkitekt_next/cli/types.py
Contrary to what is outlined as `build_docker_params` there, I don't believe this should be handled by the library itself. It should be handled by the engine, i.e. this app, which inspects the selectors and chooses which params to pass. This would allow us to stay backwards compatible with different versions of the Docker and Apptainer APIs (because these fuckers change all the time :D). What do you think?
Which could be translated to the underlying engine, I imagine.
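As a rough illustration of what such selector declarations might look like on the app side (all class and field names here are invented for this sketch, not the actual contents of the linked `types.py`):

```python
# Hypothetical sketch only: selector declarations an app could ship with,
# leaving the engine to decide how to satisfy them. Names are invented.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GpuSelector:
    vendor: str                          # e.g. "nvidia" or "amd"
    min_memory_mb: Optional[int] = None  # optional finer-grained constraint

@dataclass
class Requirements:
    selectors: List[GpuSelector] = field(default_factory=list)

# An app declares *what* it needs, never *how* to start the container:
reqs = Requirements(selectors=[GpuSelector(vendor="amd")])
```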
> Yes, this is a bit of a tricky issue. I was hoping there was an open standard for "node selectors"/"node affinity" (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/), but I couldn't really find one.

I know only of these node selectors for GPUs (https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).
> Maybe some preliminary selectors would be a great idea, similar to how requirements for services are now implemented. Here is a draft for this: https://github.com/jhnnsrs/arkitekt_next/blob/main/arkitekt_next/cli/types.py

I think these preliminary selectors are quite sufficient for a while, and I would go for this solution until more practical examples give us some guidance on common use cases. I imagine a basic matrix of one rather static setting per combination of container technology × GPU vendor to be sufficient for quite a while.
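Sketched as data, that matrix could be as small as the following (the concrete flag values are the commonly used ones for Docker and Apptainer, but treat the exact table as an assumption, not a verified default):

```python
# A static matrix: one setting per (container technology, GPU vendor) cell.
# The flags below are the commonly used invocations; illustrative only.
RUNTIME_FLAGS = {
    ("docker", "nvidia"):    ["--gpus", "all"],
    ("docker", "amd"):       ["--device=/dev/kfd", "--device=/dev/dri"],
    ("apptainer", "nvidia"): ["--nv"],
    ("apptainer", "amd"):    ["--rocm"],
}

def flags_for(technology: str, vendor: str) -> list:
    """Look up one cell of the matrix; unknown combinations get no flags."""
    return RUNTIME_FLAGS.get((technology, vendor), [])
```

Keeping this table inside the engine (rather than the library) is what makes it cheap to update when a container runtime changes its CLI.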
> Contrary to what is outlined as `build_docker_params` there, I don't believe this should be handled by the library itself. It should be handled by the engine, i.e. this app, which inspects the selectors and chooses which params to pass. This would allow us to stay backwards compatible with different versions of the Docker and Apptainer APIs (because these fuckers change all the time :D). What do you think?

So you would have, for each flavor, something like `arkitekt-next run prod docker://jhnnsrs/container:version-gpu_flavor`, and you still wouldn't be able to determine the correct run parameters without the deployer.
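A minimal sketch of why that encoding falls short: the flavor can be recovered from the tag, but it still has to be mapped to run parameters by the deployer (the `version-flavor` tag scheme and function name are assumptions for illustration):

```python
# Hypothetical sketch: recover the GPU flavor from a "version-flavor" tag.
# Even with the flavor in hand, the deployer must still translate it into
# engine-specific run parameters, so the tag alone doesn't buy much.
def split_flavor(image_ref: str) -> tuple:
    """Split 'docker://repo/name:version-flavor' into (version, flavor)."""
    tag = image_ref.rsplit(":", 1)[1]        # everything after the last ':'
    version, _, flavor = tag.partition("-")  # split at the first '-'
    return version, flavor
```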
TL;DR: I would probably go for the quick solution, use the example that you linked, and put in the 2×2 combinations that would give us functionality for now. At some point I created a simple GPU test app, with which I would showcase this automation step in the documentation.
I am unsure what the best way is to translate requirements into settings. If we have a requirement of `gpu.amd`, this means we need to add the `--rocm` flag to the start command for a simple setup. This translation could happen on an "Arkitekt level", which would give us sensible defaults. But to allow for exceptions (maybe one needs more fine-grained control over the settings), some overwrite mechanism that allows overriding the Arkitekt default would be nice.