alexschroeter / apptainer-deployer


How to handle GPU flavor settings #2

Open alexschroeter opened 1 month ago

alexschroeter commented 1 month ago

I am unsure about the best way to translate requirements into settings.

If we have a requirement of `gpu.amd`, this means we need to add the `--rocm` flag to the start command for a simple setup. This translation could happen at an "Arkitekt level", which would allow:

  1. Updates to the Interface to easily propagate without needing to update all Apps
  2. Settings would be consistent throughout Arkitekt

But to allow for exceptions (maybe one needs more fine-grained control over the settings), some override mechanism that allows overriding the Arkitekt default would be nice.
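To make the idea concrete, here is a minimal sketch of such a translation with an override hook. The requirement keys and the function/dict names are assumptions for illustration, not an existing Arkitekt API; only the apptainer flags (`--rocm`, `--nv`) are real.

```python
# Arkitekt-wide defaults: requirement -> apptainer run flags.
# "--rocm" (AMD) and "--nv" (NVIDIA) are real apptainer flags;
# the requirement keys are hypothetical.
DEFAULT_FLAGS = {
    "gpu.amd": ["--rocm"],
    "gpu.nvidia": ["--nv"],
}

def resolve_flags(requirements, overrides=None):
    """Return run flags for the given requirements.

    `overrides` maps a requirement key to replacement flags, letting a
    single app opt out of the Arkitekt-wide default when it needs
    more fine-grained control.
    """
    overrides = overrides or {}
    flags = []
    for req in requirements:
        flags.extend(overrides.get(req, DEFAULT_FLAGS.get(req, [])))
    return flags
```

An app that needs extra isolation could then pass e.g. `resolve_flags(["gpu.amd"], overrides={"gpu.amd": ["--rocm", "--containall"]})` while everything else keeps the default.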

jhnnsrs commented 1 month ago

Yes, this is a bit of a tricky issue. I was hoping there was an open standard for "node selectors"/"node affinity": https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/, but I couldn't really find one. Maybe some preliminary selectors would be a great idea, similar to how requirements for services are now implemented. Here is a draft for this: https://github.com/jhnnsrs/arkitekt_next/blob/main/arkitekt_next/cli/types.py

Contrary to what is outlined as "build_docker_params" there, I don't believe this should be handled by the library itself but by the engine, i.e. this app, trying to inspect the selectors and choose which params to pass. This would allow us to be backwards compatible with different versions of the Docker and Apptainer APIs (because these fuckers change all the time :D). What do you think?

These could then be translated to the underlying engine, I imagine.

alexschroeter commented 1 month ago

> Yes, this is a bit of a tricky issue. I was hoping there was an open standard for "node selectors"/"node affinity": https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/, but I couldn't really find one.

I only know of these node selectors for GPUs (https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).

> Maybe some preliminary selectors would be a great idea, similar to how requirements for services are now implemented. Here is a draft for this: https://github.com/jhnnsrs/arkitekt_next/blob/main/arkitekt_next/cli/types.py

I think these preliminary selectors will be quite sufficient for a while, and I would go with this solution until more practical examples give us guidance on common use cases. I imagine a basic matrix of one rather static setting per combination of container technology × GPU vendor will be sufficient for quite a while.
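Such a matrix could be as small as a dict keyed by (container technology, GPU vendor). A hypothetical sketch, where the table shape and lookup function are assumptions and the flag sets are the commonly used ones for each runtime:

```python
# Static matrix: (engine, vendor) -> run flags.
GPU_MATRIX = {
    ("apptainer", "nvidia"): ["--nv"],
    ("apptainer", "amd"): ["--rocm"],
    ("docker", "nvidia"): ["--gpus", "all"],
    # ROCm under plain docker typically needs the device nodes passed through:
    ("docker", "amd"): ["--device=/dev/kfd", "--device=/dev/dri"],
}

def gpu_flags(engine, vendor):
    """Look up the static GPU flags for an engine/vendor combination."""
    try:
        return GPU_MATRIX[(engine, vendor)]
    except KeyError:
        raise ValueError(f"no known GPU settings for {engine}/{vendor}")
```

Adding a new combination (say, a new runtime or vendor) is then just one more row in the table rather than new logic.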

> Contrary to what is outlined as "build_docker_params" there, I don't believe this should be handled by the library itself but by the engine, i.e. this app, trying to inspect the selectors and choose which params to pass. This would allow us to be backwards compatible with different versions of the Docker and Apptainer APIs (because these fuckers change all the time :D). What do you think?

So you would have each -deployer generate the settings from a flag that tells it we want access to the NVIDIA GPU? This would indeed have the advantage that the deployer could handle the complexities that come with different OS/GPU software versions and so on. On the other hand (and I am not sure if this is something we want), when running an App by invoking `arkitekt-next run prod <docker://jhnnsrs/container:version-gpu_flavor>` you wouldn't be able to determine the correct run parameters without the deployer.

TL;DR: I would probably go for the quick solution, use the example that you linked, and put in the 2x2 combinations that give us functionality for now. At some point I created a simple GPU test App, which I would use to demonstrate this automation step and showcase in the documentation.