arkitektio / arkitektio.github.io

The documentation for the Arkitekt Platform powered by docusaurus
https://arkitekt.live

Enhancements for Hardware Support in Docker Containers #2

Open jhnnsrs opened 8 months ago

Context: As the field of bioimaging continues to evolve, it encounters an increasingly diverse array of specialized hardware. This diversity offers powerful capabilities but also presents significant challenges in compatibility and optimization. Users often face complexities in installing and running bioimaging applications due to these varied hardware requirements.

Arkitekt's container system (Port) should bridge this gap by intelligently managing and assisting in the deployment of applications on the appropriate hardware.

Objectives:

  1. Simplify User Experience: Users should seamlessly install and run applications without needing deep knowledge of their underlying hardware dependencies.
  2. Optimize Resource Utilization: Efficiently use the available hardware by matching applications to the most suitable computational resources.
  3. Enhance Flexibility and Scalability: Accommodate a growing range of hardware options (NVIDIA, AMD, ARM, RISC-V) and configurations to future-proof against rapid technological advancements.

Current State and Limitations:

Port lacks a dedicated system for more advanced container (pod) scheduling based on the different hardware requirements of the underlying plugin container.

At present, Port schedules containers using direct calls to the Docker HTTP API, inserting flags based on the requirements stated in the app's manifest. This system is inflexible: it supports only a general "gpu" requirement, which triggers the equivalent of the "--gpus all" flag. This is a preliminary step towards hardware access, but it does not accommodate the diversity of hardware environments.

Example:

Consider a CUDA-enabled app like segmentor and a corresponding deployment of the app:

- app: segmentor
  version: 0.4.3
  image: jhnnsrs/segmentor:0.4.3
  requirements: ["gpu"]

The Port scheduler will run the container with

docker run --gpus all ..container name

This will fail in an environment without CUDA support and render the deployment uninstallable. The current workaround is for the user to install a separate "segmentor_cpu" app with its own release cycle and CPU-optimized images. This is far from user-friendly and does not fit the Arkitekt concept of versioned apps.
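As a rough illustration of the current behaviour, the sketch below turns the manifest's "gpu" requirement into a request body for the Docker Engine API's container-create endpoint. The function name `build_create_payload` is hypothetical and this is not Port's actual code. The `DeviceRequests` entry with `Count: -1` and the `gpu` capability is, to my understanding, what `docker run --gpus all` sends to the daemon.

```python
import json


def build_create_payload(image, requirements):
    """Build a body for POST /containers/create from manifest requirements.

    Hypothetical helper for illustration only; Port's real code may differ.
    """
    payload = {"Image": image, "HostConfig": {}}
    if "gpu" in requirements:
        # Count -1 asks for every available GPU, like `--gpus all`.
        payload["HostConfig"]["DeviceRequests"] = [
            {"Count": -1, "Capabilities": [["gpu"]]}
        ]
    return payload


# The deployment entry from the example above, as a plain dict.
deployment = {
    "app": "segmentor",
    "version": "0.4.3",
    "image": "jhnnsrs/segmentor:0.4.3",
    "requirements": ["gpu"],
}

payload = build_create_payload(deployment["image"], deployment["requirements"])
print(json.dumps(payload, indent=2))
```

Because the only switch is the presence of "gpu", there is no way to express a fallback: on a host without a GPU runtime the create or start call fails and nothing else is tried.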

Proposed Solution:

Introduce the concept of "Flavours" for App Containers. Each app version can have multiple flavours, each representing a container bundled with its specific hardware requirements. This way, apps can deploy different flavours optimized for various hardware setups like CUDA GPUs, AMD GPUs, or CPU-only environments.

Port will assist in selecting the correct flavour of an app's release based on the "container scheduler abilities". This will require a new format for the deployments file:

- app: segmentor
  version: 0.4.3
  flavours: 
    - cuda:
        - priority: 100
        - image: jhnnsrs/segmentor:0.4.3-cuda
        - requirements:
            - gpu_count: ">= 2"
            - nvidia_cuda: ">11"
    - cpu:
        - priority: 20
        - image: jhnnsrs/segmentor:0.4.3-cpu
        - requirements:
            - cpu_mhz: ">1400"

The scheduler will then select and schedule a container based on the available hardware and the priority rating of each flavour: among the flavours whose requirements the host satisfies, the one with the highest priority wins. This proposal closely follows the concept of Node Affinities and Node Selectors in Kubernetes, which will be included as a multi-node scheduling option in the next iteration of Port.
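The selection step could look roughly like the sketch below. It is only an illustration under several assumptions: the `flavours` structure is what a YAML parser would produce for the deployments file above (lists of single-key mappings), the host "abilities" are a plain dict with the same keys as the requirements, and the names `merge`, `satisfies` and `select_flavour` are hypothetical. Values are compared as plain numbers, which is a simplification for version strings.

```python
import operator
import re

# Comparison operators allowed in a requirement string such as ">= 2".
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       ">": operator.gt, "<": operator.lt}
REQ = re.compile(r"^\s*(>=|<=|==|>|<)\s*(\d+(?:\.\d+)?)\s*$")


def merge(items):
    """Flatten a list of single-key mappings into one dict."""
    merged = {}
    for item in items:
        merged.update(item)
    return merged


def satisfies(requirements, abilities):
    """Return True if the host abilities meet every requirement."""
    for key, expr in requirements.items():
        match = REQ.match(expr)
        if match is None:
            raise ValueError(f"unsupported requirement: {key} {expr!r}")
        compare, threshold = OPS[match.group(1)], float(match.group(2))
        value = abilities.get(key)
        # An ability the host does not report counts as unsatisfied.
        if value is None or not compare(float(value), threshold):
            return False
    return True


def select_flavour(flavours, abilities):
    """Return (name, spec) of the highest-priority satisfiable flavour, or None."""
    candidates = []
    for entry in flavours:
        for name, fields in entry.items():
            spec = merge(fields)
            requirements = merge(spec.get("requirements", []))
            if satisfies(requirements, abilities):
                candidates.append((spec.get("priority", 0), name, spec))
    if not candidates:
        return None
    _, name, spec = max(candidates, key=lambda c: c[0])
    return name, spec


# The flavours of segmentor 0.4.3 as parsed from the deployments file.
flavours = [
    {"cuda": [
        {"priority": 100},
        {"image": "jhnnsrs/segmentor:0.4.3-cuda"},
        {"requirements": [{"gpu_count": ">= 2"}, {"nvidia_cuda": ">11"}]},
    ]},
    {"cpu": [
        {"priority": 20},
        {"image": "jhnnsrs/segmentor:0.4.3-cpu"},
        {"requirements": [{"cpu_mhz": ">1400"}]},
    ]},
]

# Hypothetical hosts; how Port would detect these values is not specified here.
gpu_host = {"gpu_count": 2, "nvidia_cuda": 12.2, "cpu_mhz": 3200}
cpu_host = {"cpu_mhz": 2400}

print(select_flavour(flavours, gpu_host)[0])
print(select_flavour(flavours, cpu_host)[0])
```

With these inputs the GPU host gets the cuda flavour and the CPU-only host falls back to the cpu flavour. A host that satisfies neither set of requirements gets `None`, which the scheduler could report as "not installable on this hardware" instead of failing at container start.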

Raised Questions:

Why a spec-level change?

With multiple hardware and software companies designing their own specialized silicon, a future-proof manifest format needs to address multi-platform compatibility in a unified, bioimage-focused specification.

Why not use node affinities directly?

While Kubernetes provides an excellent blueprint and should be a focus, relying entirely on node labels for scheduling may lead to overly specific requirements that are hard to map in different environments. Instead, this should be managed by the scheduling backend in Port to maintain flexibility and broad compatibility.

This enhancement aims to make bioimaging software more accessible and efficient, in line with the diverse and specialized nature of modern computational hardware. Feedback and suggestions on this proposal are very welcome.