gppm

gppm power process manager

gppm is designed for use with llama.cpp and NVIDIA Tesla P40 GPUs. On its own, llama.cpp currently cannot reduce the power consumption of these GPUs while they are idle. A patch for llama.cpp exists, but it switches the performance mode of all GPUs at once, which can disrupt setups where multiple llama.cpp instances share one or more GPUs. Implementing a communication mechanism within llama.cpp to coordinate task distribution and GPU status would be complex, so gppm solves the problem externally. gppm also lets you define llama.cpp instances as code, so that they are spawned, terminated, and respawned automatically.

[!NOTE] Both the configuration and the API will most likely continue to change for a while. When changing to a newer version, please always take a look at the current documentation.

How it works
Quickstart
Installation
Command line interface
Configuration

How it works

gppm uses nvidia-pstate under the hood, which is what makes it possible to switch the performance state of P40 GPUs at all. gppm must be installed on the host where the GPUs are installed and llama.cpp is running. It monitors llama.cpp's output to recognize tasks and the GPUs on which llama.cpp runs them, and uses this information to change the performance state of the installed P40 GPUs accordingly. It can manage any number of GPUs and llama.cpp instances. gppm switches a GPU to a low performance state as soon as none of the llama.cpp instances is running a task on that particular GPU, and back to a high performance state as soon as the next task is about to run. This way gppm controls all GPUs independently of each other. Because gppm is designed as a wrapper, all llama.cpp instances are configured in one place.
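The switching rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not gppm's actual code: the class name `GpuStateTracker`, its methods, and the `set_state` callback (which would wrap whatever nvidia-pstate call is used) are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set, Tuple


class GpuStateTracker:
    """Counts running tasks per GPU and switches a GPU only when its state changes."""

    def __init__(self, set_state: Callable[[int, str], None]) -> None:
        # Hypothetical callback: set_state(gpu_index, "high" | "low").
        self._set_state = set_state
        # GPU index -> set of (instance name, task id) currently running on it.
        self._active: Dict[int, Set[Tuple[str, int]]] = defaultdict(set)

    def task_started(self, instance: str, task_id: int, gpus: Iterable[int]) -> None:
        for gpu in gpus:
            was_idle = not self._active[gpu]
            self._active[gpu].add((instance, task_id))
            if was_idle:
                # First task on an idle GPU: raise its performance state.
                self._set_state(gpu, "high")

    def task_finished(self, instance: str, task_id: int, gpus: Iterable[int]) -> None:
        for gpu in gpus:
            key = (instance, task_id)
            if key not in self._active[gpu]:
                continue  # Unknown task for this GPU, nothing to do.
            self._active[gpu].remove(key)
            if not self._active[gpu]:
                # No instance is using this GPU any more: lower its state.
                self._set_state(gpu, "low")
```

Because the bookkeeping is kept per GPU, a GPU shared by two instances stays in the high state until both have finished their tasks, while the other GPUs can already drop to the low state.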

Quickstart

Clone the repository and cd into it:

git clone https://github.com/crashr/gppm
cd gppm

Edit the following files to your needs:

gppmd/config.yaml
gppmd/llamacpp_configs/examples.yaml

In a separate terminal, run nvidia-smi to monitor the llama.cpp instances we are going to run:

watch -n 0.1 nvidia-smi

Run the gppm daemon:

python3 gppmd/gppmd.py --config ./gppmd/config.yaml --llamacpp_configs_dir ./gppmd/llamacpp_configs

Wait for the instances to show up in the nvidia-smi terminal. gppm ships with a command line client (see details below). In another terminal, run the cli like this to list the instances you just started:

python3 gppmc/gppmc.py get instances

Installation

Build binaries and DEB package

./tools/build_gppmd_deb.sh
./tools/build_gppmc_deb.sh

You should now find binaries for the daemon and the cli in the build folder:

ls -1 build/gppmd-$(git describe --tags --abbrev=0)-amd64/usr/bin/gppmd
ls -1 build/gppmc-$(git describe --tags --abbrev=0)-amd64/usr/bin/gppmc

Copy them wherever you want or install the DEB packages (described in the next step):

ls -1 build/*.deb

Install DEB package

The DEB packages are tested on the following distributions:

Ubuntu 22.04

Install the DEB packages like this:

sudo dpkg -i build/gppmd-$(git describe --tags --abbrev=0)-amd64.deb
sudo dpkg -i build/gppmc-$(git describe --tags --abbrev=0)-amd64.deb

gppmd expects its config file at /etc/gppmd/config.yaml, so put your config there. It can be as minimal as this:

host: '0.0.0.0'
port: 5001

gppmd looks for llama.cpp config files in /etc/gppmd/llamacpp_configs, so put your configs there (see below for a detailed explanation of how the configuration works).

Enable and run the daemon:

sudo systemctl enable --now gppmd.service

Command line interface

gppm comes with a cli client. It provides basic functionality for interacting with the daemon:

$ gppmc
Usage: gppmc [OPTIONS] COMMAND [ARGS]...

  Group of commands for managing llama.cpp instances and configurations.

Options:
  --host TEXT     The host to connect to.
  --port INTEGER  The port to connect to.
  --help          Show this message and exit.

Commands:
  apply    Apply LlamaCpp configurations from a YAML file.
  disable  Disable a LlamaCpp instance.
  enable   Enable a LlamaCpp instance.
  get      Get various resources.
  reload   Reload LlamaCpp configurations.

For usage examples, take a look at the configuration section.

Configuration

After changing llama.cpp instance configuration files, they can be reloaded with the cli:

gppmc reload

This affects only instances whose configs were changed. All other instances remain untouched.
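To illustrate what reloading only the changed instances means, here is a minimal Python sketch that compares two lists of instance configs by their name. It is an assumption about how such a comparison could work, not gppmd's implementation; the function name `diff_instances` is hypothetical, and the configs are plain dicts as they might look after loading the YAML files.

```python
from typing import Dict, List, Tuple

InstanceConfig = Dict[str, object]


def diff_instances(
    old: List[InstanceConfig], new: List[InstanceConfig]
) -> Tuple[List[str], List[str], List[str]]:
    """Return the names of (added, changed, removed) instances."""
    old_by_name = {str(cfg["name"]): cfg for cfg in old}
    new_by_name = {str(cfg["name"]): cfg for cfg in new}

    added = sorted(name for name in new_by_name if name not in old_by_name)
    removed = sorted(name for name in old_by_name if name not in new_by_name)
    # An instance counts as changed if any of its settings differ.
    changed = sorted(
        name
        for name in new_by_name
        if name in old_by_name and new_by_name[name] != old_by_name[name]
    )
    return added, changed, removed
```

Only the instances named in the three lists would need to be started, restarted, or stopped. Every instance whose config is identical in both lists keeps running.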

The most basic configuration for a llama.cpp instance looks like this:

- name: Biggie_SmolLM_0.15B_Base_q8_0_01
  enabled: True
  env:
    CUDA_VISIBLE_DEVICES: "0"
  command:
    "/usr/local/bin/llama-server \
      --host 0.0.0.0 \
      -ngl 100 \
      -m /models/Biggie_SmolLM_0.15B_Base_q8_0.gguf \
      --port 8061 \
      -sm none \
      --no-mmap \
      --log-format json" # Remove this for version >=1.2.0

To enable gppmd to perform power state switching with NVIDIA Tesla P40 GPUs, it is essential to specify CUDA_VISIBLE_DEVICES and the JSON log format.
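A config can be checked for the CUDA_VISIBLE_DEVICES requirement before it is deployed. The following Python sketch is a hypothetical helper, not part of gppm: `visible_gpus` takes one instance config as a plain dict and assumes the devices are given as comma-separated integer indices, as in the example above.

```python
from typing import Dict, List


def visible_gpus(instance: Dict) -> List[int]:
    """Return the GPU indices an instance is pinned to, or raise if none are set."""
    env = instance.get("env") or {}
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None or str(raw).strip() == "":
        name = instance.get("name", "<unnamed>")
        raise ValueError(f"instance {name!r} does not set CUDA_VISIBLE_DEVICES")
    # Assumption: plain integer indices such as "0" or "0,1,2" (no GPU UUIDs).
    return [int(part) for part in str(raw).split(",") if part.strip()]
```

For the basic configuration shown above, the helper would return a list containing only GPU 0.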

gppm lets you configure post-launch hooks, which makes it possible to bundle complex setups. As an example, the following configuration creates a setup consisting of two llama.cpp instances running Codestral on three GPUs behind a load balancer. Paddler is used as the load balancer:

- name: "Codestral-22B-v0.1-Q8_0 (paddler balancer)"
  enabled: True
  command:
    "/usr/local/bin/paddler balancer \
      --management-host 0.0.0.0 \
      --management-port 8085 \
      --management-dashboard-enable=true \
      --reverseproxy-host 192.168.178.56 \
      --reverseproxy-port 8081"

- name: "Codestral-22B-v0.1-Q8_0 (llama.cpp 01)"
  enabled: True
  env:
    CUDA_VISIBLE_DEVICES: "0,1,2"
  command:
    "/usr/local/bin/llama-server \
      --host 0.0.0.0 \
      -ngl 100 \
      -m /models/Codestral-22B-v0.1-Q8_0.gguf \
      --port 8082 \
      -fa \
      -sm row \
      -mg 0 \
      --no-mmap \
      --slots \
      --log-format json" # Remove this for version >=1.2.0
  post_launch_hooks:
  - name: Codestral-22B-v0.1-Q8_0_(paddler_01)
    enabled: True
    command:
      "/usr/local/bin/paddler agent \
        --name 'Codestral-22B-v0.1-Q8_0 (llama.cpp 01)' \
        --external-llamacpp-host 192.168.178.56 \
        --external-llamacpp-port 8082 \
        --local-llamacpp-host 192.168.178.56 \
        --local-llamacpp-port 8082 \
        --management-host 192.168.178.56 \
        --management-port 8085"

- name: "Codestral-22B-v0.1-Q8_0_(llama.cpp_02)"
  enabled: True
  env:
    CUDA_VISIBLE_DEVICES: "0,1,2"
  command:
    "/usr/local/bin/llama-server \
      --host 0.0.0.0 \
      -ngl 100 \
      -m /models/Codestral-22B-v0.1-Q8_0.gguf \
      --port 8083 \
      -fa \
      -sm row \
      -mg 1 \
      --no-mmap \
      --log-format json" # Remove this for version >=1.2.0
  post_launch_hooks:
  - name: "Codestral-22B-v0.1-Q8_0_Paddler_02"
    enabled: True
    command:
      "/usr/local/bin/paddler agent \
        --name 'Codestral-22B-v0.1-Q8_0 (llama.cpp 02)' \
        --external-llamacpp-host 192.168.178.56 \
        --external-llamacpp-port 8083 \
        --local-llamacpp-host 192.168.178.56 \
        --local-llamacpp-port 8083 \
        --management-host 192.168.178.56 \
        --management-port 8085"
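The relationship between an instance and its post-launch hooks can be sketched as an ordered launch plan. The Python below is an illustration under assumptions, not gppmd's code: `launch_plan` is a hypothetical helper, it assumes each command has already been parsed from YAML into a single-line string, that commands are split into arguments with shell-like quoting rules, and that hooks with `enabled` set to false are skipped.

```python
import shlex
from typing import Dict, List


def launch_plan(instance: Dict) -> List[List[str]]:
    """Return argument lists in launch order: the instance first, then its hooks."""
    # shlex.split honours quotes, so a quoted --name value stays one argument.
    plan = [shlex.split(instance["command"])]
    for hook in instance.get("post_launch_hooks") or []:
        if hook.get("enabled", False):  # Assumption: disabled hooks are skipped.
            plan.append(shlex.split(hook["command"]))
    return plan
```

This also shows why the Paddler agent names in the example are wrapped in single quotes: without them, a name containing spaces and parentheses would be split into several arguments.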

More to come soon
