JBGruber / rollama

https://jbgruber.github.io/rollama/

Add load balancing #17

Open JBGruber opened 5 months ago

JBGruber commented 5 months ago

The same approach implemented in #16 could also be used to send requests to multiple Ollama servers at once to process requests in parallel. There are at least two approaches we could follow:

  1. naive: we distribute requests equally among servers and wait for all responses.
  2. advanced: we send a few requests to each server and then poll which instance has returned responses. As soon as a server has fewer than x open requests in the queue, we send more.

In 1., the total run time would be determined by the slowest instance. 2. would be much more efficient in scenarios with a mix of fast and slow machines, but also harder to implement.
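For illustration only, here is a minimal sketch of what the naive approach (1.) could look like, independent of rollama's internals. The server URLs are made up and send_chunk() is a dummy stand-in for the actual request; the point is just how prompts could be split across servers and why the slowest server determines the total run time.

servers <- c("http://localhost:11434/", "http://192.168.2.45:11434/")  # example URLs
prompts <- paste("Classify review", 1:10)

# hypothetical stand-in for sending one chunk of prompts to one server
send_chunk <- function(chunk, server) {
  Sys.sleep(runif(1, 0.1, 0.5))  # pretend the server needs some time
  paste("answer to:", chunk)
}

# naive split: prompt i goes to server (i mod number of servers)
chunks <- split(prompts, rep_len(seq_along(servers), length(prompts)))

# Map() only returns once every chunk has been answered, so the slowest
# server sets the total run time
results <- unlist(Map(send_chunk, chunks, servers), use.names = FALSE)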

JBGruber commented 1 month ago

This works now in the output branch. I opted for something between the naive and the advanced approach: when you supply a vector of servers, you can give each one a name corresponding to the share of requests that server should fulfill. So c("0.6" = "http://localhost:11434/", "0.4" = "http://192.168.2.45:11434/") will hand 60% of the requests to localhost and 40% to the remote computer. It's pretty quick:

library(rollama)
library(tidyverse)

reviews_df <- read_csv("https://raw.githubusercontent.com/AFAgarap/ecommerce-reviews-analysis/master/Womens%20Clothing%20E-Commerce%20Reviews.csv",
                       show_col_types = FALSE) |> 
  sample_n(500)
#> New names:
#> • `` -> `...1`

make_query <- function(t) {
  tribble(
    ~role,    ~content,
    "system", "You assign texts into categories. Answer with just the correct category, which is either {positive}, {neutral} or {negative}.",
    "user", t
  )
}

start <- Sys.time()
reviews_df_annotated <- reviews_df |> 
  mutate(query = map(`Review Text`, make_query),
         category = query(query, screen = FALSE,
                          model = "llama3.2:3b-instruct-q8_0", 
                          server = c("0.6" = "http://localhost:11434/", 
                                     "0.4" = "http://192.168.2.45:11434/"), 
                          output = "text"))
stop <- Sys.time()
stop - start
#> Time difference of 18.19546 secs

Created on 2024-10-18 with reprex v2.1.0
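To make the split concrete (illustrative arithmetic only, not rollama's actual bookkeeping, and exact rounding may differ): with 500 sampled reviews, the 0.6/0.4 shares correspond to roughly 300 requests for localhost and 200 for the remote machine.

# rough intuition for the named shares
shares <- c("http://localhost:11434/" = 0.6, "http://192.168.2.45:11434/" = 0.4)
round(shares * 500)  # ~300 requests to localhost, ~200 to the remote server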

bshor commented 2 weeks ago

I'm trying to understand (and implement) this.

Would this be the functional equivalent of having two GPUs (for example) in one system? That is, could they handle a much larger model through their combined VRAM?

Or is this merely going to split the requests at the ratios you select, so that everything processes quicker, but without taking advantage of the larger combined VRAM?

Or is this an implementation of Ollama's new parallel request feature?

JBGruber commented 2 weeks ago

AFAIK there is no way to combine the VRAM of consumer GPUs. This is indeed just using parallel requests: if you have multiple machines (or multiple GPUs running separate instances of Ollama, I guess), you can divide the queue among them. E.g., you have two PCs with a GPU and one laptop. The laptop will be slow but could still fulfill some of the requests if you have thousands.
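To make that concrete, the weighted-server interface from the reprex above could express exactly that setup. The addresses below are hypothetical and the shares are only a guess at each machine's relative throughput; the call pattern otherwise mirrors the reprex.

library(rollama)
library(purrr)
library(tibble)

# hypothetical addresses; shares reflect a guess at relative speed
servers <- c("0.45" = "http://192.168.2.10:11434/",  # desktop with GPU
             "0.45" = "http://192.168.2.11:11434/",  # second desktop with GPU
             "0.10" = "http://192.168.2.12:11434/")  # laptop, CPU only

texts <- c("Great fit, would buy again.", "Fabric felt cheap.", "It is okay.")
queries <- map(texts, \(t) tribble(
  ~role,    ~content,
  "system", "Answer with just positive, neutral or negative.",
  "user",   t
))

# same call pattern as in the reprex above: each query is a small message
# data frame and the named server vector divides the queue
answers <- query(queries, model = "llama3.2:3b-instruct-q8_0",
                 server = servers, screen = FALSE, output = "text")

With thousands of requests, even the slow laptop's small share helps chip away at the queue while the GPU machines stay busy.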

bshor commented 2 weeks ago

OK, but what do you mean by queue? The requests within a single Ollama call that may just take a long time to complete, or individual calls that are kept in a list or something?