JBGruber opened 5 months ago
This works now in the output branch. I opted for something between naive and advanced: when you supply a vector of servers, you can name each element with the share of requests that server should fulfill. So c("0.6" = "http://localhost:11434/", "0.4" = "http://192.168.2.45:11434/") will hand 60% of the requests to localhost and 40% to the remote machine; one way the shares could map to individual requests is sketched after the reprex. It's pretty quick:
library(rollama)
library(tidyverse)
reviews_df <- read_csv("https://raw.githubusercontent.com/AFAgarap/ecommerce-reviews-analysis/master/Womens%20Clothing%20E-Commerce%20Reviews.csv",
                       show_col_types = FALSE) |>
  sample_n(500)
#> New names:
#> • `` -> `...1`
make_query <- function(t) {
  tribble(
    ~role,    ~content,
    "system", "You assign texts into categories. Answer with just the correct category, which is either {positive}, {neutral} or {negative}.",
    "user",   t
  )
}
start <- Sys.time()
reviews_df_annotated <- reviews_df |>
  mutate(query = map(`Review Text`, make_query),
         category = query(query, screen = FALSE,
                          model = "llama3.2:3b-instruct-q8_0",
                          server = c("0.6" = "http://localhost:11434/",
                                     "0.4" = "http://192.168.2.45:11434/"),
                          output = "text"))
stop <- Sys.time()
stop - start
#> Time difference of 18.19546 secs
Created on 2024-10-18 with reprex v2.1.0
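For intuition, here is a minimal sketch of how the named shares could be turned into a per-request server assignment. The helper assign_servers() is purely illustrative and not necessarily how the output branch implements the split:

assign_servers <- function(servers, n_requests) {
  shares <- as.numeric(names(servers))
  shares <- shares / sum(shares)                       # normalise so shares sum to 1
  counts <- floor(shares * n_requests)                 # requests per server
  counts[1] <- counts[1] + (n_requests - sum(counts))  # hand any rounding remainder to the first server
  rep(unname(servers), times = counts)                 # one server URL per request
}

assign_servers(
  c("0.6" = "http://localhost:11434/", "0.4" = "http://192.168.2.45:11434/"),
  n_requests = 10
)
# -> 6 localhost URLs followed by 4 URLs of the remote machine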
I'm trying to understand (and implement) this.
Would this be the functional equivalent of having two GPUs (for example) in one system? That is, could they handle a much larger model through their combined VRAM?
Or is this merely going to split a single request at the ratios you select, so it just processes everything quicker but doesn't take advantage of the larger combined VRAM?
Or is this an implementation of Ollama's new parallel request feature?
AFAIK there is no way to combine the VRAM of consumer GPUs. This is indeed just using parallel requests: if you have multiple machines (or GPUs running multiple instances of Ollama, I guess), you can divide the queue between them. E.g., you have two PCs with a GPU and one laptop. The laptop will be slow but could still fulfill some of the requests if you have thousands.
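In terms of the server argument shown above, that setup could look roughly like this (the URLs and share values are invented; the slower laptop simply gets a smaller share of the queue):

library(rollama)

# reuse make_query() from the reprex above on a few example texts
queries <- lapply(c("Love this dress!", "Fabric feels cheap.", "Fits as expected."),
                  make_query)

answers <- query(queries, screen = FALSE,
                 model = "llama3.2:3b-instruct-q8_0",
                 server = c("0.45" = "http://192.168.2.10:11434/",  # PC 1 with GPU
                            "0.45" = "http://192.168.2.11:11434/",  # PC 2 with GPU
                            "0.10" = "http://192.168.2.20:11434/"), # laptop gets fewer requests
                 output = "text")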
Ok, but in what sense do you mean queue? Within the context of a single Ollama call that may just take a long time to complete, or individual calls that are kept in a list or something?
The same approach implemented in #16 could also be used to send requests to multiple Ollama servers at once and process them in parallel. There are at least two approaches we could follow:

1. Split the requests between the servers up front (e.g., according to a fixed share per server) and send each batch off in parallel.
2. Keep a single queue and hand each request to whichever server becomes free next.

In 1., the total run time would be determined by the slowest instance. 2. would be much more efficient in scenarios with a mix of fast and slow machines, but it is also harder to implement; a rough sketch of the idea follows below.
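Purely as an illustration of approach 2 (not rollama code): base R's parallel package already provides this kind of dynamic scheduling. clusterApplyLB() hands the next item to whichever worker becomes free, so a worker tied to a fast server automatically ends up answering more prompts. The ask_server() helper is a made-up stand-in for sending an actual request:

library(parallel)

servers <- c("http://localhost:11434/", "http://192.168.2.45:11434/")
prompts <- sprintf("Categorise review number %d", 1:20)  # placeholder prompts

# hypothetical stand-in for sending one prompt to one Ollama server,
# e.g. something like rollama::query(prompt, server = server) in real code
ask_server <- function(prompt, server) {
  Sys.sleep(runif(1, 0.1, 0.5))  # simulate servers of different speed
  paste0("[", server, "] answered: ", prompt)
}

cl <- makeCluster(length(servers))  # one worker per Ollama server
clusterExport(cl, "ask_server")
# tell each worker which server it is responsible for
invisible(clusterApply(cl, servers, function(s) assign("my_server", s, envir = .GlobalEnv)))

# load-balanced dispatch: whichever worker is free takes the next prompt,
# so faster servers end up processing more of the queue
answers <- clusterApplyLB(cl, prompts, function(p) ask_server(p, my_server))
stopCluster(cl)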