For reference, here is a graph of koboldcpp, which generates text in "chunks" for its streaming mode:
Also, I can upload these profile files if anyone wants to take a peek for themselves, just ask. :+1:
The next step is to profile the low-level API and to profile both APIs with py-spy, but I don't have either set up yet.
@AlphaAtlas thank you for the detailed report. Aside from py-spy, it may also be useful to profile memory allocation / usage.
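For the memory side, a minimal sketch of that kind of check using the standard library's tracemalloc (the model path, prompt, and parameters are placeholders) could be:

```python
import tracemalloc
from llama_cpp import Llama

tracemalloc.start()

# Placeholder path/params; any small model works for a quick allocation check.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_threads=4)
llm("Write a short story about llamas.", max_tokens=100)

# Top Python-side allocation sites; allocations made inside the C++ library won't show up here.
for stat in tracemalloc.take_snapshot().statistics("lineno")[:10]:
    print(stat)
```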
My guess is that it has to do with memory allocations; that's one area where I've probably been a little too lazy / unprincipled. Specifically, I suspect it's the allocation / freeing of new candidates arrays between calls to the llama.cpp sampling functions, which should really only be done once.
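A rough sketch of that idea, reusing the ctypes candidates types from the bindings (llama_token_data, llama_token_data_array, llama_token_data_p, the latter visible in the profiles below); the wrapper class itself is purely illustrative:

```python
import ctypes
import llama_cpp

class CandidatesBuffer:
    """Illustrative only: allocate the candidates array once, then reuse it every sample."""

    def __init__(self, n_vocab: int):
        self.n_vocab = n_vocab
        # One llama_token_data (id, logit, p) per vocab entry, allocated a single time.
        self.data = (llama_cpp.llama_token_data * n_vocab)()
        self.array = llama_cpp.llama_token_data_array()
        self.array.data = ctypes.cast(self.data, llama_cpp.llama_token_data_p)
        self.array.size = n_vocab
        self.array.sorted = False

    def refresh(self, logits) -> None:
        # Per-token work becomes refreshing fields in place; no new arrays are built or freed.
        for i, logit in enumerate(logits):
            self.data[i].id = i
            self.data[i].logit = logit
            self.data[i].p = 0.0
        self.array.sorted = False
        self.array.size = self.n_vocab
```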
@AlphaAtlas If you compare timings between llama.cpp and llama-cpp-python with CPU only do you see a difference?
I did some tests with your script as that's a better way to test the lib itself directly. Here are my results; I modified the script to run on 4 threads as that's what my computer can do. I also used a longer prompt to test both ingestion and generation. All tests were done on 7B q4_0 with OpenBLAS.
original llama.cpp
llama_print_timings: load time = 28131.71 ms
llama_print_timings: sample time = 341.41 ms / 100 runs ( 3.41 ms per token)
llama_print_timings: prompt eval time = 26056.49 ms / 353 tokens ( 73.81 ms per token)
llama_print_timings: eval time = 25333.35 ms / 99 runs ( 255.89 ms per token)
llama_print_timings: total time = 53818.36 ms
llama-cpp-python with script
llama_print_timings: load time = 13476.72 ms
llama_print_timings: sample time = 65.66 ms / 100 runs ( 0.66 ms per token)
llama_print_timings: prompt eval time = 13476.50 ms / 125 tokens ( 107.81 ms per token)
llama_print_timings: eval time = 22492.65 ms / 99 runs ( 227.20 ms per token)
llama_print_timings: total time = 41234.30 ms
I ran both of these examples multiple times and the results above are representative of what I saw on average. What's interesting is that llama-cpp-python is slightly faster in generation, possibly due to lack of streaming.
With the webui, eval time is still in the order of ~350ms a token.
Here are webui results with streaming turned on and off as a comparison. Without streaming webui results are matching what I'm seeing with @AlphaAtlas's script.
Streaming on:
llama_print_timings: eval time = 25196.29 ms / 77 runs ( 327.22 ms per token)
Streaming off:
llama_print_timings: eval time = 22490.24 ms / 99 runs ( 227.17 ms per token)
I think you have to take the results with a grain of salt, as the timings for me were almost exactly the same even though llama-cpp-python was visibly slower. Maybe llama.cpp is "sleeping" some of the time and hence that time is not part of its performance metrics?
Not sure about the llama.cpp metrics difference... Maybe Python is eating some CPU time doing something, which would be more visible on a 4 thread machine.
@eiery @AlphaAtlas the llama.cpp timings only cover the portion of time spent in the llama.cpp eval and sample library functions, not the rest of the program. The flamegraph would give you a better indication of total times, but it could be susceptible to errors from the sampling method.
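To capture the Python-side overhead as well, one option is to compare wall-clock time around the whole generation rather than relying only on llama_print_timings; a quick sketch, assuming the high-level API's OpenAI-style response dict (paths and parameters are placeholders):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_threads=4)

start = time.perf_counter()
out = llm("Write a short story about llamas.", max_tokens=100)
elapsed = time.perf_counter() - start

# llama_print_timings only covers time inside the eval/sample library calls;
# this number also includes everything the Python wrapper does in between.
n_generated = out["usage"]["completion_tokens"]
print(f"{elapsed:.2f} s total, {1000 * elapsed / n_generated:.1f} ms per generated token")
```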
Here is llama-cpp-python with gpu-layers set to 0; as expected, it still processes the prompt but doesn't use the GPU afterwards. It looks like it's still bound to a single thread in intervals:
Here is Functiontrace lined up with the CUDA profiler... sort of. The profiles don't start at the same time so it may be offset, but the intervals look about right:
Also all this was done on the current git commit.
I still need to test the low-level API and memory allocation. If benching it is still necessary, would you mind posting an example "benchmark" script for the low-level API?
Oh, and here is the full Python profile. You can open it with https://profiler.firefox.com.
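For reference, a rough stand-in for such a low-level benchmark script, pieced together from the binding calls visible in the profiles in this thread; exact function names and signatures vary between llama-cpp-python versions, so treat it as a sketch rather than a drop-in:

```python
import llama_cpp

MODEL_PATH = b"./models/7B/ggml-model-q4_0.bin"  # placeholder
N_THREADS = 4
N_PREDICT = 100

params = llama_cpp.llama_context_default_params()
ctx = llama_cpp.llama_init_from_file(MODEL_PATH, params)

prompt = b" Write a short story about llamas."
tokens = (llama_cpp.llama_token * 512)()
n_tokens = llama_cpp.llama_tokenize(ctx, prompt, tokens, len(tokens), True)

n_past = 0
for _ in range(N_PREDICT):
    # Raw eval plus a simple greedy pick from the last logits.
    llama_cpp.llama_eval(ctx, tokens, n_tokens, n_past, N_THREADS)
    n_past += n_tokens
    logits = llama_cpp.llama_get_logits(ctx)
    n_vocab = llama_cpp.llama_n_vocab(ctx)
    next_token = max(range(n_vocab), key=lambda i: logits[i])
    tokens[0] = next_token
    n_tokens = 1

llama_cpp.llama_print_timings(ctx)
llama_cpp.llama_free(ctx)
```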
@AlphaAtlas based on your latest screenshot, quite a bit of time is being spent on a list comprehension inside the sampling methods. I've pushed a commit that replaces this list comprehension with a simple assignment to a data structure I create once when the Llama object is initialized; I've profiled the change and it does seem to reduce sampling time in practice.
Let me know if it's any better for you.
Sorry my testing is so intermittent! The latest commit did improve the gap, but the "downtime" where most cores are idle is still there:
The test above ^ is on the latest commit with CUBLAS and a q4_0 model, and this time I am using the full 2000-token context instead of a short testing prompt.
Also, here are the full steps for testing (on Linux) in case anyone wants to try this out.
PATH="~/.cargo/bin:$PATH" nsys profile --gpu-metrics-device=0 venv/bin/python perftest.py
ncu-ui
You can just use functiontrace and a CPU graph if you don't use the Nvidia profiler.
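The perftest.py used above isn't reproduced in the thread; a minimal stand-in that exercises the same high-level API (model path, prompt, and parameters are placeholders chosen to roughly match the setup described elsewhere in this thread) would be something like:

```python
# perftest.py -- placeholder stand-in; adjust the model path and params to your setup
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/ggml-model-q4_0.bin",
    n_gpu_layers=100,
    n_threads=8,
    n_ctx=2048,
    use_mlock=True,
)

# Long prompt plus streamed generation, roughly mirroring the webui-style usage being profiled.
prompt = "Write a detailed story about a llama who learns to program. " * 20
for chunk in llm(prompt, max_tokens=200, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```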
@AlphaAtlas I think I've narrowed it down now: it looks like there's about a 20 ms pause per sample where the last logits are copied into the candidates token_data_array. Other than that, the majority of time is spent inside the C++ functions in the shared library. I'll work on removing this delay, though it's a little challenging: I can't just memcpy the logits because the candidate data is an array of structs.
Here's the line-by-line profiling I got by using line_profiler
(see the _sample function)
Wrote profile results to perftest.py.lprof
Timer unit: 1e-06 s
Total time: 45.4287 s
File: /home/andrei/Documents/llms/llama_cpp/llama.py
Function: eval at line 271
Line # Hits Time Per Hit % Time Line Contents
==============================================================
271 @profile
272 def eval(self, tokens: Sequence[int]):
273 """Evaluate a list of tokens.
274
275 Args:
276 tokens: The list of tokens to evaluate.
277 """
278 80 26.6 0.3 0.0 assert self.ctx is not None
279 80 18.0 0.2 0.0 n_ctx = self._n_ctx
280 83 129.1 1.6 0.0 for i in range(0, len(tokens), self.n_batch):
281 83 118.8 1.4 0.0 batch = tokens[i : min(len(tokens), i + self.n_batch)]
282 83 58.2 0.7 0.0 n_past = min(n_ctx - len(batch), len(self.eval_tokens))
283 83 19.0 0.2 0.0 n_tokens = len(batch)
284 83 45370154.8 546628.4 99.9 return_code = llama_cpp.llama_eval(
285 83 17.8 0.2 0.0 ctx=self.ctx,
286 83 375.8 4.5 0.0 tokens=(llama_cpp.llama_token * len(batch))(*batch),
287 83 34.3 0.4 0.0 n_tokens=llama_cpp.c_int(n_tokens),
288 83 20.4 0.2 0.0 n_past=llama_cpp.c_int(n_past),
289 83 32.1 0.4 0.0 n_threads=llama_cpp.c_int(self.n_threads),
290 )
291 83 241.6 2.9 0.0 if return_code != 0:
292 raise RuntimeError(f"llama_eval returned {return_code}")
293 # Save tokens
294 83 468.0 5.6 0.0 self.eval_tokens.extend(batch)
295 # Save logits
296 83 246.0 3.0 0.0 rows = n_tokens if self.params.logits_all else 1
297 83 56.7 0.7 0.0 n_vocab = self._n_vocab
298 83 82.6 1.0 0.0 cols = n_vocab
299 83 1145.8 13.8 0.0 logits_view = llama_cpp.llama_get_logits(self.ctx)
300 83 40884.4 492.6 0.1 logits = [logits_view[i * cols : (i + 1) * cols] for i in range(rows)]
301 83 14575.1 175.6 0.0 self.eval_logits.extend(logits)
Total time: 2.81234 s
File: /home/andrei/Documents/llms/llama_cpp/llama.py
Function: _sample at line 303
Line # Hits Time Per Hit % Time Line Contents
==============================================================
303 @profile
304 def _sample(
305 self,
306 last_n_tokens_data, # type: llama_cpp.Array[llama_cpp.llama_token]
307 last_n_tokens_size: llama_cpp.c_int,
308 top_k: llama_cpp.c_int,
309 top_p: llama_cpp.c_float,
310 temp: llama_cpp.c_float,
311 tfs_z: llama_cpp.c_float,
312 repeat_penalty: llama_cpp.c_float,
313 frequency_penalty: llama_cpp.c_float,
314 presence_penalty: llama_cpp.c_float,
315 mirostat_mode: llama_cpp.c_int,
316 mirostat_tau: llama_cpp.c_float,
317 mirostat_eta: llama_cpp.c_float,
318 penalize_nl: bool = True,
319 ):
320 80 54.6 0.7 0.0 assert self.ctx is not None
321 # assert len(self.eval_logits) > 0
322 80 27.4 0.3 0.0 n_vocab = self._n_vocab
323 80 21.8 0.3 0.0 n_ctx = self._n_ctx
324 80 53.6 0.7 0.0 top_k = llama_cpp.c_int(n_vocab) if top_k.value <= 0 else top_k
325 80 10.8 0.1 0.0 last_n_tokens_size = (
326 80 11.9 0.1 0.0 llama_cpp.c_int(n_ctx)
327 80 16.2 0.2 0.0 if last_n_tokens_size.value < 0
328 80 10.3 0.1 0.0 else last_n_tokens_size
329 )
330 80 37.6 0.5 0.0 logits = self.eval_logits[-1]
331 80 25.6 0.3 0.0 nl_logit = logits[self._token_nl]
332 80 17.6 0.2 0.0 candidates = self._candidates
333 2560080 713785.8 0.3 25.4 for i, (data, logit) in enumerate(zip(candidates.data, logits)):
334 2560080 704819.0 0.3 25.1 data.id = llama_cpp.llama_token(i)
335 2560080 655928.1 0.3 23.3 data.logit = llama_cpp.c_float(logit)
336 2560080 699865.4 0.3 24.9 data.p = llama_cpp.c_float(0.0)
337 80 55.8 0.7 0.0 candidates.sorted = llama_cpp.c_bool(False)
338 80 60.8 0.8 0.0 candidates.size = llama_cpp.c_size_t(n_vocab)
339 80 33448.7 418.1 1.2 llama_cpp.llama_sample_repetition_penalty(
340 80 26.4 0.3 0.0 ctx=self.ctx,
341 80 16.5 0.2 0.0 last_tokens_data=last_n_tokens_data,
342 80 12.1 0.2 0.0 last_tokens_size=last_n_tokens_size,
343 80 87.6 1.1 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
344 80 10.9 0.1 0.0 penalty=repeat_penalty,
345 )
346 80 285.0 3.6 0.0 llama_cpp.llama_sample_frequency_and_presence_penalties(
347 80 18.6 0.2 0.0 ctx=self.ctx,
348 80 39.3 0.5 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
349 80 18.9 0.2 0.0 last_tokens_data=last_n_tokens_data,
350 80 14.5 0.2 0.0 last_tokens_size=last_n_tokens_size,
351 80 16.5 0.2 0.0 alpha_frequency=frequency_penalty,
352 80 9.5 0.1 0.0 alpha_presence=presence_penalty,
353 )
354 80 16.7 0.2 0.0 if not penalize_nl:
355 candidates.data[self._token_nl].logit = llama_cpp.c_float(nl_logit)
356 80 64.3 0.8 0.0 if temp.value == 0.0:
357 return llama_cpp.llama_sample_token_greedy(
358 ctx=self.ctx,
359 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
360 )
361 80 27.4 0.3 0.0 elif mirostat_mode.value == 1:
362 mirostat_mu = llama_cpp.c_float(2.0 * mirostat_tau.value)
363 mirostat_m = llama_cpp.c_int(100)
364 llama_cpp.llama_sample_temperature(
365 ctx=self.ctx,
366 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
367 temp=temp,
368 )
369 return llama_cpp.llama_sample_token_mirostat(
370 ctx=self.ctx,
371 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
372 tau=mirostat_tau,
373 eta=mirostat_eta,
374 mu=llama_cpp.ctypes.byref(mirostat_mu), # type: ignore
375 m=mirostat_m,
376 )
377 80 25.2 0.3 0.0 elif mirostat_mode.value == 2:
378 mirostat_mu = llama_cpp.c_float(2.0 * mirostat_tau.value)
379 llama_cpp.llama_sample_temperature(
380 ctx=self.ctx,
381 candidates=llama_cpp.ctypes.pointer(candidates),
382 temp=temp,
383 )
384 return llama_cpp.llama_sample_token_mirostat_v2(
385 ctx=self.ctx,
386 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
387 tau=mirostat_tau,
388 eta=mirostat_eta,
389 mu=llama_cpp.ctypes.byref(mirostat_mu), # type: ignore
390 )
391 else:
392 80 2192.0 27.4 0.1 llama_cpp.llama_sample_top_k(
393 80 17.0 0.2 0.0 ctx=self.ctx,
394 80 27.9 0.3 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
395 80 13.0 0.2 0.0 k=top_k,
396 80 32.6 0.4 0.0 min_keep=llama_cpp.c_size_t(1),
397 )
398 80 159.8 2.0 0.0 llama_cpp.llama_sample_tail_free(
399 80 20.5 0.3 0.0 ctx=self.ctx,
400 80 34.4 0.4 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
401 80 19.6 0.2 0.0 z=tfs_z,
402 80 29.7 0.4 0.0 min_keep=llama_cpp.c_size_t(1),
403 )
404 80 105.2 1.3 0.0 llama_cpp.llama_sample_typical(
405 80 23.0 0.3 0.0 ctx=self.ctx,
406 80 25.6 0.3 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
407 80 25.2 0.3 0.0 p=llama_cpp.c_float(1.0),
408 80 22.2 0.3 0.0 min_keep=llama_cpp.c_size_t(1),
409 )
410 80 157.1 2.0 0.0 llama_cpp.llama_sample_top_p(
411 80 17.2 0.2 0.0 ctx=self.ctx,
412 80 23.9 0.3 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
413 80 14.5 0.2 0.0 p=top_p,
414 80 22.1 0.3 0.0 min_keep=llama_cpp.c_size_t(1),
415 )
416 80 130.3 1.6 0.0 llama_cpp.llama_sample_temperature(
417 80 16.8 0.2 0.0 ctx=self.ctx,
418 80 24.6 0.3 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
419 80 15.0 0.2 0.0 temp=temp,
420 )
421 80 192.2 2.4 0.0 return llama_cpp.llama_sample_token(
422 80 16.4 0.2 0.0 ctx=self.ctx,
423 80 24.9 0.3 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
424 )
Total time: 50.3015 s
File: /home/andrei/Documents/llms/llama_cpp/llama.py
Function: generate at line 473
Line # Hits Time Per Hit % Time Line Contents
==============================================================
473 @profile
474 def generate(
475 self,
476 tokens: Sequence[int],
477 top_k: int = 40,
478 top_p: float = 0.95,
479 temp: float = 0.80,
480 repeat_penalty: float = 1.1,
481 reset: bool = True,
482 frequency_penalty: float = 0.0,
483 presence_penalty: float = 0.0,
484 tfs_z: float = 1.0,
485 mirostat_mode: int = 0,
486 mirostat_tau: float = 5.0,
487 mirostat_eta: float = 0.1,
488 ) -> Generator[int, Optional[Sequence[int]], None]:
489 """Create a generator of tokens from a prompt.
490
491 Examples:
492 >>> llama = Llama("models/ggml-7b.bin")
493 >>> tokens = llama.tokenize(b"Hello, world!")
494 >>> for token in llama.generate(tokens, top_k=40, top_p=0.95, temp=1.0, repeat_penalty=1.1):
495 ... print(llama.detokenize([token]))
496
497 Args:
498 tokens: The prompt tokens.
499 top_k: The top-k sampling parameter.
500 top_p: The top-p sampling parameter.
501 temp: The temperature parameter.
502 repeat_penalty: The repeat penalty parameter.
503 reset: Whether to reset the model state.
504
505 Yields:
506 The generated tokens.
507 """
508 1 0.9 0.9 0.0 assert self.ctx is not None
509
510 1 0.9 0.9 0.0 if reset and len(self.eval_tokens) > 0:
511 longest_prefix = 0
512 for a, b in zip(self.eval_tokens, tokens[:-1]):
513 if a == b:
514 longest_prefix += 1
515 else:
516 break
517 if longest_prefix > 0:
518 if self.verbose:
519 print("Llama.generate: prefix-match hit", file=sys.stderr)
520 reset = False
521 tokens = tokens[longest_prefix:]
522 for _ in range(len(self.eval_tokens) - longest_prefix):
523 self.eval_tokens.pop()
524 try:
525 self.eval_logits.pop()
526 except IndexError:
527 pass
528
529 1 0.4 0.4 0.0 if reset:
530 1 2.3 2.3 0.0 self.reset()
531
532 while True:
533 80 45430708.1 567883.9 90.3 self.eval(tokens)
534 80 4870570.6 60882.1 9.7 token = self.sample(
535 80 28.4 0.4 0.0 top_k=top_k,
536 80 12.2 0.2 0.0 top_p=top_p,
537 80 20.7 0.3 0.0 temp=temp,
538 80 9.7 0.1 0.0 repeat_penalty=repeat_penalty,
539 80 28.9 0.4 0.0 frequency_penalty=frequency_penalty,
540 80 9.9 0.1 0.0 presence_penalty=presence_penalty,
541 80 19.5 0.2 0.0 tfs_z=tfs_z,
542 80 21.2 0.3 0.0 mirostat_mode=mirostat_mode,
543 80 26.4 0.3 0.0 mirostat_tau=mirostat_tau,
544 80 12.4 0.2 0.0 mirostat_eta=mirostat_eta,
545 )
546 80 11.6 0.1 0.0 tokens_or_none = yield token
547 79 37.6 0.5 0.0 tokens = [token]
548 79 22.1 0.3 0.0 if tokens_or_none is not None:
549 tokens.extend(tokens_or_none)
I think the solution is to just move to using numpy; it introduces an additional dependency, but it should reduce memory usage and speed up a few sections like this.
> I think the solution is to just move to using numpy; it introduces an additional dependency, but it should reduce memory usage and speed up a few sections like this.
Chances are high, whoever uses this package has numpy installed anyway in their environment.
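A hedged sketch of the kind of change being discussed: viewing the candidates buffer through a numpy structured dtype, so the per-token copy becomes a few vectorized assignments instead of a Python loop. This assumes llama_token_data packs as int32 id, float32 logit, float32 p with no padding; the helper name is illustrative.

```python
import numpy as np
import llama_cpp

def fill_candidates(candidates_data, logits, n_vocab):
    """Vectorized refresh of an existing (llama_cpp.llama_token_data * n_vocab) array."""
    # Strided view over the same memory; no per-element Python objects are created.
    dtype = np.dtype([("id", np.intc), ("logit", np.single), ("p", np.single)])
    view = np.frombuffer(candidates_data, dtype=dtype, count=n_vocab)
    view["id"] = np.arange(n_vocab, dtype=np.intc)
    view["logit"] = logits  # one strided copy of the last logits
    view["p"] = 0.0
```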
@AlphaAtlas got the numpy implementation working and it seems to improve the performance as expected; the PR (#277) is still open but I should be merging it soon.
Total time: 0.0551852 s
File: /home/andrei/Documents/llms/llama_cpp/llama.py
Function: _sample at line 345
Line # Hits Time Per Hit % Time Line Contents
==============================================================
345 @profile
346 def _sample(
347 self,
348 last_n_tokens_data, # type: llama_cpp.Array[llama_cpp.llama_token]
349 last_n_tokens_size: llama_cpp.c_int,
350 top_k: llama_cpp.c_int,
351 top_p: llama_cpp.c_float,
352 temp: llama_cpp.c_float,
353 tfs_z: llama_cpp.c_float,
354 repeat_penalty: llama_cpp.c_float,
355 frequency_penalty: llama_cpp.c_float,
356 presence_penalty: llama_cpp.c_float,
357 mirostat_mode: llama_cpp.c_int,
358 mirostat_tau: llama_cpp.c_float,
359 mirostat_eta: llama_cpp.c_float,
360 penalize_nl: bool = True,
361 logits_processor: Optional[LogitsProcessorList] = None,
362 ):
363 80 91.1 1.1 0.2 assert self.ctx is not None
364 80 75.4 0.9 0.1 assert len(self.eval_logits) > 0
365 80 89.2 1.1 0.2 assert self._scores.shape[0] > 0
366 80 31.9 0.4 0.1 n_vocab = self._n_vocab
367 80 35.1 0.4 0.1 n_ctx = self._n_ctx
368 80 64.9 0.8 0.1 top_k = llama_cpp.c_int(n_vocab) if top_k.value <= 0 else top_k
369 80 12.0 0.2 0.0 last_n_tokens_size = (
370 80 12.4 0.2 0.0 llama_cpp.c_int(n_ctx)
371 80 36.7 0.5 0.1 if last_n_tokens_size.value < 0
372 80 23.1 0.3 0.0 else last_n_tokens_size
373 )
374 80 86.9 1.1 0.2 logits: npt.NDArray[np.single] = self._scores[-1, :]
375
376 80 37.6 0.5 0.1 if logits_processor is not None:
377 logits = np.array(
378 logits_processor(self._input_ids.tolist(), logits.tolist()),
379 dtype=np.single,
380 )
381 self._scores[-1, :] = logits
382 self.eval_logits[-1] = logits.tolist()
383
384 80 98.0 1.2 0.2 nl_logit = logits[self._token_nl]
385 80 29.1 0.4 0.1 candidates = self._candidates
386 80 17.5 0.2 0.0 candidates_data = self._candidates_data
387 80 6221.9 77.8 11.3 candidates_data["id"] = np.arange(n_vocab, dtype=np.intc) # type: ignore
388 80 4137.7 51.7 7.5 candidates_data["logit"] = logits
389 80 3356.6 42.0 6.1 candidates_data["p"] = np.zeros(n_vocab, dtype=np.single)
390 80 1924.4 24.1 3.5 candidates.data = candidates_data.ctypes.data_as(llama_cpp.llama_token_data_p)
391 80 92.3 1.2 0.2 candidates.sorted = llama_cpp.c_bool(False)
392 80 88.9 1.1 0.2 candidates.size = llama_cpp.c_size_t(n_vocab)
393 80 34102.2 426.3 61.8 llama_cpp.llama_sample_repetition_penalty(
394 80 29.6 0.4 0.1 ctx=self.ctx,
395 80 17.7 0.2 0.0 last_tokens_data=last_n_tokens_data,
396 80 20.5 0.3 0.0 last_tokens_size=last_n_tokens_size,
397 80 64.2 0.8 0.1 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
398 80 11.9 0.1 0.0 penalty=repeat_penalty,
399 )
400 80 338.2 4.2 0.6 llama_cpp.llama_sample_frequency_and_presence_penalties(
401 80 25.8 0.3 0.0 ctx=self.ctx,
402 80 55.8 0.7 0.1 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
403 80 16.2 0.2 0.0 last_tokens_data=last_n_tokens_data,
404 80 26.2 0.3 0.0 last_tokens_size=last_n_tokens_size,
405 80 12.2 0.2 0.0 alpha_frequency=frequency_penalty,
406 80 11.8 0.1 0.0 alpha_presence=presence_penalty,
407 )
408 80 27.7 0.3 0.1 if not penalize_nl:
409 candidates.data[self._token_nl].logit = llama_cpp.c_float(nl_logit)
410 80 52.0 0.6 0.1 if temp.value == 0.0:
411 return llama_cpp.llama_sample_token_greedy(
412 ctx=self.ctx,
413 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
414 )
415 80 40.8 0.5 0.1 elif mirostat_mode.value == 1:
416 mirostat_mu = llama_cpp.c_float(2.0 * mirostat_tau.value)
417 mirostat_m = llama_cpp.c_int(100)
418 llama_cpp.llama_sample_temperature(
419 ctx=self.ctx,
420 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
421 temp=temp,
422 )
423 return llama_cpp.llama_sample_token_mirostat(
424 ctx=self.ctx,
425 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
426 tau=mirostat_tau,
427 eta=mirostat_eta,
428 mu=llama_cpp.ctypes.byref(mirostat_mu), # type: ignore
429 m=mirostat_m,
430 )
431 80 34.2 0.4 0.1 elif mirostat_mode.value == 2:
432 mirostat_mu = llama_cpp.c_float(2.0 * mirostat_tau.value)
433 llama_cpp.llama_sample_temperature(
434 ctx=self.ctx,
435 candidates=llama_cpp.ctypes.pointer(candidates),
436 temp=temp,
437 )
438 return llama_cpp.llama_sample_token_mirostat_v2(
439 ctx=self.ctx,
440 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
441 tau=mirostat_tau,
442 eta=mirostat_eta,
443 mu=llama_cpp.ctypes.byref(mirostat_mu), # type: ignore
444 )
445 else:
446 80 2287.8 28.6 4.1 llama_cpp.llama_sample_top_k(
447 80 44.3 0.6 0.1 ctx=self.ctx,
448 80 36.5 0.5 0.1 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
449 80 29.7 0.4 0.1 k=top_k,
450 80 33.4 0.4 0.1 min_keep=llama_cpp.c_size_t(1),
451 )
452 80 178.4 2.2 0.3 llama_cpp.llama_sample_tail_free(
453 80 22.2 0.3 0.0 ctx=self.ctx,
454 80 44.5 0.6 0.1 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
455 80 25.8 0.3 0.0 z=tfs_z,
456 80 29.7 0.4 0.1 min_keep=llama_cpp.c_size_t(1),
457 )
458 80 127.9 1.6 0.2 llama_cpp.llama_sample_typical(
459 80 25.1 0.3 0.0 ctx=self.ctx,
460 80 33.1 0.4 0.1 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
461 80 36.4 0.5 0.1 p=llama_cpp.c_float(1.0),
462 80 28.1 0.4 0.1 min_keep=llama_cpp.c_size_t(1),
463 )
464 80 166.2 2.1 0.3 llama_cpp.llama_sample_top_p(
465 80 31.3 0.4 0.1 ctx=self.ctx,
466 80 26.8 0.3 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
467 80 16.3 0.2 0.0 p=top_p,
468 80 25.3 0.3 0.0 min_keep=llama_cpp.c_size_t(1),
469 )
470 80 155.0 1.9 0.3 llama_cpp.llama_sample_temperature(
471 80 28.5 0.4 0.1 ctx=self.ctx,
472 80 31.3 0.4 0.1 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
473 80 15.9 0.2 0.0 temp=temp,
474 )
475 80 206.7 2.6 0.4 return llama_cpp.llama_sample_token(
476 80 23.3 0.3 0.0 ctx=self.ctx,
477 80 25.9 0.3 0.0 candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
478 )
> @AlphaAtlas got the numpy implementation working and it seems to improve the performance as expected; the PR (#277) is still open but I should be merging it soon.
Any idea when this might be merged to a new release build?
@brandonj60 just merged and published to v0.1.56
@AlphaAtlas do you mind testing with the latest version? There will still be some GPU utilization drop (sampling is not GPU accelerated, AFAIK) but it should generally be faster.
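For reference, rebuilding against the latest release with cuBLAS still enabled would be something along these lines (the CMAKE_ARGS / FORCE_CMAKE variables follow the project's install instructions; adjust as needed):

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python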
Interestingly, the new numpy version breaks functiontrace:
Traceback (most recent call last):
File "/home/alpha/AI/text-generation-webui/perftest.py", line 5, in <module>
llm = Llama(model_path="/home/alpha/Storage/AIModels/textui/ggmls/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin", n_gpu_layers=100, n_threads=8, n_ctx=2048, use_mlock=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/alpha/AI/text-generation-webui/venv/lib/python3.11/site-packages/llama_cpp/llama.py", line 225, in __init__
self._candidates_data.resize(3, self._n_vocab)
ValueError: cannot resize an array that references or is referenced
by another array in this way.
Use the np.resize function or refcheck=False
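Presumably the profiler holds extra references to that array, which trips numpy's refcount check on the in-place resize. As the error text itself suggests, a hypothetical local workaround (not necessarily what the library should ship, since skipping the check is only safe when nothing else uses the old buffer) would be:

```python
# In llama_cpp/llama.py, at the line shown in the traceback (hypothetical patch):
self._candidates_data.resize(3, self._n_vocab, refcheck=False)
# ...or avoid the in-place resize entirely by copying:
self._candidates_data = np.resize(self._candidates_data, (3, self._n_vocab))
```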
The gaps in the Nvidia profile are still there, but very small now :+1:
I do see the sampling GPU usage drops... maybe it doesn't matter? I will just time llama.cpp vs llama-cpp-python generating a set number of tokens.
I timed fresh builds of the llama-cpp-python script and a llama.cpp ./main script with the same large context, 120 tokens each. They are both consistently ~28 seconds, with a half-second spread or so... stream=True doesn't seem to slow down llama-cpp-python much either.
Unless I find some other evidence llama-cpp-python is slower, I think this issue is thoroughly fixed. :tada:
Thanks :+1:
After noticing a big, visibly noticeable slowdown in the ooba text UI compared to llama.cpp, I wrote a test script to profile llama-cpp-python's high-level API:
And at first glance, everything looks fine, with differences within a margin of error:
llama-cpp-python test script:
llama.cpp ./main:
So I used Nvidia nsys to profile the generation with
sudo nsys profile --gpu-metrics-device=0 python perftest.py
and then examined the generated reports with ncu-ui.
Here is a snapshot of llama.cpp's utilization:
![cpp2](https://github.com/abetlen/llama-cpp-python/assets/46462706/7687e16d-16b9-4ae5-8f89-29e35ebdbf95)
The CPU is fully saturated without interruption. The GPU is not being fully utilized, but is pretty consistently loaded as is to be expected.
Now, here is the current git commit of llama-cpp-python:
![cpppython2](https://github.com/abetlen/llama-cpp-python/assets/46462706/b7d4e1e4-3044-46d4-a414-19966d50c559)
It seems there are long pauses where the only thread doing any work is the single Python thread:
![Screenshot_14](https://github.com/abetlen/llama-cpp-python/assets/46462706/6f4c9d35-7107-4df5-a121-46acfc1bd2a4)
@Firstbober seems to have discovered that the low level API is faster than the high level API: https://github.com/abetlen/llama-cpp-python/issues/181
And @eiery seems to think that this issue predates the CUDA builds, though their token/s measurements don't line up with mine: https://github.com/oobabooga/text-generation-webui/issues/2088#issuecomment-1548872664