abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Performance issues with high level API #232

Closed. AlphaAtlas closed this issue 1 year ago

AlphaAtlas commented 1 year ago

After noticing a large, clearly visible slowdown in the ooba text UI compared to llama.cpp, I wrote a test script to profile llama-cpp-python's high-level API:

from llama_cpp import Llama
llm = Llama(model_path="/home/alpha/Storage/AIModels/textui/metharme-7b-4bit-ggml-q4_1/ggml-model-q4_1.bin", n_gpu_layers=31, n_threads=8)
output = llm("""test prompt goes here""", max_tokens=300, stop=[], echo=True)
print(output)

And at first glance everything looks fine, with the differences within a margin of error:

llama-cpp-python test script:

llama_print_timings:      sample time =    31.18 ms /    89 runs   (    0.35 ms per token)
llama_print_timings: prompt eval time =   689.20 ms /    52 tokens (   13.25 ms per token)
llama_print_timings:        eval time =  5664.22 ms /    88 runs   (   64.37 ms per token)

llama.cpp ./main:

llama_print_timings:      sample time =    95.47 ms /   145 runs   (    0.66 ms per token)
llama_print_timings: prompt eval time =   686.61 ms /    53 tokens (   12.95 ms per token)
llama_print_timings:        eval time =  9823.06 ms /   144 runs   (   68.22 ms per token)

So I used Nvidia's nsys to profile the generation with sudo nsys profile --gpu-metrics-device=0 python perftest.py and then examined the generated reports with ncu-ui.

Here is a snapshot of llama.cpp's utilization: cpp2

The CPU is fully saturated without interruption. The GPU is not fully utilized, but it is fairly consistently loaded, as is to be expected.

Now, here is the same profile for the current git commit of llama-cpp-python: cpppython2

There seem to be long pauses where the only thread doing any work is the single Python thread: Screenshot_14

@Firstbober seems to have discovered that the low-level API is faster than the high-level API: https://github.com/abetlen/llama-cpp-python/issues/181

And @eiery seems to think that this issue predates the CUDA builds, though their tokens/s measurements don't line up with mine: https://github.com/oobabooga/text-generation-webui/issues/2088#issuecomment-1548872664

AlphaAtlas commented 1 year ago

Here is a graph of koboldcpp for reference, which generates text in "chunks" for its streaming mode:

koboldcpp2

AlphaAtlas commented 1 year ago

Also, I can upload these profile files if anyone wants to take a peek for themselves, just ask. :+1:

The next steps are to profile the low-level API and to profile both APIs with py-spy, but I don't have either set up yet.

abetlen commented 1 year ago

@AlphaAtlas thank you for the detailed report; aside from py-spy, it may also be useful to profile memory allocation / usage.
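
For reference, the Python-side allocations can be tracked with the standard library's tracemalloc; a minimal sketch wrapping the test call (placeholder model path; note that this only sees Python allocations, not the C library's own memory):

import tracemalloc
from llama_cpp import Llama

llm = Llama(model_path="ggml-model-q4_1.bin", n_threads=8)  # placeholder path

tracemalloc.start()
output = llm("test prompt goes here", max_tokens=300, stop=[], echo=True)
current, peak = tracemalloc.get_traced_memory()
print(f"python allocations: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")

# top allocation sites, to see whether the sampling path shows up
for stat in tracemalloc.take_snapshot().statistics("lineno")[:10]:
    print(stat)
tracemalloc.stop()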

My guess is that it has to do with memory allocations; that's one area where I've probably been a little too lazy / unprincipled. In particular, I suspect it's the allocating / freeing of new candidates arrays between calls to the llama.cpp sampling functions, which should really only be done once.

ghost commented 1 year ago

@AlphaAtlas If you compare timings between llama.cpp and llama-cpp-python with CPU only do you see a difference?

ghost commented 1 year ago

I did some tests with your script as that's a better way to test the lib itself directly. Here are my results; I modified the script to run on 4 threads as that's what my computer can do. I also used a longer prompt to test both ingestion and generation. All tests were done on 7B q4_0 with OpenBLAS.

original llama.cpp

llama_print_timings:        load time = 28131.71 ms
llama_print_timings:      sample time =   341.41 ms /   100 runs   (    3.41 ms per token)
llama_print_timings: prompt eval time = 26056.49 ms /   353 tokens (   73.81 ms per token)
llama_print_timings:        eval time = 25333.35 ms /    99 runs   (  255.89 ms per token)
llama_print_timings:       total time = 53818.36 ms

llama-cpp-python with script

llama_print_timings:        load time = 13476.72 ms
llama_print_timings:      sample time =    65.66 ms /   100 runs   (    0.66 ms per token)
llama_print_timings: prompt eval time = 13476.50 ms /   125 tokens (  107.81 ms per token)
llama_print_timings:        eval time = 22492.65 ms /    99 runs   (  227.20 ms per token)
llama_print_timings:       total time = 41234.30 ms

I ran both of these examples multiple times and the results above are representative of what I saw on average. What's interesting is that llama-cpp-python is slightly faster in generation, possibly due to lack of streaming.

With the webui, eval time is still in the order of ~350ms a token.

ghost commented 1 year ago

Here are webui results with streaming turned on and off as a comparison. Without streaming, the webui results match what I'm seeing with @AlphaAtlas's script.

Streaming on:

llama_print_timings:        eval time = 25196.29 ms /    77 runs   (  327.22 ms per token)

Streaming off:

llama_print_timings:        eval time = 22490.24 ms /    99 runs   (  227.17 ms per token)

AlphaAtlas commented 1 year ago

(Quoting the test results from the comment above.)

I think you have to take the results with a grain of salt, as the timings for me were almost exactly the same even though llama-cpp-python was visibly slower. Maybe llama.cpp is "sleeping" some of the time and hence that time is not part of its performance metrics?

Not sure about the llama.cpp metrics difference... Maybe Python is eating some CPU time doing something, which would be more visible on a 4 thread machine.

abetlen commented 1 year ago

@eiery @AlphaAtlas the llama.cpp timings only cover the portion of time spent in the llama.cpp eval and sample library functions, not the rest of the program. A flamegraph would give a better indication of total time, but it could be susceptible to errors from the sampling method.
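
Since those counters only cover the library calls, a fair apples-to-apples comparison probably needs an end-to-end wall-clock number as well; a minimal sketch (placeholder model path, and the OpenAI-style usage field in the returned dict is assumed):

import time
from llama_cpp import Llama

llm = Llama(model_path="ggml-model-q4_0.bin", n_threads=4)  # placeholder path

t0 = time.perf_counter()
output = llm("test prompt goes here", max_tokens=100, stop=[], echo=True)
elapsed = time.perf_counter() - t0

generated = output["usage"]["completion_tokens"]  # assumed OpenAI-style response dict
print(f"wall clock: {elapsed:.2f} s total, {generated / elapsed:.2f} tokens/s end-to-end")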

AlphaAtlas commented 1 year ago

Here is llama-cpp-python with gpu-layers set to 0; as expected, it still processes the prompt but doesn't use the GPU afterwards. It looks like it's still bound to a single thread at intervals: Screenshot_15

Here is FunctionTrace lined up with the CUDA profiler... sort of. The profiles don't start at the same time, so it may be offset, but the intervals look about right: Screenshot_16

Also, all of this was done on the current git commit.

I still need to test the low-level API and memory allocation. If benchmarking it is still necessary, would you mind posting an example "benchmark" script for the low-level API?
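
(In the meantime, here is a rough sketch of what such a timing loop could look like, pieced together only from the low-level calls the wrapper itself makes in the profiles further down; it is not an official example, the model path is a placeholder, and a pure-Python greedy argmax stands in for the llama.cpp samplers:)

import time
import llama_cpp
from llama_cpp import Llama

llm = Llama(model_path="ggml-model-q4_0.bin", n_threads=8)  # placeholder path
tokens = llm.tokenize(b"test prompt goes here")
n_vocab = llm._n_vocab  # private attribute, visible in the profiles below
n_past = 0

t0 = time.perf_counter()
for _ in range(120):  # the whole prompt on the first pass, then one token at a time
    rc = llama_cpp.llama_eval(
        ctx=llm.ctx,
        tokens=(llama_cpp.llama_token * len(tokens))(*tokens),
        n_tokens=llama_cpp.c_int(len(tokens)),
        n_past=llama_cpp.c_int(n_past),
        n_threads=llama_cpp.c_int(8),
    )
    if rc != 0:
        raise RuntimeError(f"llama_eval returned {rc}")
    n_past += len(tokens)
    logits = llama_cpp.llama_get_logits(llm.ctx)
    next_token = max(range(n_vocab), key=lambda i: logits[i])  # greedy pick; adds its own Python overhead
    tokens = [next_token]
elapsed = time.perf_counter() - t0
print(f"{120 / elapsed:.2f} tokens/s (eval + greedy argmax)")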

AlphaAtlas commented 1 year ago

Oh and here is the full python profile. You can open it with https://profiler.firefox.com.

functiontrace.2023-05-19-16:47:26.718.json.gz

abetlen commented 1 year ago

@AlphaAtlas based on your latest screenshot

image

Quite a bit of time is being spent in a list comprehension inside the sampling methods. I've pushed a commit that replaces this list comprehension with a simple assignment into a data structure that is created once when the Llama object is initialized. I've profiled the change and it does seem to reduce sampling time in practice.
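
Roughly, the change amounts to something like the following (a schematic before/after sketch, not the literal commit; the struct and field names are the ones visible in the line_profiler output further down):

import llama_cpp

n_vocab = 32000  # example vocabulary size
logits = [0.0] * n_vocab  # stand-in for the logits list produced in eval()

# before: every sample built a brand new array of llama_token_data structs
candidates_before = (llama_cpp.llama_token_data * n_vocab)(
    *[llama_cpp.llama_token_data(id=i, logit=logit, p=0.0) for i, logit in enumerate(logits)]
)

# after: allocate the buffer once (at Llama initialization) ...
candidates_data = (llama_cpp.llama_token_data * n_vocab)()

# ... and on each sample only overwrite its fields in place
for i, (data, logit) in enumerate(zip(candidates_data, logits)):
    data.id = llama_cpp.llama_token(i)
    data.logit = llama_cpp.c_float(logit)
    data.p = llama_cpp.c_float(0.0)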

Let me know if it's any better for you.

AlphaAtlas commented 1 year ago

Sorry my testing is so intermittent! The latest commit did narrow the gap, but the "downtime" where most cores are idle is still there:

Screenshot_18

The test above ^ is the latest commit with cuBLAS and a q4_0 model, and this time I am using the full 2000-token context instead of a short testing prompt.

AlphaAtlas commented 1 year ago

Also, here are the full steps for testing (on Linux) in case anyone wants to try this out.

You can just use functiontrace and a CPU graph if you don't use the Nvidia profiler.

abetlen commented 1 year ago

@AlphaAtlas I think I've narrowed it down now: it looks like there's about a 20 ms pause per sample where the last logits are copied into the candidates token_data_array. Other than that, the majority of time is spent inside the C++ functions in the shared library. I'll work on removing this delay, though it's a little challenging since I can't just memcpy the logits because the candidate data is an array of structs.

Here's the line-by-line profiling I got by using line_profiler (see the _sample function):
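
(As an aside, similar tables can be produced by driving line_profiler directly from Python instead of the @profile decorator used below; a rough sketch, with a run() wrapper standing in for the test script:)

from line_profiler import LineProfiler
from llama_cpp import Llama

def run():
    llm = Llama(model_path="ggml-model-q4_0.bin", n_threads=8)  # placeholder path
    llm("test prompt goes here", max_tokens=100, stop=[], echo=True)

lp = LineProfiler()
lp.add_function(Llama.eval)       # the three functions shown in the tables below
lp.add_function(Llama._sample)
lp.add_function(Llama.generate)
lp(run)()                         # run the wrapped driver
lp.print_stats()                  # prints Hits / Time / Per Hit / % Time tables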

Wrote profile results to perftest.py.lprof
Timer unit: 1e-06 s

Total time: 45.4287 s
File: /home/andrei/Documents/llms/llama_cpp/llama.py
Function: eval at line 271

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   271                                               @profile
   272                                               def eval(self, tokens: Sequence[int]):
   273                                                   """Evaluate a list of tokens.
   274                                           
   275                                                   Args:
   276                                                       tokens: The list of tokens to evaluate.
   277                                                   """
   278        80         26.6      0.3      0.0          assert self.ctx is not None
   279        80         18.0      0.2      0.0          n_ctx = self._n_ctx
   280        83        129.1      1.6      0.0          for i in range(0, len(tokens), self.n_batch):
   281        83        118.8      1.4      0.0              batch = tokens[i : min(len(tokens), i + self.n_batch)]
   282        83         58.2      0.7      0.0              n_past = min(n_ctx - len(batch), len(self.eval_tokens))
   283        83         19.0      0.2      0.0              n_tokens = len(batch)
   284        83   45370154.8 546628.4     99.9              return_code = llama_cpp.llama_eval(
   285        83         17.8      0.2      0.0                  ctx=self.ctx,
   286        83        375.8      4.5      0.0                  tokens=(llama_cpp.llama_token * len(batch))(*batch),
   287        83         34.3      0.4      0.0                  n_tokens=llama_cpp.c_int(n_tokens),
   288        83         20.4      0.2      0.0                  n_past=llama_cpp.c_int(n_past),
   289        83         32.1      0.4      0.0                  n_threads=llama_cpp.c_int(self.n_threads),
   290                                                       )
   291        83        241.6      2.9      0.0              if return_code != 0:
   292                                                           raise RuntimeError(f"llama_eval returned {return_code}")
   293                                                       # Save tokens
   294        83        468.0      5.6      0.0              self.eval_tokens.extend(batch)
   295                                                       # Save logits
   296        83        246.0      3.0      0.0              rows = n_tokens if self.params.logits_all else 1
   297        83         56.7      0.7      0.0              n_vocab = self._n_vocab
   298        83         82.6      1.0      0.0              cols = n_vocab
   299        83       1145.8     13.8      0.0              logits_view = llama_cpp.llama_get_logits(self.ctx)
   300        83      40884.4    492.6      0.1              logits = [logits_view[i * cols : (i + 1) * cols] for i in range(rows)]
   301        83      14575.1    175.6      0.0              self.eval_logits.extend(logits)

Total time: 2.81234 s
File: /home/andrei/Documents/llms/llama_cpp/llama.py
Function: _sample at line 303

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   303                                               @profile
   304                                               def _sample(
   305                                                   self,
   306                                                   last_n_tokens_data,  # type: llama_cpp.Array[llama_cpp.llama_token]
   307                                                   last_n_tokens_size: llama_cpp.c_int,
   308                                                   top_k: llama_cpp.c_int,
   309                                                   top_p: llama_cpp.c_float,
   310                                                   temp: llama_cpp.c_float,
   311                                                   tfs_z: llama_cpp.c_float,
   312                                                   repeat_penalty: llama_cpp.c_float,
   313                                                   frequency_penalty: llama_cpp.c_float,
   314                                                   presence_penalty: llama_cpp.c_float,
   315                                                   mirostat_mode: llama_cpp.c_int,
   316                                                   mirostat_tau: llama_cpp.c_float,
   317                                                   mirostat_eta: llama_cpp.c_float,
   318                                                   penalize_nl: bool = True,
   319                                               ):
   320        80         54.6      0.7      0.0          assert self.ctx is not None
   321                                                   # assert len(self.eval_logits) > 0
   322        80         27.4      0.3      0.0          n_vocab = self._n_vocab
   323        80         21.8      0.3      0.0          n_ctx = self._n_ctx
   324        80         53.6      0.7      0.0          top_k = llama_cpp.c_int(n_vocab) if top_k.value <= 0 else top_k
   325        80         10.8      0.1      0.0          last_n_tokens_size = (
   326        80         11.9      0.1      0.0              llama_cpp.c_int(n_ctx)
   327        80         16.2      0.2      0.0              if last_n_tokens_size.value < 0
   328        80         10.3      0.1      0.0              else last_n_tokens_size
   329                                                   )
   330        80         37.6      0.5      0.0          logits = self.eval_logits[-1]
   331        80         25.6      0.3      0.0          nl_logit = logits[self._token_nl]
   332        80         17.6      0.2      0.0          candidates = self._candidates
   333   2560080     713785.8      0.3     25.4          for i, (data, logit) in enumerate(zip(candidates.data, logits)):
   334   2560080     704819.0      0.3     25.1              data.id = llama_cpp.llama_token(i)
   335   2560080     655928.1      0.3     23.3              data.logit = llama_cpp.c_float(logit)
   336   2560080     699865.4      0.3     24.9              data.p = llama_cpp.c_float(0.0)
   337        80         55.8      0.7      0.0          candidates.sorted = llama_cpp.c_bool(False)
   338        80         60.8      0.8      0.0          candidates.size = llama_cpp.c_size_t(n_vocab)
   339        80      33448.7    418.1      1.2          llama_cpp.llama_sample_repetition_penalty(
   340        80         26.4      0.3      0.0              ctx=self.ctx,
   341        80         16.5      0.2      0.0              last_tokens_data=last_n_tokens_data,
   342        80         12.1      0.2      0.0              last_tokens_size=last_n_tokens_size,
   343        80         87.6      1.1      0.0              candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   344        80         10.9      0.1      0.0              penalty=repeat_penalty,
   345                                                   )
   346        80        285.0      3.6      0.0          llama_cpp.llama_sample_frequency_and_presence_penalties(
   347        80         18.6      0.2      0.0              ctx=self.ctx,
   348        80         39.3      0.5      0.0              candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   349        80         18.9      0.2      0.0              last_tokens_data=last_n_tokens_data,
   350        80         14.5      0.2      0.0              last_tokens_size=last_n_tokens_size,
   351        80         16.5      0.2      0.0              alpha_frequency=frequency_penalty,
   352        80          9.5      0.1      0.0              alpha_presence=presence_penalty,
   353                                                   )
   354        80         16.7      0.2      0.0          if not penalize_nl:
   355                                                       candidates.data[self._token_nl].logit = llama_cpp.c_float(nl_logit)
   356        80         64.3      0.8      0.0          if temp.value == 0.0:
   357                                                       return llama_cpp.llama_sample_token_greedy(
   358                                                           ctx=self.ctx,
   359                                                           candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   360                                                       )
   361        80         27.4      0.3      0.0          elif mirostat_mode.value == 1:
   362                                                       mirostat_mu = llama_cpp.c_float(2.0 * mirostat_tau.value)
   363                                                       mirostat_m = llama_cpp.c_int(100)
   364                                                       llama_cpp.llama_sample_temperature(
   365                                                           ctx=self.ctx,
   366                                                           candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   367                                                           temp=temp,
   368                                                       )
   369                                                       return llama_cpp.llama_sample_token_mirostat(
   370                                                           ctx=self.ctx,
   371                                                           candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   372                                                           tau=mirostat_tau,
   373                                                           eta=mirostat_eta,
   374                                                           mu=llama_cpp.ctypes.byref(mirostat_mu),  # type: ignore
   375                                                           m=mirostat_m,
   376                                                       )
   377        80         25.2      0.3      0.0          elif mirostat_mode.value == 2:
   378                                                       mirostat_mu = llama_cpp.c_float(2.0 * mirostat_tau.value)
   379                                                       llama_cpp.llama_sample_temperature(
   380                                                           ctx=self.ctx,
   381                                                           candidates=llama_cpp.ctypes.pointer(candidates),
   382                                                           temp=temp,
   383                                                       )
   384                                                       return llama_cpp.llama_sample_token_mirostat_v2(
   385                                                           ctx=self.ctx,
   386                                                           candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   387                                                           tau=mirostat_tau,
   388                                                           eta=mirostat_eta,
   389                                                           mu=llama_cpp.ctypes.byref(mirostat_mu),  # type: ignore
   390                                                       )
   391                                                   else:
   392        80       2192.0     27.4      0.1              llama_cpp.llama_sample_top_k(
   393        80         17.0      0.2      0.0                  ctx=self.ctx,
   394        80         27.9      0.3      0.0                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   395        80         13.0      0.2      0.0                  k=top_k,
   396        80         32.6      0.4      0.0                  min_keep=llama_cpp.c_size_t(1),
   397                                                       )
   398        80        159.8      2.0      0.0              llama_cpp.llama_sample_tail_free(
   399        80         20.5      0.3      0.0                  ctx=self.ctx,
   400        80         34.4      0.4      0.0                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   401        80         19.6      0.2      0.0                  z=tfs_z,
   402        80         29.7      0.4      0.0                  min_keep=llama_cpp.c_size_t(1),
   403                                                       )
   404        80        105.2      1.3      0.0              llama_cpp.llama_sample_typical(
   405        80         23.0      0.3      0.0                  ctx=self.ctx,
   406        80         25.6      0.3      0.0                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   407        80         25.2      0.3      0.0                  p=llama_cpp.c_float(1.0),
   408        80         22.2      0.3      0.0                  min_keep=llama_cpp.c_size_t(1),
   409                                                       )
   410        80        157.1      2.0      0.0              llama_cpp.llama_sample_top_p(
   411        80         17.2      0.2      0.0                  ctx=self.ctx,
   412        80         23.9      0.3      0.0                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   413        80         14.5      0.2      0.0                  p=top_p,
   414        80         22.1      0.3      0.0                  min_keep=llama_cpp.c_size_t(1),
   415                                                       )
   416        80        130.3      1.6      0.0              llama_cpp.llama_sample_temperature(
   417        80         16.8      0.2      0.0                  ctx=self.ctx,
   418        80         24.6      0.3      0.0                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   419        80         15.0      0.2      0.0                  temp=temp,
   420                                                       )
   421        80        192.2      2.4      0.0              return llama_cpp.llama_sample_token(
   422        80         16.4      0.2      0.0                  ctx=self.ctx,
   423        80         24.9      0.3      0.0                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   424                                                       )

Total time: 50.3015 s
File: /home/andrei/Documents/llms/llama_cpp/llama.py
Function: generate at line 473

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   473                                               @profile
   474                                               def generate(
   475                                                   self,
   476                                                   tokens: Sequence[int],
   477                                                   top_k: int = 40,
   478                                                   top_p: float = 0.95,
   479                                                   temp: float = 0.80,
   480                                                   repeat_penalty: float = 1.1,
   481                                                   reset: bool = True,
   482                                                   frequency_penalty: float = 0.0,
   483                                                   presence_penalty: float = 0.0,
   484                                                   tfs_z: float = 1.0,
   485                                                   mirostat_mode: int = 0,
   486                                                   mirostat_tau: float = 5.0,
   487                                                   mirostat_eta: float = 0.1,
   488                                               ) -> Generator[int, Optional[Sequence[int]], None]:
   489                                                   """Create a generator of tokens from a prompt.
   490                                           
   491                                                   Examples:
   492                                                       >>> llama = Llama("models/ggml-7b.bin")
   493                                                       >>> tokens = llama.tokenize(b"Hello, world!")
   494                                                       >>> for token in llama.generate(tokens, top_k=40, top_p=0.95, temp=1.0, repeat_penalty=1.1):
   495                                                       ...     print(llama.detokenize([token]))
   496                                           
   497                                                   Args:
   498                                                       tokens: The prompt tokens.
   499                                                       top_k: The top-k sampling parameter.
   500                                                       top_p: The top-p sampling parameter.
   501                                                       temp: The temperature parameter.
   502                                                       repeat_penalty: The repeat penalty parameter.
   503                                                       reset: Whether to reset the model state.
   504                                           
   505                                                   Yields:
   506                                                       The generated tokens.
   507                                                   """
   508         1          0.9      0.9      0.0          assert self.ctx is not None
   509                                           
   510         1          0.9      0.9      0.0          if reset and len(self.eval_tokens) > 0:
   511                                                       longest_prefix = 0
   512                                                       for a, b in zip(self.eval_tokens, tokens[:-1]):
   513                                                           if a == b:
   514                                                               longest_prefix += 1
   515                                                           else:
   516                                                               break
   517                                                       if longest_prefix > 0:
   518                                                           if self.verbose:
   519                                                               print("Llama.generate: prefix-match hit", file=sys.stderr)
   520                                                           reset = False
   521                                                           tokens = tokens[longest_prefix:]
   522                                                           for _ in range(len(self.eval_tokens) - longest_prefix):
   523                                                               self.eval_tokens.pop()
   524                                                               try:
   525                                                                   self.eval_logits.pop()
   526                                                               except IndexError:
   527                                                                   pass
   528                                           
   529         1          0.4      0.4      0.0          if reset:
   530         1          2.3      2.3      0.0              self.reset()
   531                                           
   532                                                   while True:
   533        80   45430708.1 567883.9     90.3              self.eval(tokens)
   534        80    4870570.6  60882.1      9.7              token = self.sample(
   535        80         28.4      0.4      0.0                  top_k=top_k,
   536        80         12.2      0.2      0.0                  top_p=top_p,
   537        80         20.7      0.3      0.0                  temp=temp,
   538        80          9.7      0.1      0.0                  repeat_penalty=repeat_penalty,
   539        80         28.9      0.4      0.0                  frequency_penalty=frequency_penalty,
   540        80          9.9      0.1      0.0                  presence_penalty=presence_penalty,
   541        80         19.5      0.2      0.0                  tfs_z=tfs_z,
   542        80         21.2      0.3      0.0                  mirostat_mode=mirostat_mode,
   543        80         26.4      0.3      0.0                  mirostat_tau=mirostat_tau,
   544        80         12.4      0.2      0.0                  mirostat_eta=mirostat_eta,
   545                                                       )
   546        80         11.6      0.1      0.0              tokens_or_none = yield token
   547        79         37.6      0.5      0.0              tokens = [token]
   548        79         22.1      0.3      0.0              if tokens_or_none is not None:
   549                                                           tokens.extend(tokens_or_none)

abetlen commented 1 year ago

I think the solution is to just move to numpy; it introduces an additional dependency, but it should reduce memory usage and speed up a few sections like this.
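
Concretely, the idea is to mirror the llama_token_data struct with a numpy structured array, so the per-token Python loop becomes a few vectorized writes and the C samplers can read the same memory without a copy; a rough sketch (the id/logit/p layout is taken from the struct fields used in the profile above):

import numpy as np
import llama_cpp

n_vocab = 32000  # example vocabulary size

# one-time allocation: a structured array laid out like llama_token_data[n_vocab]
candidates_dtype = np.dtype([("id", np.intc), ("logit", np.single), ("p", np.single)])
candidates_data = np.empty(n_vocab, dtype=candidates_dtype)

# per sample: three vectorized writes instead of a 32k-iteration Python loop
logits = np.zeros(n_vocab, dtype=np.single)  # stand-in for the logits returned by llama_get_logits
candidates_data["id"] = np.arange(n_vocab, dtype=np.intc)
candidates_data["logit"] = logits
candidates_data["p"] = np.zeros(n_vocab, dtype=np.single)

# hand the same memory to the C side without copying
data_ptr = candidates_data.ctypes.data_as(llama_cpp.llama_token_data_p)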

laurids-reichardt commented 1 year ago

I think the solution is to just move to numpy; it introduces an additional dependency, but it should reduce memory usage and speed up a few sections like this.

Chances are high that whoever uses this package already has numpy installed in their environment anyway.

abetlen commented 1 year ago

@AlphaAtlas I got the numpy implementation working and it seems to improve the performance as expected. The PR (#277) is still open, but I should be merging it soon.

Total time: 0.0551852 s
File: /home/andrei/Documents/llms/llama_cpp/llama.py
Function: _sample at line 345

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   345                                               @profile
   346                                               def _sample(
   347                                                   self,
   348                                                   last_n_tokens_data,  # type: llama_cpp.Array[llama_cpp.llama_token]
   349                                                   last_n_tokens_size: llama_cpp.c_int,
   350                                                   top_k: llama_cpp.c_int,
   351                                                   top_p: llama_cpp.c_float,
   352                                                   temp: llama_cpp.c_float,
   353                                                   tfs_z: llama_cpp.c_float,
   354                                                   repeat_penalty: llama_cpp.c_float,
   355                                                   frequency_penalty: llama_cpp.c_float,
   356                                                   presence_penalty: llama_cpp.c_float,
   357                                                   mirostat_mode: llama_cpp.c_int,
   358                                                   mirostat_tau: llama_cpp.c_float,
   359                                                   mirostat_eta: llama_cpp.c_float,
   360                                                   penalize_nl: bool = True,
   361                                                   logits_processor: Optional[LogitsProcessorList] = None,
   362                                               ):
   363        80         91.1      1.1      0.2          assert self.ctx is not None
   364        80         75.4      0.9      0.1          assert len(self.eval_logits) > 0
   365        80         89.2      1.1      0.2          assert self._scores.shape[0] > 0
   366        80         31.9      0.4      0.1          n_vocab = self._n_vocab
   367        80         35.1      0.4      0.1          n_ctx = self._n_ctx
   368        80         64.9      0.8      0.1          top_k = llama_cpp.c_int(n_vocab) if top_k.value <= 0 else top_k
   369        80         12.0      0.2      0.0          last_n_tokens_size = (
   370        80         12.4      0.2      0.0              llama_cpp.c_int(n_ctx)
   371        80         36.7      0.5      0.1              if last_n_tokens_size.value < 0
   372        80         23.1      0.3      0.0              else last_n_tokens_size
   373                                                   )
   374        80         86.9      1.1      0.2          logits: npt.NDArray[np.single] = self._scores[-1, :]
   375                                           
   376        80         37.6      0.5      0.1          if logits_processor is not None:
   377                                                       logits = np.array(
   378                                                           logits_processor(self._input_ids.tolist(), logits.tolist()),
   379                                                           dtype=np.single,
   380                                                       )
   381                                                       self._scores[-1, :] = logits
   382                                                       self.eval_logits[-1] = logits.tolist()
   383                                           
   384        80         98.0      1.2      0.2          nl_logit = logits[self._token_nl]
   385        80         29.1      0.4      0.1          candidates = self._candidates
   386        80         17.5      0.2      0.0          candidates_data = self._candidates_data
   387        80       6221.9     77.8     11.3          candidates_data["id"] = np.arange(n_vocab, dtype=np.intc)  # type: ignore
   388        80       4137.7     51.7      7.5          candidates_data["logit"] = logits
   389        80       3356.6     42.0      6.1          candidates_data["p"] = np.zeros(n_vocab, dtype=np.single)
   390        80       1924.4     24.1      3.5          candidates.data = candidates_data.ctypes.data_as(llama_cpp.llama_token_data_p)
   391        80         92.3      1.2      0.2          candidates.sorted = llama_cpp.c_bool(False)
   392        80         88.9      1.1      0.2          candidates.size = llama_cpp.c_size_t(n_vocab)
   393        80      34102.2    426.3     61.8          llama_cpp.llama_sample_repetition_penalty(
   394        80         29.6      0.4      0.1              ctx=self.ctx,
   395        80         17.7      0.2      0.0              last_tokens_data=last_n_tokens_data,
   396        80         20.5      0.3      0.0              last_tokens_size=last_n_tokens_size,
   397        80         64.2      0.8      0.1              candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   398        80         11.9      0.1      0.0              penalty=repeat_penalty,
   399                                                   )
   400        80        338.2      4.2      0.6          llama_cpp.llama_sample_frequency_and_presence_penalties(
   401        80         25.8      0.3      0.0              ctx=self.ctx,
   402        80         55.8      0.7      0.1              candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   403        80         16.2      0.2      0.0              last_tokens_data=last_n_tokens_data,
   404        80         26.2      0.3      0.0              last_tokens_size=last_n_tokens_size,
   405        80         12.2      0.2      0.0              alpha_frequency=frequency_penalty,
   406        80         11.8      0.1      0.0              alpha_presence=presence_penalty,
   407                                                   )
   408        80         27.7      0.3      0.1          if not penalize_nl:
   409                                                       candidates.data[self._token_nl].logit = llama_cpp.c_float(nl_logit)
   410        80         52.0      0.6      0.1          if temp.value == 0.0:
   411                                                       return llama_cpp.llama_sample_token_greedy(
   412                                                           ctx=self.ctx,
   413                                                           candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   414                                                       )
   415        80         40.8      0.5      0.1          elif mirostat_mode.value == 1:
   416                                                       mirostat_mu = llama_cpp.c_float(2.0 * mirostat_tau.value)
   417                                                       mirostat_m = llama_cpp.c_int(100)
   418                                                       llama_cpp.llama_sample_temperature(
   419                                                           ctx=self.ctx,
   420                                                           candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   421                                                           temp=temp,
   422                                                       )
   423                                                       return llama_cpp.llama_sample_token_mirostat(
   424                                                           ctx=self.ctx,
   425                                                           candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   426                                                           tau=mirostat_tau,
   427                                                           eta=mirostat_eta,
   428                                                           mu=llama_cpp.ctypes.byref(mirostat_mu),  # type: ignore
   429                                                           m=mirostat_m,
   430                                                       )
   431        80         34.2      0.4      0.1          elif mirostat_mode.value == 2:
   432                                                       mirostat_mu = llama_cpp.c_float(2.0 * mirostat_tau.value)
   433                                                       llama_cpp.llama_sample_temperature(
   434                                                           ctx=self.ctx,
   435                                                           candidates=llama_cpp.ctypes.pointer(candidates),
   436                                                           temp=temp,
   437                                                       )
   438                                                       return llama_cpp.llama_sample_token_mirostat_v2(
   439                                                           ctx=self.ctx,
   440                                                           candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   441                                                           tau=mirostat_tau,
   442                                                           eta=mirostat_eta,
   443                                                           mu=llama_cpp.ctypes.byref(mirostat_mu),  # type: ignore
   444                                                       )
   445                                                   else:
   446        80       2287.8     28.6      4.1              llama_cpp.llama_sample_top_k(
   447        80         44.3      0.6      0.1                  ctx=self.ctx,
   448        80         36.5      0.5      0.1                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   449        80         29.7      0.4      0.1                  k=top_k,
   450        80         33.4      0.4      0.1                  min_keep=llama_cpp.c_size_t(1),
   451                                                       )
   452        80        178.4      2.2      0.3              llama_cpp.llama_sample_tail_free(
   453        80         22.2      0.3      0.0                  ctx=self.ctx,
   454        80         44.5      0.6      0.1                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   455        80         25.8      0.3      0.0                  z=tfs_z,
   456        80         29.7      0.4      0.1                  min_keep=llama_cpp.c_size_t(1),
   457                                                       )
   458        80        127.9      1.6      0.2              llama_cpp.llama_sample_typical(
   459        80         25.1      0.3      0.0                  ctx=self.ctx,
   460        80         33.1      0.4      0.1                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   461        80         36.4      0.5      0.1                  p=llama_cpp.c_float(1.0),
   462        80         28.1      0.4      0.1                  min_keep=llama_cpp.c_size_t(1),
   463                                                       )
   464        80        166.2      2.1      0.3              llama_cpp.llama_sample_top_p(
   465        80         31.3      0.4      0.1                  ctx=self.ctx,
   466        80         26.8      0.3      0.0                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   467        80         16.3      0.2      0.0                  p=top_p,
   468        80         25.3      0.3      0.0                  min_keep=llama_cpp.c_size_t(1),
   469                                                       )
   470        80        155.0      1.9      0.3              llama_cpp.llama_sample_temperature(
   471        80         28.5      0.4      0.1                  ctx=self.ctx,
   472        80         31.3      0.4      0.1                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   473        80         15.9      0.2      0.0                  temp=temp,
   474                                                       )
   475        80        206.7      2.6      0.4              return llama_cpp.llama_sample_token(
   476        80         23.3      0.3      0.0                  ctx=self.ctx,
   477        80         25.9      0.3      0.0                  candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
   478                                                       )

brandonj60 commented 1 year ago

@AlphaAtlas I got the numpy implementation working and it seems to improve the performance as expected. The PR (#277) is still open, but I should be merging it soon.

Any idea when this might be merged to a new release build?

abetlen commented 1 year ago

@brandonj60 just merged and published to v0.1.56

abetlen commented 1 year ago

@AlphaAtlas do you mind testing with the latest version? There will still be some gpu utilization drop (sampling is not gpu accelerated afaik) but it should generally be faster.

AlphaAtlas commented 1 year ago

Interestingly, the new numpy version breaks functiontrace:

Traceback (most recent call last):
  File "/home/alpha/AI/text-generation-webui/perftest.py", line 5, in <module>
    llm = Llama(model_path="/home/alpha/Storage/AIModels/textui/ggmls/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin", n_gpu_layers=100, n_threads=8, n_ctx=2048, use_mlock=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/text-generation-webui/venv/lib/python3.11/site-packages/llama_cpp/llama.py", line 225, in __init__
    self._candidates_data.resize(3, self._n_vocab)
ValueError: cannot resize an array that references or is referenced
by another array in this way.
Use the np.resize function or refcheck=False
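
(For reference, that error is stock numpy behaviour rather than anything llama-specific: ndarray.resize refuses to reallocate in place while another object references the array; a minimal reproduction:)

import numpy as np

a = np.zeros((2, 3))
a.resize(3, 3)                      # fine: nothing else references the buffer

b = np.zeros((2, 3))
view = b[0]                         # 'view' now references b's memory
try:
    b.resize(3, 3)                  # raises the ValueError seen in the traceback above
except ValueError as err:
    print(err)

b2 = np.resize(b, (3, 3))           # workaround: np.resize returns a new array instead
b.resize((3, 3), refcheck=False)    # or skip the check (unsafe while views exist)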

The gaps in the Nvidia profile are still there, but very small now :+1:

Screenshot_19 Screenshot_21

I do see the GPU usage drops during sampling... maybe it doesn't matter? I will just time llama.cpp vs. llama-cpp-python generating a set number of tokens.

AlphaAtlas commented 1 year ago

I timed fresh builds of llama-cpp-python (via the test script) and llama.cpp's ./main with the same large context, generating 120 tokens each.

They both consistently take ~28 seconds, with a half-second spread or so... stream=True doesn't seem to slow down llama-cpp-python much either.

Unless I find some other evidence llama-cpp-python is slower, I think this issue is thoroughly fixed. :tada:

Thanks :+1: