Closed by robbiemu 3 days ago
This was my bad. I went and looked at the llama.cpp llama.h declaration for that function, and my understanding was lacking: even when the output weights are stored as fp16 (or anything narrower), the forward pass still computes the logits in an fp32 context, so the final logits really are fp32 and the function is properly bound.
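For reference, a minimal sketch of the binding pattern in question, assuming the library loads as libllama.dylib (the exact path and the logits_as_array helper are illustrative, not part of either library):

```python
import ctypes
import numpy as np

# llama.h declares: LLAMA_API float * llama_get_logits(struct llama_context * ctx);
# so the binding's restype is a pointer to c_float, regardless of the weight dtype.
lib = ctypes.CDLL("libllama.dylib")  # assumed install path on macOS
lib.llama_get_logits.argtypes = [ctypes.c_void_p]
lib.llama_get_logits.restype = ctypes.POINTER(ctypes.c_float)

def logits_as_array(ctx, n_tokens: int, n_vocab: int) -> np.ndarray:
    """Wrap the fp32 logits buffer as a numpy view, without copying."""
    ptr = lib.llama_get_logits(ctx)
    return np.ctypeslib.as_array(ptr, shape=(n_tokens, n_vocab))
```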
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected llama-cpp-python to do.

With logits_all = True, the scores shape should be (supplied_ctx, n_vocab), and the dtype should reflect the model's actual output dtype.
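A minimal sketch of the access pattern this refers to, using llama-cpp-python's high-level API (the model path and prompt are placeholders):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-bf16.gguf",  # placeholder path
    n_ctx=8192,
    logits_all=True,  # keep logits for every position, not just the last token
)
llm.eval(llm.tokenize(b"some prompt text"))

print(llm.scores.shape, llm.scores.dtype)  # (n_ctx, n_vocab), always float32 today
```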
Current Behavior
llama_cpp.llama_get_logits() returns an array of ctypes.c_float, which is always fp32. Almost no GGUF model you can find actually keeps its output weights in fp32: unquantized models are typically fp16 or bf16, k-type quantized models with the "L" (large) suffix keep 8-bit output weights, and the rest (k_m, k_s, etc.) may be smaller still. It is understandable that np.ctypeslib.as_array() can't meaningfully (or at least easily) go below fp16, but it should at least go that far.
For my model, with an 8192-token context and a 256000-entry vocabulary, that works out to roughly 4 GB of wasted space per chunk: 8192 × 256000 × 4 bytes ≈ 8.4 GB at fp32, versus ≈ 4.2 GB at fp16.
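Since llama.cpp hands back an fp32 buffer either way, the only way to reclaim that space on the Python side is an explicit downcast copy before the chunk is stored; a sketch (the helper name and usage are illustrative):

```python
import numpy as np

def shrink_logits(scores: np.ndarray) -> np.ndarray:
    """Downcast fp32 logits to fp16 before storing a chunk.

    Halves memory per chunk at the cost of one copy and some precision;
    fp16 overflows past |x| > 65504, but logits stay far below that.
    """
    return scores.astype(np.float16)

# for the shapes in this issue: (8192, 256000) fp32 ≈ 8.4 GB -> fp16 ≈ 4.2 GB
small = shrink_logits(np.random.rand(4, 8).astype(np.float32))
print(small.dtype)  # float16
```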
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
$ system_profiler SPHardwareDataType
...
Model Name: MacBook Pro
Model Identifier: Mac15,9
Model Number: Z1CM0013LLL/A
Chip: Apple M3 Max
Total Number of Cores: 16 (12 performance and 4 efficiency)
Memory: 48 GB
OS Loader Version: 11881.41.5
...
$ uname -a
Darwin xiao-mbp 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:05:23 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6031 arm64

$ python3 --version
Python 3.12.0
I have llama.cpp installed from Homebrew, so there was no compile step:

$ llama-server --version
version: 3912 (edc26566)
built with Apple clang version 16.0.0 (clang-1600.0.26.3) for arm64-apple-darwin24.0.0
Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
[llama_gguf_optmize v0.5.2] 17:49:28 - DEBUG - Logits shape (8192, 256000) dtype float32

That output was produced with the bf16 (uncompressed) GGUF of a model with this config: