SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

Garbled output from model in Unity #178

Closed. Uralstech closed this issue 7 months ago.

Uralstech commented 1 year ago

Hi! I was trying to get the latest version of LLamaSharp working in Unity.

This is my script:

```csharp
using System.Collections.Generic;
using UnityEngine;
using LLama.Common;
using LLama;

public class LLaMASharpTest : MonoBehaviour
{
    // Start is called before the first frame update
    void Start()
    {
        DoTalkWithLlamaCpp("Who made Linux?");
    }

    private async void DoTalkWithLlamaCpp(string userRequest)
    {
        string modelPath = "path/to/llama-2-7b-guanaco-qlora.Q2_K.gguf";

        var parameters = new ModelParams(modelPath)
        {
            ContextSize = 1024,
            Seed = 1337
        };
        using var model = LLamaWeights.LoadFromFile(parameters);
        using var context = model.CreateContext(parameters);
        var executor = new InteractiveExecutor(context);

        var session = new ChatSession(executor);

        await foreach (var text in session.ChatAsync(userRequest, new InferenceParams() { Temperature = 0.6f, AntiPrompts = new List<string> { "User:" }, MaxTokens = 100 }))
        {
            Debug.Log(text);
        }
    }
}
```

And this is my output (attached screenshot showing garbled text from the model):

I am on Unity 2022.3 with the .NET Standard 2.1 API compatibility level, and am using TheBloke/llama-2-7B-Guanaco-QLoRA-GGUF as it is listed under 'Verified Model Resources' in the readme. My code is mostly derived from ChatSessionWithRoleName.cs.

I did make a few changes in the LLamaSharp code:

* Changed the file-scoped namespace declarations (e.g. `namespace LLama.Exceptions;`) to block-scoped ones, i.e. `namespace LLama.Exceptions { public abstract class GrammarFormatException ... }`, so that they compile with the C# version Unity uses.
* Similar changes in `EncodingExtensions.cs`, `LLamaBeamsState.cs`, `LLamaBeamView.cs` and `NativeApi.BeamSearch.cs`
* Removed all references to `using Microsoft.Extensions.Logging;` and `ILogger`.
* Replaced calls to `ILogger` instances with UnityEngine's `Debug.Log()`, `Debug.LogWarning()` and `Debug.LogError()`.
* Replaced all `#if NETSTANDARD2_0` and `#if !NETSTANDARD2_0` with `#if !NETSTANDARD2_1` and `#if NETSTANDARD2_1` respectively, as I believe they are compatible.
* In `LLamaContext.cs`, I replaced `var last_n_array = lastTokens.TakeLast(last_n_repeat).ToArray();` with `var last_n_array = IEnumerableExtensions.TakeLast(lastTokens, last_n_repeat).ToArray();` as I was getting the error `The call is ambiguous between the following methods or properties: 'System.Linq.Enumerable.TakeLast<TSource>(System.Collections.Generic.IEnumerable<TSource>, int)' and 'LLama.Extensions.IEnumerableExtensions.TakeLast<T>(System.Collections.Generic.IEnumerable<T>, int)'`.
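
For context, the ambiguity comes from the fact that .NET Standard 2.1 already ships `Enumerable.TakeLast`, while LLamaSharp carries its own polyfill in `LLama.Extensions.IEnumerableExtensions`, so both extension methods are in scope inside the library. A minimal sketch of the disambiguation (the wrapper class is purely illustrative and assumes it sits inside the LLamaSharp project where `IEnumerableExtensions` is visible; only the two `TakeLast` names come from the error message above):

```csharp
using System.Collections.Generic;
using System.Linq;
using LLama.Extensions;

internal static class TakeLastDisambiguation
{
    // On .NET Standard 2.1 both System.Linq.Enumerable.TakeLast and
    // LLama.Extensions.IEnumerableExtensions.TakeLast are in scope, so the
    // extension-method syntax lastTokens.TakeLast(n) is ambiguous.
    // Calling one of them as an ordinary static method picks a single overload.
    public static int[] LastN(IEnumerable<int> lastTokens, int last_n_repeat)
    {
        return IEnumerableExtensions.TakeLast(lastTokens, last_n_repeat).ToArray();
        // Or, to prefer the BCL implementation instead:
        // return Enumerable.TakeLast(lastTokens, last_n_repeat).ToArray();
    }
}
```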

Please do tell me if I did something wrong.

Thanks in advance!

#### Edit
If I decrease `ContextSize` below 1024 or increase `MaxTokens` above ~150, Unity just crashes. I have narrowed the crash down to
```csharp
public bool Eval(ReadOnlySpan<int> tokens, int n_past, int n_threads)
{
    unsafe
    {
        fixed (int* pinned = tokens)
        {
            return NativeApi.llama_eval_with_pointer(this, pinned, tokens.Length, n_past, n_threads) == 0;
        }
    }
}
```

in `SafeLLamaContextHandle.cs`.
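
For reference, a guard along these lines (a sketch only; it assumes the `llama_n_ctx` binding exposed by LLamaSharp's `NativeApi`, and the Unity-style logging described above) would show whether the crash comes from evaluating more tokens than the context can hold, rather than from the pointer marshalling itself:

```csharp
// Illustrative diagnostic wrapper that could sit next to Eval in SafeLLamaContextHandle.cs.
public bool EvalGuarded(ReadOnlySpan<int> tokens, int n_past, int n_threads)
{
    var n_ctx = NativeApi.llama_n_ctx(this);
    if (n_past + tokens.Length > n_ctx)
    {
        // Evaluating past the end of the context is undefined behaviour in llama.cpp
        // and can crash the process instead of returning an error code.
        UnityEngine.Debug.LogError($"Eval overflow: n_past={n_past}, batch={tokens.Length}, n_ctx={n_ctx}");
        return false;
    }
    return Eval(tokens, n_past, n_threads);
}
```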

#### Update

I downloaded LLamaSharp again and compiled the example project (LLama.Examples) with the same llama-2-7b-guanaco-qlora.Q2_K.gguf model and it works! So my issue has something to do with my changes or with the C# environment that Unity uses.

AsakusaRinne commented 12 months ago

Sorry for seeing this issue late, and I'm happy that you've got it working. In fact, none of the main contributors of LLamaSharp knows much about Unity, which sometimes confuses us.

In this case, is there anything else you need to do to run LLamaSharp with Unity compared to running it in a .NET Core app? We would appreciate it if you could help with documentation on how to use it in Unity ^_^

martindevans commented 12 months ago

> Unity just crashes. I have narrowed the crash down to

Do you have any more information on what kind of exception is happening? (e.g. is it an AccessViolationException?)

Uralstech commented 12 months ago

> Sorry for seeing this issue late, and I'm happy that you've got it working. In fact, none of the main contributors of LLamaSharp knows much about Unity, which sometimes confuses us.
>
> In this case, is there anything else you need to do to run LLamaSharp with Unity compared to running it in a .NET Core app? We would appreciate it if you could help with documentation on how to use it in Unity ^_^

Sorry for the late reply, but I could not get it working... I have narrowed the error down to the function I specified in the post, though! I also think the issue could be due to the dotnet environment Unity is using, as I tested the examples in Visual Studio and they were working. It may also be due to the changes I made.

> Unity just crashes. I have narrowed the crash down to
>
> Do you have any more information on what kind of exception is happening? (e.g. is it an AccessViolationException?)

I'll check what the exception is and let you know...

Uralstech commented 11 months ago

Here is what I found in the editor logs:

```
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from C:/Users/asus/Downloads/llama-2-7b-guanaco-qlora.Q2_K.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q2_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   55:              blk.6.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   56:              blk.6.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   58:         blk.6.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   61:            blk.6.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   64:              blk.7.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   65:              blk.7.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   67:         blk.7.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   70:            blk.7.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   73:              blk.8.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   74:              blk.8.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   76:         blk.8.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   79:            blk.8.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   82:              blk.9.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   83:              blk.9.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   85:         blk.9.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   88:            blk.9.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   91:             blk.10.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   92:             blk.10.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   94:        blk.10.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   97:           blk.10.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  100:             blk.11.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  101:             blk.11.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  103:        blk.11.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  106:           blk.11.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  109:             blk.12.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  110:             blk.12.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  112:        blk.12.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  114:             blk.12.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  115:           blk.12.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  118:             blk.13.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  119:             blk.13.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  121:        blk.13.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  123:             blk.13.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  124:           blk.13.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  127:             blk.14.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  128:             blk.14.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  130:        blk.14.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  132:             blk.14.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  133:           blk.14.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  136:             blk.15.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  137:             blk.15.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  139:        blk.15.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  145:             blk.16.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  146:             blk.16.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  148:        blk.16.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  151:           blk.16.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  154:             blk.17.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  155:             blk.17.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  157:        blk.17.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  160:           blk.17.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  163:             blk.18.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  164:             blk.18.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  166:        blk.18.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  169:           blk.18.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  172:             blk.19.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  173:             blk.19.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  175:        blk.19.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  178:           blk.19.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  181:             blk.20.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  182:             blk.20.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  184:        blk.20.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  187:           blk.20.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  190:             blk.21.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  191:             blk.21.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  193:        blk.21.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  196:           blk.21.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  199:             blk.22.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  200:             blk.22.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  202:        blk.22.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  208:             blk.23.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  209:             blk.23.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  211:        blk.23.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  214:           blk.23.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  217:             blk.24.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  218:             blk.24.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  220:        blk.24.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  223:           blk.24.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  226:             blk.25.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  227:             blk.25.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  229:        blk.25.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  232:           blk.25.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  235:             blk.26.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  236:             blk.26.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  238:        blk.26.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  241:           blk.26.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  244:             blk.27.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  245:             blk.27.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  247:        blk.27.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  253:             blk.28.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  254:             blk.28.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  256:        blk.28.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  259:           blk.28.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  262:             blk.29.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  263:             blk.29.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  265:        blk.29.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  268:           blk.29.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  271:             blk.30.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  277:           blk.30.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  280:             blk.31.attn_q.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  281:             blk.31.attn_k.weight q2_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_v.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  283:        blk.31.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q3_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q3_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  290:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  18:               general.quantization_version u32     
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q2_K:   65 tensors
llama_model_loader: - type q3_K:  160 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 2048
llm_load_print_meta: n_ctx          = 1024
llm_load_print_meta: n_embd         = 4096
llm_load_print_meta: n_head         = 32
llm_load_print_meta: n_head_kv      = 32
llm_load_print_meta: n_layer        = 32
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 11008
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 7B
llm_load_print_meta: model ftype    = mostly Q2_K
llm_load_print_meta: model size     = 6.74 B
llm_load_print_meta: general.name   = mikael110_llama-2-7b-guanaco-fp16
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: mem required  = 2694.41 MB (+  512.00 MB per state)
.................................................................................................
llama_new_context_with_model: kv self size  =  512.00 MB
llama_new_context_with_model: compute buffer total size =   89.41 MB

=================================================================
    Native Crash Reporting
=================================================================
Got a UNKNOWN while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
    Managed Stacktrace:
=================================================================
      at <unknown> <0xffffffff>
      at LLama.Native.NativeApi:llama_eval_with_pointer <0x000e7>
      at LLama.Native.SafeLLamaContextHandle:Eval <0x0005a>
      at LLama.LLamaContext:Eval <0x00182>
      at LLama.LLamaContext:Eval <0x00132>
      at LLama.LLamaContext:Eval <0x00082>
      at <InferInternal>d__10:MoveNext <0x0010a>
      at System.Runtime.CompilerServices.AsyncTaskMethodBuilder:Start <0x000d2>
      at LLama.InteractiveExecutor:InferInternal <0x000f2>
      at <InferAsync>d__29:MoveNext <0x005ba>
      at System.Runtime.CompilerServices.AsyncMethodBuilderCore:Start <0x00080>
      at <InferAsync>d__29:System.Collections.Generic.IAsyncEnumerator<System.String>.MoveNextAsync <0x00112>
      at Enumerator:MoveNextAsync <0x0005c>
      at <ChatAsyncInternal>d__25:MoveNext <0x003b2>
      at System.Runtime.CompilerServices.AsyncMethodBuilderCore:Start <0x00080>
      at <ChatAsyncInternal>d__25:System.Collections.Generic.IAsyncEnumerator<System.String>.MoveNextAsync <0x00112>
      at <ChatAsync>d__23:MoveNext <0x00597>
      at System.Runtime.CompilerServices.AsyncMethodBuilderCore:Start <0x00080>
      at <ChatAsync>d__23:System.Collections.Generic.IAsyncEnumerator<System.String>.MoveNextAsync <0x00112>
      at <DoTalkWithLlamaCpp>d__23:MoveNext <0x0085c>
      at System.Runtime.CompilerServices.AsyncVoidMethodBuilder:Start <0x000da>
      at GameManager:DoTalkWithLlamaCpp <0x000fa>
      at GameManager:LoadRPMAvatar <0x000a2>
      at GameManager:Start <0x00462>
      at System.Object:runtime_invoke_void__this__ <0x00087>
=================================================================
Received signal SIGSEGV
Obtained 2 stack frames
RtlLookupFunctionEntry returned NULL function. Aborting stack walk.
<Missing stacktrace information>
```

martindevans commented 11 months ago

I've done some googling on this error:

> Got a UNKNOWN while executing native code. This usually indicates a fatal error in the mono runtime or one of the native libraries used by your application.

It seems to indicate something very wrong inside the Mono runtime; all the mentions of it I could find were associated with things like buggy alpha versions of the editor, a corrupt install, etc.

The error further down:

> Received signal SIGSEGV

is just a generic error indicating that something tried to access memory it shouldn't. It's probably a symptom of the first issue.

Uralstech commented 11 months ago

> I've done some googling on this error:
>
> Got a UNKNOWN while executing native code. This usually indicates a fatal error in the mono runtime or one of the native libraries used by your application.
>
> It seems to indicate something very wrong inside the Mono runtime; all the mentions of it I could find were associated with things like buggy alpha versions of the editor, a corrupt install, etc.
>
> The error further down:
>
> Received signal SIGSEGV
>
> is just a generic error indicating that something tried to access memory it shouldn't. It's probably a symptom of the first issue.

I see. If it is a problem with Mono, I can build it with IL2CPP. I'm also updating to the latest version of Unity, so if it was a corrupt install, that should fix it.

Uralstech commented 11 months ago

> I've done some googling on this error:
>
> Got a UNKNOWN while executing native code. This usually indicates a fatal error in the mono runtime or one of the native libraries used by your application.
>
> It seems to indicate something very wrong inside the Mono runtime; all the mentions of it I could find were associated with things like buggy alpha versions of the editor, a corrupt install, etc. The error further down:
>
> Received signal SIGSEGV
>
> is just a generic error indicating that something tried to access memory it shouldn't. It's probably a symptom of the first issue.
>
> I see. If it is a problem with Mono, I can build it with IL2CPP. I'm also updating to the latest version of Unity, so if it was a corrupt install, that should fix it.

It still crashes the editor. I'll try building the project with IL2CPP and see.

eublefar commented 11 months ago

@Uralstech @martindevans @AsakusaRinne

I am having the same issues and I found fixes to some:

#### Unity crashes

> Got a UNKNOWN while executing native code. This usually indicates a fatal error in the mono runtime or one of the native libraries used by your application.

This can be fixed by compiling older llama.cpp revisions that are closer in time to the current LLamaSharp release. I got the same error when compiling master, but no crash when using commit `178b185` (tag `b1187`, "k-quants : fix zero-weight guard in Q6_K (ref #3040)") or other commits from `git log --oneline --since="2023-09-05" --until="2023-09-06"`.

I also noticed that when linking to llama.cpp master, the ModelParams class gets mapped incorrectly (e.g. the context size gets mapped to the CUDA device id in my case).
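
As an illustration of why a version mismatch corrupts the parameters (this is a sketch of the general interop mechanism, not LLamaSharp's or llama.cpp's actual struct definitions): the managed parameter struct is marshalled field-by-field onto the native `llama_context_params`, so if the native side inserts or reorders a field, every following managed field lands on the wrong native one.

```csharp
using System.Runtime.InteropServices;

// Illustrative only: field names and types are simplified placeholders.
[StructLayout(LayoutKind.Sequential)]
struct ManagedParamsV1      // layout the C# binding was built against
{
    public int n_ctx;       // context size
    public int n_batch;
    public int n_gpu_layers;
}

[StructLayout(LayoutKind.Sequential)]
struct NativeParamsV2       // layout a newer llama.cpp actually expects
{
    public int main_gpu;    // new field inserted at the front
    public int n_ctx;
    public int n_batch;
    public int n_gpu_layers;
}

// Passing a ManagedParamsV1 where a NativeParamsV2 is expected makes the native
// code read n_ctx as main_gpu, n_batch as n_ctx, and so on, which matches the
// "context size gets mapped to cuda device id" behaviour described above.
```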

#### Garbled output

I was able to generate correct output by using `context.NativeHandle.Eval` directly and avoiding llama.cpp's internal cache (the `n_past` parameter), so it looks like a problem with the past key-values cache inside llama.cpp (or am I misunderstanding this functionality?).

Posting the MonoBehaviour to test this below. Here are the steps to set up the project (making it detailed for people not familiar with Unity):

* NugetForUnity to install LLamaSharp
* Compiled DLLs from llama.cpp commit 178b185
* Dropped the llama.cpp DLLs directly into the Unity project hierarchy (via the editor)
* Restarted the editor (for some magic reason it needs a restart to function properly)
* Moved the mistral-7b-v0.1.Q4_K_M.gguf model to the Assets/StreamingAssets/llama folder in the project hierarchy
* Created an empty game object, pressed Add Component in the inspector and added the LLamaSharpTestScript from below
* Used the right-click context menu on the component to call the Generate Test function

LLamaSharpTestScript.cs

```csharp
using UnityEngine;
using LLama;
using LLama.Native;
using LLama.Common;
using System.Collections.Generic;
using System;
using System.Runtime.InteropServices;
using System.Linq;
using System.Text;

public class LLamaSharpTestScript : MonoBehaviour
{
    [ContextMenu("Generate Test")]
    public void GenerateTest()
    {
        string modelPath = Application.streamingAssetsPath + "/llama/mistral-7b-v0.1.Q4_K_M.gguf";
        var prompt = "### Human: Hi!\n### Assistant: Hello, how can I help you?\n### Human: say 'this is a test'\n### Assistant:";
        // Load a model
        var parameters = new ModelParams(modelPath)
        {
            ContextSize = 1024,
            Seed = 1337,
            GpuLayerCount = 5,
        };
        var inferenceParams = new InferenceParams()
        {
            TokensKeep = 128,
            MaxTokens = 32,
            TopK = -1,
            Temperature = 0,
        };
        using var model = LLamaWeights.LoadFromFile(parameters);

        var context = new LLamaContext(model, parameters);

        var generatedWithPast = GenerateFromNative(context, prompt, 32, 1, -1, 0.0f, 2);
        var generated = GenerateNoPastNaive(context, prompt, 32, -1, 0.0f, 2);

        context.Dispose();
        var instructExecutor = new StatelessExecutor(model, parameters);
        var sb = new StringBuilder();
        foreach (var tok in instructExecutor.Infer(prompt, inferenceParams: inferenceParams))
        {
            sb.Append(tok);
        }
        Debug.Log($"Generated with past: {generatedWithPast}\n Generated without past: {generated}\n Generated with executor: {sb.ToString()}");

    }

    private string GenerateFromNative(
        LLamaContext context,
        string prompt, int maxNewTokens = 1024,
        int window = 1, int topK = -1,
        float temperature = 1, int eosToken = 2
    )
    {
        var encoded = context.Tokenize(prompt);
        var idsArray = encoded.Select(x => (uint)x).ToArray();
        int n_past = 0;
        var generated = new List<uint>();
        var idsLast = idsArray.Last();
        var idsArrayWithoutLast = idsArray.Take(idsArray.Length - 1).ToArray();
        // Consume prompt tokens
        for (var cur = 0; cur < idsArrayWithoutLast.Length; cur += window)
        {
            var windowIds = idsArrayWithoutLast.Skip(cur).Take(window).ToArray();
            var _ = ComputeLogits(context, windowIds, n_past);
            n_past += windowIds.Length;
        }
        // Generate one-by-one until EOS token
        for (var cur = 0; cur < maxNewTokens; cur++)
        {
            var inpIds = new uint[] { idsLast };
            var logits = ComputeLogits(context, inpIds, n_past);
            var nextToken = Sample(logits, topK, temperature);
            if (nextToken == eosToken)
            {
                break;
            }
            generated.Add(nextToken);
            n_past += 1;
        }
        return context.DeTokenize(generated.Select(x => (int)x).ToArray());
    }

    private string GenerateNoPastNaive(
        LLamaContext context,
        string prompt, int maxNewTokens = 1024,
        int topK = -1, float temperature = 0,
        int eosToken = 2
    )
    {
        var encoded = context.Tokenize(prompt);
        var idsArray = encoded.Select(x => (uint)x).ToArray();
        var generated = new List<uint>();
        for (int cur = 0; cur < maxNewTokens; cur++)
        {
            var logits = ComputeLogits(context, idsArray);
            var nextToken = Sample(logits, topK, temperature);
            if (nextToken == eosToken)
            {
                break;
            }
            generated.Add(nextToken);
            idsArray = idsArray.Append(nextToken).ToArray();
        }
        return context.DeTokenize(generated.Select(x => (int)x).ToArray());
    }

    private uint Sample(float[] logits, int topK = -1, float temperature = 1)
    {
        var probs = Softmax(logits);
        var topKProbs = probs.Select((x, i) => new { x, i }).OrderByDescending(x => x.x).Take(topK > 0 ? topK : probs.Length);
        if (temperature == 0)
        {
            return (uint)topKProbs.First().i;
        }
        var topKProbsArray = topKProbs.Select(x => x.x).ToArray();
        var topKProbsSum = topKProbsArray.Sum();
        var topKProbsNormalized = topKProbsArray.Select(x => x / topKProbsSum).ToArray();
        var topKProbsCumSum = topKProbsNormalized.Select((x, i) => topKProbsNormalized.Take(i + 1).Sum()).ToArray();
        var random = UnityEngine.Random.value;
        var index = Array.FindIndex(topKProbsCumSum, x => x > random);
        return (uint)topKProbs.ElementAt(index).i;
    }

    private float[] ComputeLogits(LLamaContext context, uint[] idsArray, int n_past = 0)
    {
        var ids = MemoryMarshal.Cast<uint, int>(new ReadOnlySpan<uint>(idsArray));
        var ok = context.NativeHandle.Eval(ids, n_past, 8);
        if (ok)
        {
            var logits = context.NativeHandle.GetLogits();
            var logitsArray = logits.ToArray();
            return logitsArray;
        }
        throw new Exception("Eval failed");
    }

    private float[] Softmax(float[] logits)
    {
        var max = logits.Max();
        var exps = logits.Select(x => Math.Exp(x - max));
        var sum = exps.Sum();
        var softmax = exps.Select(x => (float)(x / sum));
        return softmax.ToArray();
    }
}
```

Output is:

```
Generated with past:  This This This This::::::::::::::::::::::::::::
 Generated without past:  This is a test
### Human: say 'this is a test'
### Assistant: This is a test
### Human: say 'this is a
 Generated with executor: 
###

::

================

================================================================================================================================================================joinjoinjoinjoinjoin
```

Am I using n_past correctly? Any ideas why it could be happening?
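
One thing worth checking in `GenerateFromNative` above: the generation loop evaluates `idsLast` (the final prompt token) on every step and never replaces it with the newly sampled token, which by itself would produce repetitive output like `This This This` when the cache is used. A minimal sketch of the adjusted loop, reusing the names from the script above:

```csharp
// Feed the newly sampled token back in on the next step instead of
// re-evaluating the last prompt token every time.
for (var cur = 0; cur < maxNewTokens; cur++)
{
    var logits = ComputeLogits(context, new uint[] { idsLast }, n_past);
    var nextToken = Sample(logits, topK, temperature);
    if (nextToken == eosToken)
    {
        break;
    }

    generated.Add(nextToken);
    idsLast = nextToken; // the original loop never updates idsLast
    n_past += 1;
}
```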

martindevans commented 11 months ago

> This can be fixed by compiling older llama.cpp revisions that are closer in time to the current LLamaSharp release.
>
> I also noticed that when linking to llama.cpp master, the ModelParams class gets mapped incorrectly

Each version of LLamaSharp works with exactly one version of llama.cpp. The llama.cpp API isn't stable, so there's absolutely no flexibility in using other versions without reviewing and fixing the corresponding C# code! If you're not using the correct llama.cpp version, that can cause all kinds of weird behaviour (or, if you're lucky, a hard crash).

eublefar commented 11 months ago

> This can be fixed by compiling older llama.cpp revisions that are closer in time to the current LLamaSharp release. I also noticed that when linking to llama.cpp master, the ModelParams class gets mapped incorrectly
>
> Each version of LLamaSharp works with exactly one version of llama.cpp. The llama.cpp API isn't stable, so there's absolutely no flexibility in using other versions without reviewing and fixing the corresponding C# code! If you're not using the correct llama.cpp version, that can cause all kinds of weird behaviour (or, if you're lucky, a hard crash).

Where can I find the exact version from the release?

martindevans commented 11 months ago

The readme has commit hashes in the Installation section for each released version. Those are the commits in the llama.cpp repo.

If you're using an unreleased version you should look back in history to the last time the DLL files were changed; that commit should mention the commit hash (if not, have a look at the corresponding PR notes).

eublefar commented 11 months ago

> The readme has commit hashes in the Installation section for each released version. Those are the commits in the llama.cpp repo.
>
> If you're using an unreleased version you should look back in history to the last time the DLL files were changed; that commit should mention the commit hash (if not, have a look at the corresponding PR notes).

I will try running this with the correct version tomorrow.

Also, on a side note, the behavior with the official LLamaSharp.Backend.CPU is identical.

eublefar commented 11 months ago

The issue persists on v0.6.0 with the llama.cpp commit from the readme :( The output is the same as above.

Uralstech commented 11 months ago

@Uralstech @martindevans @AsakusaRinne

I am having the same issues and I found fixes to some:

Unity crashes

Got a UNKNOWN while executing native code. This usually indicates a fatal error in the mono runtime or one of the native libraries used by your application.

This can be fixed by compiling older llama.cpp revisions that are closer in time to current LLamaSharp release. I got the same error when compiling master, but no crash when using 178b185 (tag: b1187) k-quants : fix zero-weight guard in Q6_K (ref #3040) or other commits from git log --oneline --since="2023-09-05" --until="2023-09-06"

I also noticed that when linking to llama.cpp master, ModelParams class gets mapped incorrectly (e.g. context size gets mapped to cuda device id in my case).

Garbled output

I was able to generate correct output by using context.NativeHandle.Eval directly and avoiding llama.cpp internal cache (n_past parameter), so it looks like a problem with past key values cache inside the llama.cpp (or am I misunderstanding this functionality?).

Posting the MonoBehaviour to test this below. Here are the steps to setup the project (making it detailed for people not familiar with Unity):

  • NugetForUnity to install LLamaSharp
  • compiled DLLs from llama.cpp commit 178b185
  • Dropped llama.cpp DLLs directly into unity project hierarchy (via editor)
  • Restarted the editor (for magic some reason it needs restart to function properly).
  • Moved mistral-7b-v0.1.Q4_K_M.gguf model to Assets/StreamingAssets/llama folder in the project hierarchy
  • Created empty game object, pressed Add Component in inspector and added the LLamaSharpTestScript from below
  • Right click menu context menu on the component to call Generate Test function.

LLamaSharpTestScript.cs

using UnityEngine;
using LLama;
using LLama.Native;
using LLama.Common;
using System.Collections.Generic;
using System;
using System.Runtime.InteropServices;
using System.Linq;
using System.Text;

public class LLamaSharpTestScript : MonoBehaviour
{
    [ContextMenu("Generate Test")]
    public void GenerateTest()
    {
        string modelPath = Application.streamingAssetsPath + "/llama/mistral-7b-v0.1.Q4_K_M.gguf";
        var prompt = "### Human: Hi!\n### Assistant: Hello, how can I help you?\n### Human: say 'this is a test'\n### Assistant:";
        // Load a model
        var parameters = new ModelParams(modelPath)
        {
            ContextSize = 1024,
            Seed = 1337,
            GpuLayerCount = 5,
        };
        var inferenceParams = new InferenceParams()
        {
            TokensKeep = 128,
            MaxTokens = 32,
            TopK = -1,
            Temperature = 0,
        };
        using var model = LLamaWeights.LoadFromFile(parameters);

        var context = new LLamaContext(model, parameters);

        var generatedWithPast = GenerateFromNative(context, prompt, 32, 1, -1, 0.0f, 2);
        var generated = GenerateNoPastNaive(context, prompt, 32, -1, 0.0f, 2);

        context.Dispose();
        var instructExecutor = new StatelessExecutor(model, parameters);
        var sb = new StringBuilder();
        foreach (var tok in instructExecutor.Infer(prompt, inferenceParams: inferenceParams))
        {
            sb.Append(tok);
        }
        Debug.Log($"Generated with past: {generatedWithPast}\n Generated without past: {generated}\n Generated with executor: {sb.ToString()}");

    }

    private string GenerateFromNative(
        LLamaContext context,
        string prompt, int maxNewTokens = 1024,
        int window = 1, int topK = -1,
        float temperature = 1, int eosToken = 2
    )
    {
        var encoded = context.Tokenize(prompt);
        var idsArray = encoded.Select(x => (uint)x).ToArray();
        int n_past = 0;
        var generated = new List<uint>();
        var idsLast = idsArray.Last();
        var idsArrayWithoutLast = idsArray.Take(idsArray.Length - 1).ToArray();
        // Consume prompt tokens
        for (var cur = 0; cur < idsArrayWithoutLast.Length; cur += window)
        {
            var windowIds = idsArrayWithoutLast.Skip(cur).Take(window).ToArray();
            var _ = ComputeLogits(context, windowIds, n_past);
            n_past += windowIds.Length;
        }
        // Generate one-by-one until EOS token
        for (var cur = 0; cur < maxNewTokens; cur++)
        {
            var inpIds = new uint[] { idsLast };
            var logits = ComputeLogits(context, inpIds, n_past);
            var nextToken = Sample(logits, topK, temperature);
            if (nextToken == eosToken)
            {
                break;
            }
            generated.Add(nextToken);
            n_past += 1;
        }
        return context.DeTokenize(generated.Select(x => (int)x).ToArray());
    }

    private string GenerateNoPastNaive(
        LLamaContext context,
        string prompt, int maxNewTokens = 1024,
        int topK = -1, float temperature = 0,
        int eosToken = 2
    )
    {
        var encoded = context.Tokenize(prompt);
        var idsArray = encoded.Select(x => (uint)x).ToArray();
        var generated = new List<uint>();
        for (int cur = 0; cur < maxNewTokens; cur++)
        {
            var logits = ComputeLogits(context, idsArray);
            var nextToken = Sample(logits, topK, temperature);
            if (nextToken == eosToken)
            {
                break;
            }
            generated.Add(nextToken);
            idsArray = idsArray.Append(nextToken).ToArray();
        }
        return context.DeTokenize(generated.Select(x => (int)x).ToArray());
    }

    private uint Sample(float[] logits, int topK = -1, float temperature = 1)
    {
        var probs = Softmax(logits);
        var topKProbs = probs.Select((x, i) => new { x, i }).OrderByDescending(x => x.x).Take(topK > 0 ? topK : probs.Length);
        if (temperature == 0)
        {
            return (uint)topKProbs.First().i;
        }
        var topKProbsArray = topKProbs.Select(x => x.x).ToArray();
        var topKProbsSum = topKProbsArray.Sum();
        var topKProbsNormalized = topKProbsArray.Select(x => x / topKProbsSum).ToArray();
        var topKProbsCumSum = topKProbsNormalized.Select((x, i) => topKProbsNormalized.Take(i + 1).Sum()).ToArray();
        var random = UnityEngine.Random.value;
        var index = Array.FindIndex(topKProbsCumSum, x => x > random);
        return (uint)topKProbs.ElementAt(index).i;
    }

    private float[] ComputeLogits(LLamaContext context, uint[] idsArray, int n_past = 0)
    {
        var ids = MemoryMarshal.Cast<uint, int>(new ReadOnlySpan<uint>(idsArray));
        var ok = context.NativeHandle.Eval(ids, n_past, 8);
        if (ok)
        {
            var logits = context.NativeHandle.GetLogits();
            var logitsArray = logits.ToArray();
            return logitsArray;
        }
        throw new Exception("Eval failed");
    }

    private float[] Softmax(float[] logits)
    {
        var max = logits.Max();
        var exps = logits.Select(x => Math.Exp(x - max));
        var sum = exps.Sum();
        var softmax = exps.Select(x => (float)(x / sum));
        return softmax.ToArray();
    }
}

Output is:

Generated with past:  This This This This::::::::::::::::::::::::::::
 Generated without past:  This is a test
### Human: say 'this is a test'
### Assistant: This is a test
### Human: say 'this is a
 Generated with executor: 
###

::

================

================================================================================================================================================================joinjoinjoinjoinjoin

Am I using n_past correctly? Any ideas why this could be happening?

Uralstech commented 11 months ago

Hello! I just tried your script in Unity, without changing anything, and it gives proper replies now! Thanks!

But when I increase MaxTokens in inferenceParams, the editor still crashes. The crash logs are similar to the previous ones.

eublefar commented 11 months ago

Wait, even when using GenerateFromNative? (Line prefixed with Generated with past: in the debug console)

GenerateNoPastNaive (Generated without past: prefix in the debug console) is very slow because it re-computes all the tokens for every new token.

Uralstech commented 11 months ago

Wait, even when using GenerateFromNative? (Line prefixed with Generated with past: in the debug console)

GenerateNoPastNaive (Generated without past: prefix in the debug console) is very slow because it re-computes all the tokens for every new token.

GenerateFromNative does not work, "GenerateNoPastNaive" does. Ignoring the slow speed, it is still an improvement.

eublefar commented 11 months ago

I am sorry, I am being dumb. Everything works on v0.6.0. It's just that my script was buggy: I was using the native handle instead of LLamaContext.Eval, which returns the correct number of n_past tokens to use.

@Uralstech, here is the updated MonoBehaviour that works (ignore the ModelParams, they are probably wrong).

using UnityEngine;
using System.Collections.Generic;
using System;
using System.Linq;
using LLama;
using LLama.Native;
using LLama.Common;

public class LLamaSharpTestScript : MonoBehaviour
{

    [ContextMenu("Generate Test")]
    public void GenerateTest()
    {
        string modelPath = Application.streamingAssetsPath + "/llama/mistral-7b-v0.1.Q4_K_M.gguf";
        var prompt = "### Human: Hi!\n### Assistant: Hello, how can I help you?\n### Human: say 'this is a test'\n### Assistant:";
        // Load a model
        var parameters = new ModelParams(modelPath)
        {
            ContextSize = 32768,
            Seed = 1337,
            GpuLayerCount = 16,
            BatchSize = 128
        };
        using var model = LLamaWeights.LoadFromFile(parameters);
        var context = new LLamaContext(model, parameters);

        var generatedWithPast = GenerateFromNative(context, prompt, 32, -1, 0.0f);
        // var generated = GenerateNoPastNaive(context, prompt, 32, -1, 0.0f, 2);

        Debug.Log($"Generated with past: {generatedWithPast}");
    }

    private string GenerateFromNative(
        LLamaContext context,
        string prompt, int maxNewTokens = 1024, int topK = -1,
        float temperature = 1
    )
    {
        var idsArray = context.Tokenize(prompt);
        int n_past = 0;
        float[] logits;

        var generated = new List<int>();
        var idsLast = idsArray.Last();

        var idsArrayWithoutLast = idsArray.Take(idsArray.Length - 1).ToArray();

        (logits, n_past) = ComputeLogits(context, idsArrayWithoutLast, n_past);

        // Generate one-by-one until EOS token
        for (var cur = 0; cur < maxNewTokens; cur++)
        {
            var inpIds = new int[] { idsLast };
            (logits, n_past) = ComputeLogits(context, inpIds, n_past);
            var nextToken = Sample(logits, topK, temperature);
            if (nextToken == NativeApi.llama_token_eos(context.NativeHandle))
            {
                break;
            }
            idsLast = nextToken;
            generated.Add(nextToken);
        }
        return context.DeTokenize(generated.Select(x => (int)x).ToArray());
    }

    private int Sample(float[] logits, int topK = -1, float temperature = 1)
    {
        var probs = Softmax(logits);
        var topKProbs = probs.Select((x, i) => new { x, i }).OrderByDescending(x => x.x).Take(topK > 0 ? topK : probs.Length);
        if (temperature == 0)
        {
            return topKProbs.First().i;
        }
        var topKProbsArray = topKProbs.Select(x => x.x).ToArray();
        var topKProbsSum = topKProbsArray.Sum();
        var topKProbsNormalized = topKProbsArray.Select(x => x / topKProbsSum).ToArray();
        var topKProbsCumSum = topKProbsNormalized.Select((x, i) => topKProbsNormalized.Take(i + 1).Sum()).ToArray();
        var random = UnityEngine.Random.value;
        var index = Array.FindIndex(topKProbsCumSum, x => x > random);
        return topKProbs.ElementAt(index).i;
    }

    private (float[] logits, int past_tokens) ComputeLogits(LLamaContext context, int[] idsArray, int n_past = 0)
    {
        var newPastTokens = context.Eval(idsArray, n_past);

        var logits = context.NativeHandle.GetLogits();
        var logitsArray = logits.ToArray();
        return (logitsArray, newPastTokens);
    }

    private float[] Softmax(float[] logits)
    {
        var max = logits.Max();
        var exps = logits.Select(x => Math.Exp(x - max));
        var sum = exps.Sum();
        var softmax = exps.Select(x => (float)(x / sum));
        return softmax.ToArray();
    }
}

I will double check with LLamaSharp executors later.

Thanks for the help!

AsakusaRinne commented 11 months ago

Just a kind request: is anyone here willing to write a short blog/document about using LLamaSharp in Unity? We have little experience with Unity and are not fully aware of the gap between Unity and the dotnet core runtime, so there are no documents about deploying on Unity yet. However, according to the issues, Unity is one of the most important application areas. I'd appreciate it if anyone could help. :)

Xsanf commented 11 months ago

Hello, I check in periodically on what is going on here and thought I would get involved later, once your version changes less often.

I have a number of questions. None of my reasoning concerns the project itself, only its use in Unity.

  1. Your version can be divided into several parts. The first and most valuable for Unity is the set of native functions that give access to the llama.cpp backend. This is the part that is most interesting for Unity: it allows you to implement any usage strategy the user needs.

  2. A system of asynchronous functions for implementing various dialogue tasks, built on text templates and saved history. This part is simply not a natural fit for Unity, since it bypasses the traditional Unity pattern of components driven by coroutines (IEnumerator, yield). It's actually easier to build these directly in Unity than to use yours.

  3. Using the semantic kernel. I understand that adopting a ready-made concept gives you a ton of examples and tutorials that will be created by the community. This is a huge benefit and extremely valuable work. But it is a Microsoft-controlled system, which can create problems. Considering that all that is really needed from it is a semantic hash (embedding) and a semantic-proximity comparison function, having those you can very simply build a vector database on any available basis (for example, the LiteSQL available in Unity). Semantic functions are the same text substitution wrapped in asynchronous output. I haven't looked closely yet, but in your SemanticKernelMemory example you are using the same llama model as the source of the semantic hash.

         var parameters = new ModelParams(modelPath)
         {
             EmbeddingMode = true
         };
    
         using var model = LLamaWeights.LoadFromFile(parameters);
         var embedding = new LLamaEmbedder(model, parameters);

As I understand it, this is independent of Microsoft and can be used as a source of the semantic hash on its own?

Am I understanding this correctly?

If so, then by the nature of the embedding vector, the relevance function will be the cosine distance between the vectors. And if that is the case, then it is better not to use the semantic kernel: it will only drag in a bunch of DLLs and turn any debugging into a nightmare. It is worth considering that reading files of different formats, such as PDF, is already possible in Unity.
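To make this concrete, here is a minimal sketch of that relevance function - plain cosine similarity over the embedding vectors. The GetEmbeddings call in the usage comment is only my assumption about the LLamaEmbedder API, so treat it as a sketch rather than working code:

```csharp
using System;

public static class EmbeddingMath
{
    // Cosine similarity between two embedding vectors: 1.0 means same direction,
    // 0.0 means orthogonal (unrelated), negative means opposed.
    public static float CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embedding vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB) + 1e-12));
    }
}

// Hypothetical usage, assuming LLamaEmbedder exposes a GetEmbeddings(string) -> float[] method:
// float[] v1 = embedding.GetEmbeddings("the player opened the door");
// float[] v2 = embedding.GetEmbeddings("the door was opened by the player");
// float relevance = EmbeddingMath.CosineSimilarity(v1, v2);
```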

I wrote all this to make it clearer to you that what you are doing is best suited to creating web servers that solve specific AI tasks.

What is needed for Unity is a set of basic, non-asynchronous functions that make it possible to build your own components and pipelines within Unity, while using resources as efficiently as possible.

For example, this question is still unclear to me: can I use the same model in both embedding and inference mode without reloading it? Is it possible to change the operating mode without reloading? Otherwise, you need two loaded models, which puts a greater load on resources. If you noticed, Microsoft's Semantic Kernel, like LangChain, often uses text2vec models for embedding, which are much more compact and faster (not better) than llama. The largest is about 1.2 GB.

All this leads me to think that it might be better to have some kind of simplified branch of your project for Unity. The full project simply includes parts that are unnecessary there and will only create additional problems. Most often in Unity the tasks will be creating game logic and content, which still cannot be executed asynchronously in the true sense, since they form a dependent series of requests. There is asynchrony, but it's just processing that doesn't block the Windows message loop.

The whole point is to use the models natively in Unity, not on a server: there is no point in a local server, and accessing external services creates problems with reliability or payment. The goal is to get a working mechanism on small models.

Although I understand the theory, I have not studied the current implementations in detail, so my reasoning may be completely wrong!

What do you think about it?

AsakusaRinne commented 11 months ago

@Xsanf Hey, thank you very much for taking the time out of caring for the children to provide these suggestions :)

Your version can be divided into several parts. The first and most valuable for Unity are the native functions that give access to the llamacpp backend. It is this part that is most interesting for Unity. It allows you to implement any usage strategy the user needs.

When starting this project, what I pursued was letting other C# developers deploy LLMs with less code, which I think attracted many developers without much knowledge of LLaMA's internal mechanisms. However, as you said, it sacrificed some flexibility, even though the native APIs are all public now. There are two questions I want to ask about further:

  1. Do wrappers such as ChatSession and LLamaExecutor make it more difficult to use LLamaSharp in Unity (even just to make it work)? If so, could you please explain further how this differs from common dotnet runtime apps?
  2. Let's assume there were already a package named LLamaSharp.Native, containing only the APIs ported from C++. Would Unity developers prefer building a text2text function from scratch over using the high-level APIs in the current LLamaSharp package?

A system of asynchronous functions for implementing various dialogue tasks, built on the construction of text templates and saving history. This part is simply not suitable for Unity, since it bypasses the traditional Unity implementation in components through coroutines (IEnumerator, yield). It's actually easier to make them directly in Unity than to use yours.

It's a good idea. I didn't realize the conflict with Unity's IEnumerator/yield conventions. And I think this tip partially answers the second question above. Taking this into account, maybe we should keep a simplified version of the text2text and text2vec APIs.

If this is so, then by the nature of the embedding vector, the relevance function will be the cosine distance between the vectors. If this is the case, then it is better not to use the semantic kernel. This kernel will only drag in a bunch of dlls and turn any debugging into a nightmare.

The semantic-kernel integration is provided via an extension package named LLamaSharp.semantic-kernel. I'm a little confused, because if it's not desired, users can simply not install it. When it comes to vector search over documents, yes, there are alternatives to semantic-kernel.

What is needed for Unity is a set of basic, non-asynchronous functions that provide the ability to build your own components and pipelines within Unity. At the same time, using resources as efficiently as possible. The whole point is to use them natively in Unity, and not on the server. Since there is no point in a local server, and accessing external services will cause problems with their reliability or with their payment. The goal is to get a working mechanism on small models.

I think this is the key point for us to better support Unity. Unity support actually has two parts: A) making users able to run LLamaSharp in Unity, and B) making it easier for users to build applications in Unity.

Do you think the current release mode (main package + backend package + extension package) makes it difficult for Unity users on side A? I think it will affect the way we further support Unity in the future. If the answer is yes, adding an extra package will be better; otherwise adding a set of APIs to the current package is okay.

For example, the question is still unclear to me. Can I use the same model in both embedding and inference mode without rebooting? Is it possible to change the operating mode without rebooting?

I think using one model for both embedding and inference mode is supported now because of the introduction of LLamaWeights. @martindevans Could you please confirm whether I am right or wrong? If "changing the operating mode" refers to switching between instruct mode and chat mode, the answer is yes.

Thanks again for your suggestions. I hope applications in Unity can be better supported in the future. :)

martindevans commented 11 months ago

Could you please confirm whether I was right or wrong?

I haven't tried this but yes I think you should be able to load one set of weights and use it for everything.

Your version can be divided into several parts. The first and most valuable for Unity are the native functions that give access to the llamacpp backend.

LLamaSharp is basically split into multiple layers at the moment. From your description it sounds like you want to use the first 2 layers.

Lowest Level

NativeApi contains methods which are directly ported across from C++.

Resources which need to be freed (e.g. a context) are wrapped in a handle (e.g. SafeLLamaContextHandle). The handle provides an extremely thin wrapper over the native methods; for example, instead of NativeApi.llama_get_logits(SafeLLamaContextHandle) you can use SafeLLamaContextHandle.GetLogits.

Middle Level

This level provides more idiomatic C# objects which represent the low-level llama.cpp capabilities, e.g. LLamaContext is a higher level (and safer to use) wrapper around SafeLLamaContextHandle (and LLamaWeights is the same for SafeLLamaModelHandle). The intention here is that all of the higher level stuff can be built with this layer.
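As a rough, uncompiled sketch of how the first two layers fit together (the model path and parameter values are placeholders, and it only reuses calls already shown in the scripts earlier in this thread):

```csharp
using LLama;
using LLama.Common;

class HandleLayerSketch
{
    static void Main()
    {
        // Middle level: idiomatic C# objects over the native handles.
        var parameters = new ModelParams("path/to/model.gguf") { ContextSize = 512 };
        using var weights = LLamaWeights.LoadFromFile(parameters);
        using var context = new LLamaContext(weights, parameters);

        // Lowest level: the SafeLLamaContextHandle wraps the raw llama.cpp context
        // and frees it automatically when disposed.
        var tokens = context.Tokenize("Hello");
        context.NativeHandle.Eval(tokens, 0, 4);

        // Thin wrapper over NativeApi.llama_get_logits - no extra copies or logic.
        var logits = context.NativeHandle.GetLogits();
        System.Console.WriteLine($"Logits length (vocab size): {logits.Length}");
    }
}
```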

Top Level

Executors, text transforms, history transforms etc all exist at the highest level. They exist to make it as easy as possible to assemble a system that responds to text prompts without the user having to understand all of the deeper llama.cpp implementation details.

Async...

We recently decided to make all the high level executors async. That allows any IO that's needed, such as loading/saving state, to be async. Possibly more importantly it means that the executor can yield while the LLM is being evaluated, instead of blocking (which would be disastrous in Unity, causing dropped frames). As far as I'm aware it should all work inside Unity (although not with the native Unity Job system; you'd have to build your own executor using the lower layer components for that).
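As a rough illustration of what that looks like from a MonoBehaviour - assuming the executor exposes an async streaming method like InferAsync returning IAsyncEnumerable<string> (check the exact name and signature for your version) - something like this should avoid blocking the main thread, because await foreach yields back between tokens:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using UnityEngine;
using LLama;
using LLama.Common;

public class AsyncChatBehaviour : MonoBehaviour
{
    private LLamaWeights _weights;
    private StatelessExecutor _executor;

    void Start()
    {
        // Placeholder path; load once and reuse the weights.
        var parameters = new ModelParams("path/to/model.gguf") { ContextSize = 1024 };
        _weights = LLamaWeights.LoadFromFile(parameters);
        _executor = new StatelessExecutor(_weights, parameters);

        // Fire-and-forget: Unity keeps rendering frames while tokens arrive.
        _ = RunPromptAsync("### Human: Hi!\n### Assistant:");
    }

    private async Task RunPromptAsync(string prompt)
    {
        var inferenceParams = new InferenceParams
        {
            MaxTokens = 64,
            AntiPrompts = new List<string> { "### Human:" }
        };

        // Assumed API: InferAsync streams tokens as they are produced instead of blocking.
        await foreach (var token in _executor.InferAsync(prompt, inferenceParams))
        {
            Debug.Log(token); // e.g. append to a UI Text component here instead
        }
    }
}
```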

Xsanf commented 11 months ago

@AsakusaRinne and @martindevans Thank you very much for the quick response. I won't lie, it's very nice.

Let me clarify again. I’m not talking about the LLamaSharp project, I’m talking specifically about the specifics of Unity. And once again I will confirm my opinion, you are doing everything right within the framework of LLamaSharp. Your entire approach is appropriate to the task. Everything is separated and can be used separately for the user's specific purposes.

I'll try to answer consistently.

Do the wrappers such as ChatSession and LLamaExecutor make it more difficult to use LLamaSharp in unity (even if just make it work)? If so, could you please further explain the difference with common dotnet runtime apps here?

No, the presence of asynchronous functions does not interfere with development. If you remember in my example (based on version 0.3), I was using your asynchronous function, streaming output via yield

IEnumerator DoModel()
{
    var outputs = _session.Chat(question, encoding: "UTF-8");
    string tmp = ChatDisplayOutput.text;
    buff = string.Empty;
    foreach (var output in outputs)
    {
        buff += output;
        ChatDisplayOutput.text = tmp + buff;
        yield return new WaitForSeconds(.1f);
    }
}

The issue is that this is only useful for chat mode. In Unity this will most often be game logic, where streaming output is useless, so it will be a non-asynchronous call placed on a separate thread. Most often this requires forming the context from different sources: information about characters and objects in the scene, the history of interactions, etc. You need to collect context from different sources in the game while minimizing its size. This applies not only to games, but also to any intelligent agents. But this is not the task of your project, since it strongly depends on what kind of architecture the user creates. Undoubtedly, though, the task of implementing the vector representation of texts as associative memory will at least partially rely on your native API.

Let's assume that there's already a package named LLamaSharp.Native, which contains only the APIs ported from c++, would unity developers prefer building an text2text function from scratch to using high-level APIs in current LLamaSharp package?

The question is not quite the right one; preferences will differ. While I prefer to build the basic mechanics myself, since I understand well what I am doing, someone else will prefer to use the Semantic Kernel from Microsoft, because it is easier to find usage examples for it. Actually, it would be nice to have a simple example of calling the model to obtain an embedding vector, plus functions for calculating the proximity of vectors, that do not rely on the semantic kernel. And an example of using a single load of the model weights in both vector-calculation mode and inference mode, without a second copy. That would close the issue completely, by allowing users to choose what is preferable to them. The main thing is that it remains possible to implement such functions through the basic API.

Do you think the current release mode (main package + backend package + extension package) make it difficult for unity users on the side of A? I think it will affect the way for us to further support unity in the future. If the answer is yes, I think adding an extra package will be better, otherwise adding a set of APIs in the current package is okay.

Once again, you are fine. You allow the user not to use what they consider unnecessary, and in Unity itself this can be used. The only question is that most often this will be needed only to implement chat, and that is not the main task in Unity. Therefore, if problems with porting to Unity arise with precisely such parts, it would probably be more reasonable to ignore them and simply not port them, and to have a separate shortened version for Unity which simply won't contain the problematic but optional parts.

In Unity, the task is game logic (agent logic). And this is inference, embedding, vector comparison. The implementation of semantic functions through text substitutions is generally best left entirely to the will of the Unity user, since only he understands exactly how he wants to form the context.

I think one model for both embedding and inference mode is supported now because of the introduce of LLamaWeight. @martindevans Could you please confirm whether I was right or wrong? If change the operating mode refers to switch between instruct mode and chat mode, the answer is yes.

I haven't checked it yet, but judging by the description, this is what I need. I will definitely try the 0.6 version in Unity later. As I wrote, I have my grandchildren with me)) Right now one is sitting here with the sniffles, and tomorrow they will bring another one))

@martindevans

LLamaSharp is basically split into multiple layers at the moment. From your description it sounds like you want to use the first 2 layers.

Yes, you are absolutely right. I use both the native API itself and the abstractions over the context and parameters. It is very comfortable. I used your top-level abstractions as a simple chat example and it all works (I hope this hasn't changed in 0.6). So this does not interfere with Unity itself; it's just that in some cases it is redundant.

We recently decided to make all the high level executors async. That allows any IO that's needed, such as loading.saving state, to be async. Possibly more importantly it means that the executor can yield while the LLM is being evaluated, instead of blocking (which would be disastrous in Unity, causing dropped frames). As far as I'm aware it should all work inside Unity (although not with the native Unity Job system, you'd have to build your own executor using the lower layer components for that).

Once again, your decisions are absolutely clear and justified for LLamaSharp. Unity has its own methods for non-blocking execution, but your implementation does not interfere with this in any way.

The main thing is to ensure the availability, at a low level, of three capabilities - inference, embedding (semantic hash) and calculation of semantic distance. This is absolutely enough to implement any processing for intelligent agents. Everything else is an undoubted convenience, but like all conveniences, they are also limitations; therefore, only the user can decide what is more convenient. It is clear that there is a huge benefit from being able to quantize and use LoRA, if and when LoRA training becomes available. All this is priceless, but these are already tools for developing the models themselves. In game logic, three requirements remain unchanged - inference, embedding, semantic distance.

Somehow I forgot to mention, behind all these discussions: although coroutines are more common in Unity, async/await and the Task and Task<TResult> types have been supported since Unity 2017. So, in principle, there should be no problems in using asynchronous functions.


I don’t know if this is appropriate here, but I haven’t had anyone to discuss this for quite some time)). I’ll just express my opinion, formed by my 66 years, most of which I worked in one way or another on algorithms close to the topic of AI. The brain is a parallel correlator of signals in the window of events, displaying, with its characteristic distortions, the world here and now, in the form of entities (concepts) identified by it, presented through the strength of connections between elements. It does not have individual thoughts because it does not have a serialized thread. In the process of evolution, for the exchange of signals between school animals, he created a mechanism for serializing his parallel state into a stream of symbols, based on attention, highlighting part of the general state. As it evolved, this allowed speech and serialized thinking to emerge. In linguistics there is the concept of Concept and Detonation. Detonat is a label attached to the concept. A concept is a group of states that conveys its semantics through connections with other concepts (semantic network). Personality is a serializer that can express, as a sequence of detonations, a parallel representation of concepts highlighted by attention. The personality does not make decisions, it only bears witness to them, translating a deep parallel structure into a flow of Detonations (thoughts). Language is a serialized stream of Detonations created by a person based on attention and a system of Concepts that form a semantic network. During the learning process, LLM analyzes the flow of Detonators (texts), calculating connectivity in the context window. This allows her to recreate the system of Concepts(In hidden layers) that gave rise to this flow of Detonations. Those. deserialize the identity. Naturally, this is not a 100% recovery, but only an approximation. But LLM is a personality model with all archetypes extracted from texts. Not a specific person, but rather some generalized model(based on all texts). Without mechanisms of goal setting or personality accentuation, only a system of Concepts. Goal setting and an accentuated personality are created by the current context and attention. Because context and attention are what connect well during LLM or LoRA training.

This is why it is so important to take into account that English is a projective language, where prediction of the next token relies entirely on the left context. This is the GPT model. A huge number of languages are non-projective, and GPT cannot take into account the influence of the right context. Bidirectional BERT is inconvenient for generation, and so far the most likely candidate for a generative model that takes two-way context into account is T5. So pay attention to T5 whenever possible, especially if your language is non-projective. Non-projective languages also contain a projective part, so GPT will extract something from them, but T5 will extract more concepts.

The first successful attempt to create non-life (imitation), which in principle is difficult to distinguish from life.

AsakusaRinne commented 11 months ago

Actually, it would be nice to have a simple example of using a model call in the mode of obtaining an embedding vector and functions for calculating the proximity of vectors that do not rely on the semantic kernel. And an example of using one load of model weights in vector calculation mode and inference mode, without a second copy. This will close the issue completely. By allowing the user to choose what is preferable to him. The main thing is that it remains possible to implement such functions through the basic API

StatelessExecutor now supports generating a response just once; please refer to this example. Though the example is written as a chat, the different loop iterations are independent, so users can treat it as a one-time job, with the history recorded by themselves (for example, a dialogue in a game). As for using one load of the model weights in both vector-calculation mode and inference mode, I think it's okay with the introduction of LLamaWeights, but I need to confirm it further. I have to admit that, after a check, there are too few such examples; I'll add more. Thank you a lot for your suggestion.
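For example, here is a rough sketch of what I mean by recording the history yourself and treating each call as a one-time job. The prompt format below is only an arbitrary assumption, not something LLamaSharp requires:

```csharp
using System.Collections.Generic;
using System.Text;
using LLama;
using LLama.Common;

public class OneShotDialogue
{
    private readonly StatelessExecutor _executor;
    private readonly List<(string Role, string Text)> _history = new();

    public OneShotDialogue(StatelessExecutor executor) => _executor = executor;

    public string Ask(string userMessage)
    {
        _history.Add(("Human", userMessage));

        // Rebuild the full prompt from our own history; the executor itself keeps no state between calls.
        var prompt = new StringBuilder();
        foreach (var (role, text) in _history)
            prompt.Append($"### {role}: {text}\n");
        prompt.Append("### Assistant:");

        var inferenceParams = new InferenceParams { MaxTokens = 64, AntiPrompts = new List<string> { "### Human:" } };
        var reply = new StringBuilder();
        foreach (var token in _executor.Infer(prompt.ToString(), inferenceParams))
            reply.Append(token);

        var answer = reply.ToString().Trim();
        _history.Add(("Assistant", answer));
        return answer;
    }
}
```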

Discussions about AI, NLP and T5 model...

Personally, I'm very much willing to have such a deep discussion. I was a researcher in image processing while studying for a master's degree and am now working on AI infrastructure (mainly distributed training systems and high-performance inference). I can't agree more that a serialized thread is important for the intelligence of AI models. On most single-image tasks, AI has already beaten human beings or reached a similar level, even at drawing images (stable diffusion). But for image series, which mostly means video, humans still have their own advantages. There are many rumors about ChatGPT now; many people think it benefits from MoE and that the biggest difficulty is large-scale distributed training (an unconfirmed claim is 10,000 GPUs). Some star researchers, like LeCun, are even more aggressive, saying that autoregressive models like GPT are leading us down the wrong path to AGI. However, right or wrong, OpenAI and ChatGPT are successful because they serve lots of people well. Though I'm not a researcher now, I'm always open to new things about AI (T5 is not a "new model" anymore, but it is innovative compared to so many GPT-architecture models). I'd like to help write a repo for T5 inference in C/C++ and port it to C# or other languages. However, so far there's little evidence that T5 has the potential to be better than GPT models (maybe I'm wrong; I don't know much about it, so I depend heavily on evaluation results, but the open-source evaluations are mostly designed for GPT-like models). May I assume that T5 should already perform better on non-projective languages than GPT-architecture models with the same number of parameters?

It's late at night here and I have work tomorrow, so my reply is limited. For subsequent discussion, could you please open a discussion in this repo, or send an email to me (AsakusaRinne@gmail.com)? I'm not sure if other people in this issue would find the messages disturbing, though I think it's really a good topic. :)

Uralstech commented 11 months ago

I've tested LLamaSharp v0.6.0 in Unity with @eublefar's new script and am getting 15-17 seconds for a reply, which is good for my laptop.

One thing about LLamaSharp's Unity compatibility - Unity is on C# 9, so there are some errors when first importing into Unity:

Feature 'file-scoped namespace' is not available in C# 9.0. (This occurs in many scripts)

  • These can be fixed by using block-scoped namespaces.

LoraAdapter (IModelParams.cs)

  • Feature 'record structs' is not available in C# 9.0.
  • The feature 'primary constructors' is currently in Preview and unsupported.
  • A record cannot be readonly.
  • This can be fixed by making LoraAdapter a readonly struct:

```cs
public readonly struct LoraAdapter
{
    public readonly string Path;
    public readonly float Scale;

    public LoraAdapter(string path, float scale)
    {
        Path = path;
        Scale = scale;
    }
}
```


Unity is targeting NetStandard 2.1, so I get these errors:

FixedSizeQueue(int size, IEnumerable<T> data) (FixedSizeQueue.cs)

  • 'IEnumerable<T>' does not contain a definition for 'TryGetNonEnumeratedCount' and no accessible extension method 'TryGetNonEnumeratedCount' accepting a first argument of type 'IEnumerable<T>' could be found.
  • This can be fixed by adding a case in the preprocessor for NetStandard 2.1:

```cs
#if !NETSTANDARD2_0 && !NETSTANDARD2_1
            // Try to check the size without enumerating the entire IEnumerable. This may not be able to get the count,
            // in which case we'll have to check later
            if (data.TryGetNonEnumeratedCount(out var dataCount) && dataCount > size)
                throw new ArgumentException($"The max size set for the quene is {size}, but got {dataCount} initial values.");
#endif
```

ListExtensions.cs

  • 'List' does not contain a definition for 'EnsureCapacity' and no accessible extension method 'EnsureCapacity' accepting a first argument of type 'List' could be found.
  • This can also be fixed by adding a case for NetStandard 2.1 in the preprocessor block around the definition of EnsureCapacity:

```cs
#if NETSTANDARD2_0 || NETSTANDARD2_1
public static void EnsureCapacity<T>(this List<T> list, int capacity)
{
    if (list.Capacity < capacity)
        list.Capacity = capacity;
}
#endif
```

In general, Unity support will improve greatly if LLamaSharp targets NetStandard 2.1, even if in addition to NetStandard 2.0.

Just a kind request, is anyone here willing to write a short blog/document about using LLamaSharp in unity? We have little experience with unity and are not fully aware of the gap between unity and dotnet core runtime, so that there's no documents about deploying on unity yet. However according to the issues, unity is one of the most important parts of the application. I'll appreciate it if anyone could help. :)

I can make a small document of the changes I made to LLamaSharp to make it work in Unity.

I am pretty good with Unity and C#, but I have no idea how llama.cpp works... I can help in any way possible!


Also, somewhat irrelevant to this issue, would LLamaSharp work on Android and is there any plan to have an official runtime for Android?

I have yet to try building llama.cpp for Android and using it in LLamaSharp. I am mainly interested in using LLamaSharp to run inference with a LLaMA model for chat in Android apps.

AsakusaRinne commented 11 months ago

@Uralstech Many thanks for your reply! We'll fix the compatibility problems you listed soon!

In general, Unity support will improve greatly if LLamaSharp targets NetStandard 2.1, even if in addition to NetStandard 2.0.

I think adding an extra target from the next release onwards is okay if it helps with compatibility with Unity.

I can make a small document of the changes I made to LLamaSharp to make it work in Unity. I am pretty good in Unity and CSharp, but I have no idea how LLaMA CPP works... I can help in any way possible!

I'd really appreciate that! A blog/document about how to make LLamaSharp work with Unity is enough. I think being stuck at the first step discourages lots of users.

Also, somewhat irrelevant to this issue, would LLamaSharp work on Android and is there any plan to have an official runtime for Android?

I'm pretty sure that llama.cpp works on Android, but I'm not sure if it's okay with the dotnet runtime. In fact I'm mainly a cpp developer and C# is my interest, so I don't know whether the same LLamaSharp code could work on Android. Recently I've also been doing some work on Android (see faster-rwkv, which supports Android inference). Therefore I think I could help with the step of building llama.cpp, but I still need some knowledge about dotnet apps on the Android platform.

eublefar commented 11 months ago

One thing about LlamaSharp's Unity compatibility - Unity is on C# 9, so there are some errors when first importing into unity:

You can use the precompiled DLLs from NuGet and then there is no need to modify the code. I also use NuGetForUnity for dependencies, but it should be trivial to just find and add the DLLs from those packages yourself.

@Xsanf @AsakusaRinne

Just a kind request, is anyone here willing to write a short blog/document about using LLamaSharp in unity?

I can clean up a demo project and write some README on how to start with LLAMASharp in Unity, if you want.

I also switched to using InteractiveExecutor from your README example. It works pretty well with UniTask - basically an async/await integration for Unity. await UniTask.SwitchToThreadPool(); is especially useful for offloading blocking computations to other threads.
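Roughly what that looks like in a component (a simplified sketch; it assumes the UniTask package is installed and reuses the StatelessExecutor.Infer call from earlier in this thread, but the same pattern applies to InteractiveExecutor):

```csharp
using System.Collections.Generic;
using System.Text;
using Cysharp.Threading.Tasks; // UniTask
using LLama;
using LLama.Common;
using UnityEngine;
using UnityEngine.UI;

public class LLamaUniTaskExample : MonoBehaviour
{
    [SerializeField] private Text _uiText;   // assigned in the inspector
    private StatelessExecutor _executor;

    void Start()
    {
        // Placeholder path; in practice put the model under StreamingAssets as described above.
        var parameters = new ModelParams(Application.streamingAssetsPath + "/llama/model.gguf");
        var weights = LLamaWeights.LoadFromFile(parameters);
        _executor = new StatelessExecutor(weights, parameters);
    }

    public async UniTask GenerateReplyAsync(string prompt)
    {
        // Leave the Unity main thread so the blocking inference call doesn't drop frames.
        await UniTask.SwitchToThreadPool();

        var inferenceParams = new InferenceParams { MaxTokens = 96, AntiPrompts = new List<string> { "### Human:" } };
        var sb = new StringBuilder();
        foreach (var token in _executor.Infer(prompt, inferenceParams))
            sb.Append(token);

        // Come back to the main thread before touching any UnityEngine objects.
        await UniTask.SwitchToMainThread();
        _uiText.text = sb.ToString();
    }
}
```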

But one big feature I don't know how to implement is support for multiple sequences. As far as i understand native API owns the KV cache and even has support for different sequence ids, but It's unclear how to use it from high level API.
Some use cases I am thinking about:

  • Having multiple characters and dialogues in the game without recomputing the KV cache each time.
  • Having other utility prompts to e.g. start actions or things like that (i.e. precompute the prompt KV and then reuse it over multiple executions without updating it).
  • A super fancy feature would be to load a prefix-tuned KV cache for the use cases above.

The closest thing I found is StatefulExecutorBase.GetStateData / StatefulExecutorBase.LoadStateData. Would this be an efficient approach to do this (is there a lot of memory allocations when passing data between native and bindings)? Is there any other way to handle this?

martindevans commented 11 months ago

You can use precompiled DLL from NuGet and then there is no need to modify the code.

I'd definitely recommend taking this approach. It's going to be a lot easier than trying to maintain a fork of LLamaSharp which tries to pull it back 2 entire language versions!

NETSTANDARD2_1

Looks like we have a "hole" in our version support since all of the compatibility shims are written in NETSTANDARD2_0 blocks (that's sufficient for .NET Framework 4.5, but not Unity). @Uralstech if you'd like to make a PR adding NETSTANDARD2_1 support (i.e. replacing all the #ifdef blocks and adding a new target) I'd be happy to review and merge that :)

Edit: If you want to do something a little extra while doing this, it might be good to add something like this to ensure our #ifdef blocks never have this error again :)

would LLamaSharp work on Android

I don't have much experience working on Android, but I would expect LLamaSharp to work on Android. We already support ARM64 for MacOS, which doesn't require any architecture/platform specific code (just compatible binaries).

But one big feature I don't know how to implement is support for multiple sequences.

I've recently been working on this; check out the new Batched Decoding example to see a very rough prototype. It sounds like you have some interesting ideas here (to be honest I don't fully understand the new system yet), so I've started a discussion here if you want to talk about it.

StatefulExecutorBase.GetStateData / StatefulExecutorBase.LoadStateData. Would this be an efficient approach to do this (is there a lot of memory allocations when passing data between native and bindings)?

I think that's the best option with LLamaSharp right now, but it is an expensive operation because it's a pretty big chunk of data! LLamaSharp itself isn't adding any overhead here, we're just allocating a big block of memory and asking llama.cpp to copy the state data into it. Batched decoding will definitely be better.

eublefar commented 11 months ago

@AsakusaRinne, Here is the demo project https://github.com/eublefar/LLAMASharpUnityDemo btw.

Uralstech commented 11 months ago

One thing about LlamaSharp's Unity compatibility - Unity is on C# 9, so there are some errors when first importing into unity:

You can use precompiled DLL from NuGet and then there is no need to modify the code. I also use NuGetForUnity for dependencies, but It should be trivial to just find and add dlls from those yourself.

@Xsanf @AsakusaRinne

Just a kind request, is anyone here willing to write a short blog/document about using LLamaSharp in unity?

I can clean up a demo project and write some README on how to start with LLAMASharp in Unity, if you want.

I also switched to using InteractiveExecutor from your README example. It works pretty good with UniTask - basically an async/await integration for Unity. await UniTask.SwitchToThreadPool(); is especially useful for offloading blocking computations to other threads.

But one big feature I don't know how to implement is support for multiple sequences. As far as i understand native API owns the KV cache and even has support for different sequence ids, but It's unclear how to use it from high level API. Some usecases I am thinking about:

  • Having multiple characters and dialogues in the game without recomputing KV cache each time.
  • Having other utility promts to e.g. start actions or things like that (i.e. precompute prompt KV and then reuse it over multiple executions without updating it).
  • Super fancy feature would be to load prefix-tuned KV cache for usecases above.

The closest thing I found is StatefulExecutorBase.GetStateData / StatefulExecutorBase.LoadStateData. Would this be an efficient approach to do this (is there a lot of memory allocations when passing data between native and bindings)? Is there any other way to handle this?

From what I understand, I cannot add my own build of LLaMA CPP without using the source version of LLamaSharp - I am mainly interested in running a model on Android, and as there is no official backend for Android, I plan to build LLaMA CPP from source with Android support and use it in LLamaSharp.

Looks like we have a "hole" in our version support since all of the compatibility shims are written in NETSTANDARD2_0 blocks (that's sufficient for .NET Framework 4.5, but not Unity). @Uralstech if you'd like to make a PR adding NETSTANDARD2_1 support (i.e. replacing all the #ifdef blocks and adding a new target) I'd be happy to review and merge that :)

I have forked the project! I'll updated it with the changes as soon as possible.

eublefar commented 11 months ago

From what I understand, I cannot add my own build of LLaMA CPP without using the source version of LLamaSharp - I am mainly interested in running a model on Android, and as there is no official backend for Android, I plan to build LLaMA CPP from source with Android support and use it in LLamaSharp.

I think you can, though. You just need to clone the LLamaSharp project separately and cross-compile it for the arm64 target (or whatever architecture you are targeting), the same way you would for llama.cpp.

martindevans commented 11 months ago

If you want to use a custom compiled DLL just don't install a backend package (e.g. LLamaSharp.CPU) and make sure there's a libllama.dll/libllama.so in the directory yourself. That binary can be anything you want.

If you do that, ensure that you use exactly the correct commit of llama.cpp! There's absolutely no compatibility from version to version.

Xsanf commented 11 months ago

@eublefar Thank you very much for the demo. As I said, I'm not a big expert in Unity, so my example was just a provocation which I hoped would attract the attention of stronger developers. Unity is an extremely interesting segment for LLMs.

A huge relief - now there is no need to restart the project, and everything shuts down normally. Based on your explanation, I rebuilt the project for the GPU. Everything works very quickly. I'll try to figure out how this version works.

AsakusaRinne commented 11 months ago

@AsakusaRinne, Here is the demo project https://github.com/eublefar/LLAMASharpUnityDemo btw.

Thank you very much for your work! I've put the demo in the README. :)

Uralstech commented 11 months ago

I have made a pull request regarding the preprocessor directives for targeting newer versions of .NET Standard! I have only made the changes in the root LLamaSharp project, as I haven't explored the other parts.

Murat-Simsek commented 11 months ago

Hi everyone, I really like the project and I want to contribute with what I can. Can we make a to do list on the github page? @eublefar @Uralstech

AsakusaRinne commented 11 months ago

Hi everyone, I really like the project and I want to contribute with what I can. Can we make a to do list on the github page? @eublefar @Uralstech

Of course, you're always welcome! I just made a project which contains the TODO list; please refer to the LLamaSharp Dev Project. You could also join our Discord and I'll invite you to the dev channel.

martindevans commented 7 months ago

I'll close this one now, since it seems like the discussion is over. Feel free to re-open this or open new issues of course :)