SciSharp / LLamaSharp

A C#/.NET library to run LLM (πŸ¦™LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

Feature Request: Switch backends dynamically at runtime? #264

Open BrainSlugs83 opened 11 months ago

BrainSlugs83 commented 11 months ago

Right now, to switch backends, I have to uninstall one NuGet package and then install another. -- This means that if I make an app, I would have to create a different build for every configuration.

Instead I'd rather just give someone a dropdown box in a UI and let them pick the backend, and have the app switch on the fly to whatever they have selected.

Ideally this would just be an enum in the ModelParams, something like Backend = LLamaSharp.Cuda, or Backend = LLamaSharp.Cpu, etc. (Heck, even if it has to be a global static parameter, that's better than having to make separate builds.)

One possible way it could be implemented is by shipping all of the libllama.dll files in the output directory with slightly different names (e.g. libllama-cuda.dll, libllama-avx.dll, etc.) and dynamically loading the DLL into memory based on which backend is being requested.

martindevans commented 11 months ago

It sounds like there are two parts to this request:

BrainSlugs83 commented 11 months ago

It looks like you're already trying to load them one at a time in the static API... what would be the harm in loading all of the ones that are found and just putting the IntPtrs into an array?

We could set the one we want to use via a static enum property -- and the IntPtr could just be a read-only property that returns the appropriate IntPtr based on which enum value is selected. (Or throws an exception if a missing one was selected.)

We could have another static property, something like IEnumerable<Backend> AvailableBackends { get; }, that does a yield return Backend.xyz for each backend that is available...

That would allow hot-swapping the DLLs at runtime.

Edit: I suppose you wouldn't want the backend to switch out from under any already created LLMs, or whatever... but even then... you could still load them all into an array and use an enum (just don't do the static property stuff). Just pass the enum in as a model param, and it grabs the backend pointer at creation time.

That would allow you to use multiple backends at the same time.
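
A minimal sketch of the API shape being proposed (everything here is hypothetical -- the Backend enum, BackendRegistry and the renamed-DLL naming scheme are not part of LLamaSharp):

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public enum Backend { Cpu, CpuAvx2, Cuda11, Cuda12 }

public static class BackendRegistry
{
    private static readonly Dictionary<Backend, IntPtr> Loaded = new Dictionary<Backend, IntPtr>();

    // Enumerates the backends whose native library can actually be loaded on this machine.
    public static IEnumerable<Backend> AvailableBackends
    {
        get
        {
            foreach (Backend backend in Enum.GetValues(typeof(Backend)))
                if (TryGetHandle(backend, out _))
                    yield return backend;
        }
    }

    // Loads (and caches) the library for a backend; the handle would be handed to a model at creation time.
    public static bool TryGetHandle(Backend backend, out IntPtr handle)
    {
        if (Loaded.TryGetValue(backend, out handle))
            return true;

        string fileName = $"libllama-{backend.ToString().ToLowerInvariant()}.dll"; // assumed naming scheme
        if (NativeLibrary.TryLoad(fileName, out handle))
        {
            Loaded[backend] = handle;
            return true;
        }
        return false;
    }
}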

martindevans commented 11 months ago

Loading all of the backends at once wouldn't work with the current system because the native methods are written like this:

[DllImport("libllama")]
public static extern void demo_method();

That will use the already loaded libllama.dll (if there is one); otherwise it will find and load it when called. Given that, I'm not sure exactly what would happen if you loaded multiple, but I'm sure it wouldn't be good!

If we just wanted to allow unloading one backend DLL and loading another I think that could probably be done by keeping the IntPtr returned from NativeLibrary.Load and passing it into NativeLibrary.Free later (currently we don't keep that pointer because we never unload libraries). In this case no rewriting of the native symbols would be required because there's only ever one loaded library.
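
A minimal sketch of that single-backend swap, assuming the hypothetical helper below (not LLamaSharp code):

using System;
using System.Runtime.InteropServices;

public static class SwappableBackend
{
    private static IntPtr _handle;

    // Keep the handle from NativeLibrary.Load so the previously selected
    // backend can be freed before another one is loaded.
    public static void Use(string libraryPath)
    {
        if (_handle != IntPtr.Zero)
            NativeLibrary.Free(_handle);

        _handle = NativeLibrary.Load(libraryPath);
    }
}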

If we wanted to allow multiple backends that's a lot more complex. I'm not even 100% sure it's possible: is it guaranteed that two versions of libllama.dll don't try to use some per-process state internally? Assuming it is, we'd probably have to define a non-static "LLamaBackend" object and then fetch all of the native methods using GetExport on a specific backend. Of course we'd probably also want to allow some kind of default backend, so users can continue using the library roughly as-is without having to create a backend object.
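
A rough sketch of that non-static backend object, reusing demo_method from the example above; LLamaBackendHandle is a hypothetical name, and this assumes the library really is safe to load more than once:

using System;
using System.Runtime.InteropServices;

public sealed class LLamaBackendHandle : IDisposable
{
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    private delegate void DemoMethod();

    private readonly IntPtr _handle;
    private readonly DemoMethod _demoMethod;

    public LLamaBackendHandle(string libraryPath)
    {
        _handle = NativeLibrary.Load(libraryPath);

        // Bind each symbol against this specific handle instead of relying on
        // the process-wide [DllImport] resolution.
        IntPtr export = NativeLibrary.GetExport(_handle, "demo_method");
        _demoMethod = Marshal.GetDelegateForFunctionPointer<DemoMethod>(export);
    }

    public void Demo() => _demoMethod();

    public void Dispose() => NativeLibrary.Free(_handle);
}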

If you're interested in experimenting with any of this stuff I'm happy to help with that. Some prototypes in a fork showing some proof-of-concept unloading/reloading/multi-loading would be great!

P.S. one extra complexity I just thought of: NativeLibrary doesn't exist in NETSTANDARD2_0, so there's no cross-platform way to do all this stuff there. It can be done, but you need to call the appropriate OS methods directly.

saddam213 commented 11 months ago

What if we wrote a bunch of delegates as a sort of wrapped API and used LoadLibrary to swap DLLs out at runtime?

However, the performance hit could be heavy, and cleanup could become troublesome.

    using System;
    using System.Runtime.InteropServices;

    // Windows-only P/Invoke declarations for loading/unloading a native library by hand.
    static class KernelNativeMethods
    {
        [DllImport("kernel32.dll")]
        public static extern IntPtr LoadLibrary(string dllToLoad);

        [DllImport("kernel32.dll")]
        public static extern IntPtr GetProcAddress(IntPtr hModule, string procedureName);

        [DllImport("kernel32.dll")]
        public static extern bool FreeLibrary(IntPtr hModule);
    }

    class Program
    {
        [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
        private delegate bool LLamaEmptyCall();

        static void Main(string[] args)
        {
            // Load the chosen backend DLL and resolve one of its exports.
            IntPtr libllama = KernelNativeMethods.LoadLibrary(@"libllama-cuda12.dll");
            IntPtr functionToCallAddress = KernelNativeMethods.GetProcAddress(libllama, "llama_mmap_supported");

            // Wrap the raw function pointer in a managed delegate and call it.
            LLamaEmptyCall llama_empty_call = (LLamaEmptyCall)Marshal.GetDelegateForFunctionPointer(functionToCallAddress, typeof(LLamaEmptyCall));
            bool result = llama_empty_call();
            Console.WriteLine(result);

            // Unload the DLL when done with it.
            KernelNativeMethods.FreeLibrary(libllama);
        }
    }

martindevans commented 11 months ago

What if we wrote a bunch of delegates as a sort of wrapped API and used LoadLibrary to swap DLLs out at runtime?

As I understand it that's roughly equivalent to what I was suggesting: NativeLibrary.Load (the cross-platform equivalent of kernel32.LoadLibrary) and NativeLibrary.GetExport (the cross-platform equivalent of kernel32.GetProcAddress).
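
For reference, the snippet above translated to the cross-platform API looks roughly like this (same library name and export as in the example):

using System;
using System.Runtime.InteropServices;

class Program
{
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    private delegate bool LLamaEmptyCall();

    static void Main()
    {
        // NativeLibrary.Load / GetExport / Free replace LoadLibrary / GetProcAddress / FreeLibrary.
        IntPtr libllama = NativeLibrary.Load("libllama-cuda12.dll");
        IntPtr address = NativeLibrary.GetExport(libllama, "llama_mmap_supported");

        var llamaEmptyCall = Marshal.GetDelegateForFunctionPointer<LLamaEmptyCall>(address);
        Console.WriteLine(llamaEmptyCall());

        NativeLibrary.Free(libllama);
    }
}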

A few problems that I'm not sure about with loading multiple backends:

saddam213 commented 11 months ago

Just support for swapping would be handy.

I am one of the few that run LLamaSharp on servers with GPUs but don't want to use the GPUs, as they are busy running other apps, so I'm just hoping we include a way to override the incoming auto-detect code.

In the past, n_gpu_layers = 0 resulted in a crash; this could have been resolved by now, as it's been several months since I tried.

AsakusaRinne commented 11 months ago

To me, either supporting multiple backends or swapping backends seems too aggressive for now. The largest problem is that there is memory held in the native library which we know little about. Things may be out of our control when swapping the backend. For now I suggest allowing the user to select a preferred backend only once, before loading the llama model (the "ability to configure the backend before it is loaded" discussed above).

In the situation @saddam213 mentioned, there's a point which needs further discussion. If there's a model m running on the GPU and now I want it to run only on the CPU instead, is the time spent re-loading the model and context acceptable? Besides, when GPUs are busy running other apps, is GPU memory also in short supply? The answer decides whether we should clear the memory allocated on the GPU. If clearing it every time is OK, I think saving the current state and reloading the whole process with another backend is reasonable. Otherwise it will be complex: if we switch to another backend, there's a huge risk of a memory leak, and even if we solved the memory leak, after executing some inference with the swapped backend, how would we synchronize the new state with the old one?

When we call NativeLibrary.Free, the memory allocated by the DLL on the heap won't be released. If we want to avoid a memory leak, we must have an API in C++ that clears all the memory the library allocated itself, and call that API from C# before loading another backend.
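
In C# that would look something like the sketch below, where llama_release_all_memory is a made-up export name standing in for whatever cleanup entry point the native side would have to provide:

using System.Runtime.InteropServices;

internal static class BackendCleanup
{
    // Hypothetical export -- no such function exists in llama.cpp today.
    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    private static extern void llama_release_all_memory();

    // Call this before freeing the current backend and loading another one.
    public static void ReleaseNativeMemory() => llama_release_all_memory();
}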

Though I'm against dynamically switching the backend after loading for now, I think this issue is a commonly encountered one when building a service based on LLamaSharp and is worth working on. I was just saying we shouldn't make the changes until we eliminate the risks.

hswlab commented 11 months ago

I'm working on an Electron/LLamaSharp application with .NET 6 and was wondering if it would be possible to give the user the option to select the backend in the app settings via a dropdown. The DLL would then be downloaded from GitHub (https://github.com/ggerganov/llama.cpp/releases/latest) and unpacked. The application would then need to be restarted. When starting the application, LLamaSharp just needs to know in which directory the unpacked DLL is located.

I noticed this code, where an alternative path to the backend DLL could possibly be loaded:

(screenshot of the NativeLibraryConfig loading code)

I haven't tested the NativeLibraryConfig yet, but this looks very promising for dynamic loading of the backend.
Would it work to call the NativeLibraryConfig.WithLibrary method in Program.cs at application start with a custom path?

Edit: It seems that WithLibrary(string libraryPath) is not a static method. I think I can't use it as easily as I thought :)

Edit2: I think I figured out how to use this config. I set this at the beginning of Program.cs: NativeLibraryConfig.Default.WithLibrary(@"C:\Users\my_path\llama.dll");

Now the correct path is loaded, but the exception is still thrown. Probably I'm loading the wrong DLL or I'm missing something. I will check it tomorrow ^^' (screenshot of the exception)

AsakusaRinne commented 11 months ago

Edit: It seems that WithLibrary(string libraryPath) is not a static method. I think I can't use it as easily as I thought :)

Please use NativeLibraryConfig.Default.WithLibrary. Note that this API may be changed to NativeLibraryConfig.Instance in #281.

Would it work to call the NativeLibraryConfig.WithLibrary method in the Program.cs at application start with a custom path?

I think so, it was designed to do this :)

I haven't tested the NativeLibraryConfig yet, but this looks very promising for dynamic loading of the backend.

If you would like to, you can remove the check here so that the library can be loaded from anywhere at runtime, to help check whether there are memory leaks or other bad behaviours.

martindevans commented 11 months ago

Looks like you've already worked it out, but NativeLibraryConfig.Default.WithLibrary( is the way to do this :)

Please note though that you cannot just download the latest DLL from llama.cpp - you must download exactly the right commit version. There is absolutely no compatibility from version to version!

AsakusaRinne commented 11 months ago

Edit2: I think I figured out how to use this config. I set this at the beginning of Program.cs: NativeLibraryConfig.Default.WithLibrary(@"C:\Users\my_path\llama.dll");

Please ensure you load a library named libllama.dll, instead of llama.dll.

hswlab commented 11 months ago

Ok, I couldn't fall asleep without testing it again :D It actually worked after I renamed llama.dll to libllama.dll. Very nice. Thank you very much.

BrainSlugs83 commented 11 months ago

The largest problem is that there is memory held in the native library which we know little about. Things may be out of our control when swapping the backend.

That's exactly why I was suggesting having all of the DLLs present loaded at once -- so that a model which is already loaded would not be affected; it would just continue working against whichever DLL was passed in to it when it was instantiated.

Yes that will increase memory usage... but if each model is tied to the single version of the backend it was created on, then each model was going to consume that memory anyway.

[DllImport("libllama")] public static extern void demo_method();

Doesn't that make it like... super easy to do what I'm proposing then??

Before you scoff, and think that "wow, this will be a lot more typing and a lot more code to maintain" -- please remember that T4 templates exist -- and literally all of this can be automated to run at build time... to the point where you just pass in an Enum to the model at creation time, and everything else is done for you.

All you would need to do is define one instance of the class (with the regular extern imports in C# which you are already doing), and which DLLs go to which enums... the enums themselves could even be generated by the T4 template -- so in that case you would just need a definition file -- something like:

{
    "Cuda": "Cuda\\libllama.dll",
    "CpuAvx2": "CpuAvx2\\libllama.dll",
    // ...
}

And your T4 template could just read that JSON definition file as an input, along with a specified C# file to modify, and it could generate the rest at compile time.

Setup

Let's consider the following setup as a very basic version of what we're talking about:

Let's say I've got 1337\Library1.dll defined as this: extern "C" { __declspec(dllexport) int SomeFunction() { return 1337; } }

And 42000\Library1.dll defined as this: extern "C" { __declspec(dllexport) int SomeFunction() { return 42000; } }

Now, in C# I could do it quick and dirty like so:

[DllImport(@"library1.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern int SomeFunction();

And that works... but you can't switch between the two implementations dynamically.

Simple Interfaces

Below is the dirty version of what I'm suggesting (this would basically be a simplified version of the output of your T4 template -- I can assist in writing the template itself if you are interested in going this route):

// <generated> This file is generated by the following template: ... </generated>

public static class etern_Library1_1337
{
    [DllImport(@"relative\runtime\path\to\1337\library1.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern int SomeFunction();
}

public static class etern_Library1_42000
{
    [DllImport(@"relative\runtime\path\to\42000\library1.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern int SomeFunction();
}

public interface ILibrary1
{
    int SomeFunction();
}

public class Library1_Impl_1337 : ILibrary1
{
    public int SomeFunction() => etern_Library1_1337.SomeFunction();
}

public class Library1_Impl_42000 : ILibrary1
{
    public int SomeFunction() => etern_Library1_42000.SomeFunction();
}

public class Library1Sharp
{
    public ILibrary1 Backend { get; private set; }

    public int SomeFunction() => Backend.SomeFunction();

    public Library1Sharp(ILibrary1 backend)
    {
        this.Backend = backend;
    }
}

So from here, the user just needs to specify the backend when they create their objects (again, preferably via an enum that is managed by the T4 template itself).

Everything else can talk to the unmanaged code via their own version of the interface that was passed in to them.
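
Usage would then look something like this, continuing the hypothetical types above:

using System;

class Demo
{
    static void Main()
    {
        // The caller picks a backend implementation once; everything downstream
        // talks to the native code through that instance.
        ILibrary1 backend = new Library1_Impl_42000();
        var library = new Library1Sharp(backend);

        Console.WriteLine(library.SomeFunction()); // 42000, from the second DLL
    }
}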

Motivation & Usage

This design fulfills every letter of the SOLID acronym for good software design IMO, so that's part of why I'm suggesting it.

It will be really obvious to the user which model is running on which backend -- and the code doesn't break when you switch -- all you do is create a new instance with the desired model, transfer the state over, and dispose the old one. -- Transferring the state would be the user's responsibility in this case. (Assuming those bugs were to get fixed.)

AsakusaRinne commented 11 months ago

@BrainSlugs83 Thanks a lot for these suggestions. We're always open to feature proposals. :) Before we start a deep discussion about it, could you please describe further why you want to switch dynamically between the CPU backend and the CUDA backend? If the reason is to switch between a GPU-offloaded model and a CPU-offloaded one, in #298 we found that llama.cpp supports running on pure CPU with the CUDA-built library.

martindevans commented 11 months ago

I do think this is a viable design: the backend could be specified when you load the model, and from then on it can be handled automatically within LLamaSharp (LLamaWeights would hold a reference to the backend instance; when you create a context, that would in turn hold a reference to the backend, etc.).

That said though, I do think it comes with a very large "complexity cost" for a relatively small benefit to be honest. I'm not against adding it, but I'm also not rushing to do all the necessary work πŸ˜†


Just a note about this:

[DllImport(@"relative\runtime\path\to\1337\library1.dll")]
public static extern int SomeFunction();

As far as I know it must just be the name of the DLL, not an entire path. That's not a fatal flaw though; you could make it work by mashing the whole path down into the name, like:

class Foo {
    [DllImport(@"1337_library1.dll")]
    public static extern int SomeFunction();
}

class Bar {
    [DllImport(@"42000_library1.dll")]
    public static extern int SomeFunction();
}

and renaming your libraries as appropriate.

BrainSlugs83 commented 11 months ago

As far as I know it must just be the name of the DLL, not an entire path.

I tested the above with both absolute and relative paths, and they both work. The same directory is not needed -- however, if using relative paths, the path needs to be relative to the current directory.

Before we start a deep discussion about it, could you please describe further why you want to switch dynamically between the CPU backend and the CUDA backend? If the reason is to switch between a GPU-offloaded model and a CPU-offloaded one, in https://github.com/SciSharp/LLamaSharp/issues/298 we found that llama.cpp supports running on pure CPU with the CUDA-built library.

That's actually great information that I was not aware of before -- for the time being it sounds like I can now just stick with the CUDA backend. (But there's no guarantee that will work for folks who don't have CUDA installed -- for the release version of my app...)

But previously, at least, I'd run into bugs and instabilities with either CUDA or non-CUDA depending on the version of LLamaSharp, and in the interest of trying to narrow down the bug, I found it very painful to have to constantly change NuGet packages. I was also thinking of other benefits, such as runtime switching for an end-user application. (Because it seems performance varies wildly between the two from version to version as well. 🫀)

I was also thinking of other targets like Metal or a non-AVX CPU lib -- or perhaps, down the road, if other third-party backend targets exist (such as OpenCL, Windows ONNX acceleration, or even a Vulkan backend), it would be great for compatibility to be able to switch between them on the fly.

AsakusaRinne commented 11 months ago

But there's no guarantee that will work for folks who don't have CUDA installed -- for the release version of my app...

Yes, we are cautious about this feature, so we still keep separate CPU and CUDA backend packages. However, I believe this llama.cpp feature is meant to support running on non-CUDA devices.

Besides, for the situation you described, I think the CUDA auto-detection feature included in v0.8.0 already supports it. We are thinking about gathering all the native libraries into one backend package (but haven't made the final decision) and automatically choosing a library. It would choose the CUDA backend only when CUDA is available on the device.

jhancock4d commented 10 months ago

Couldn't we just automatically load only those libraries that the CPU/GPUs support, based on interrogating the OS? Then calculate the memory required and run on the GPU if available and on the CPU if not?

martindevans commented 10 months ago

Couldn't we just automatically load only those libraries that the CPU/GPUs support, based on interrogating the OS?

That's actually what we already do. CUDA binaries are loaded based on the version of CUDA installed, and failing that, the best CPU binary is loaded (based on which AVX version your CPU supports).
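
As an illustration only (not the actual NativeApi.Load code, and the file layout is assumed), the decision boils down to something like:

using System.Runtime.Intrinsics.X86;

internal static class LibrarySelection
{
    public static string PickLibraryName(int? detectedCudaMajorVersion)
    {
        // Prefer a CUDA build matching the installed CUDA version...
        if (detectedCudaMajorVersion == 12) return "cuda12/libllama.dll";
        if (detectedCudaMajorVersion == 11) return "cuda11/libllama.dll";

        // ...otherwise fall back to the best AVX level this CPU supports.
        if (Avx2.IsSupported) return "avx2/libllama.dll";
        if (Avx.IsSupported) return "avx/libllama.dll";
        return "noavx/libllama.dll";
    }
}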

Then calculate the memory required and run on the GPU if available and on the CPU if not?

There was some discussion in #42 about automatic layer count calculation. It seems like that's complicated even for llama.cpp (which has more information to work it out with than LLamaSharp does), so that probably won't happen any time soon/ever, unfortunately.

jhancock4d commented 10 months ago

OK, well, LM Studio is doing something, because if I add both the CPU and CUDA packages to a project, it always uses the CPU and ignores CUDA being there; and if I have CUDA and run out of memory for the request, you just get an out-of-memory message and your app dies.

In LM Studio, it uses CUDA first if the context fits into GPU VRAM, and if it doesn't, then it uses the CPU. I can see it doing this in Task Manager.

It would be nice to have the same results so that our apps aren't so brittle.

BrainSlugs83 commented 10 months ago

Yes, we are cautious about this feature, so we still keep separate CPU and CUDA backend packages. However, I believe this llama.cpp feature is meant to support running on non-CUDA devices.

Yeah, the other issue is there are multiple versions of CUDA, right? -- So even if I just packaged an app with CUDA, and were to set GPU layers to 0, I would still have to have two separate release builds of my application for the GPU users depending on whether the user had CUDA 11 or CUDA 12. -- So again, it would be good to be able to switch at runtime... without needing separate NuGet packages for each backend.

AsakusaRinne commented 10 months ago

Yeah, the other issue is there are multiple versions of CUDA, right? -- So even if I just packaged an app with CUDA, and were to set GPU layers to 0, I would still have to have two separate release builds of my application for the GPU users depending on whether the user had CUDA 11 or CUDA 12. -- So again, it would be good to be able to switch at runtime... without needing separate NuGet packages for each backend.

We already support detection of the CUDA version to load a suitable library. The only thing left is to integrate all the backend packages into one package. Actually, I intended to make this breaking change to the backend library in v1.0.0. I could add support for specifying a base directory in v0.8.1, so that you can take advantage of this feature by keeping a certain file structure for your native libraries. v0.8.1 will be out within 1 or 2 days.

jhancock4d commented 10 months ago

Would this allow graceful fallback? Would it allow for an OpenCL version of the library as well (for AMD)? Ideally we just want it to work as fast as it can, without having to know about the running machine to be able to use this library. So it needs to just work, like LM Studio does.

martindevans commented 10 months ago

I don't think we do anything specific for OpenCL at the moment, but this:

Ideally we just want it to work as fast as it can

is definitely the long-term intention of the loading system. As much of it as possible should be automated.

AsakusaRinne commented 10 months ago

@BrainSlugs83 v0.8.1 is out now, which allows you to specify the search directories for native libraries.

BrainSlugs83 commented 10 months ago

We already support detection of the CUDA version to load a suitable library.

Oh, dang I got confused by this: https://github.com/SciSharp/LLamaSharp/blob/884f5ade133a311177cafeaf8eea5cb4d6954c1d/LLama/Native/NativeApi.Load.cs#L199 Looking closer, I see that you are correct; this is like a fallback for when CUDA isn't detected. So that makes sense.

Actually, I intended to make this breaking change to the backend library in v1.0.0

I'm fine with waiting for a proper merged library if it's coming in 1.0.0. πŸ™‚ It's not an urgent ask -- it's just something that will make life easier for us once it's in there.

Would this allow graceful fallback? Would it allow for an OpenCL version of the library as well (for AMD)?

The llama.cpp folks already have an OpenCL version (and maybe a ROCm one IIUC), and I think they are planning a Vulkan backend as well (saw it under their discussions...) -- but the Vulkan build is a bit stalled.

v0.8.1 is out now, which allows you to specify the search directories for native libraries.

Nice, I'll take a look! πŸ™‚

martindevans commented 8 months ago

OpenCL support will be merged in with #479 and will probably be included in the next release (it still needs some work to create the new NuGet packages).

AsakusaRinne commented 5 months ago

#670 suggests a way to use LLamaSharp without backend packages, which might be related to the point of this issue. Flagging it here for your attention. :)