SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

[Proposal] Backend-free support #670

Open AsakusaRinne opened 7 months ago

AsakusaRinne commented 7 months ago

Introduction

LLamaSharp uses llama.cpp as a backend and has introduced dynamic native library loading, which allows us to choose which DLL to load at runtime. However, users still need to install the backend packages unless they have exactly one DLL to use. The problem is that most of the time a user only needs one DLL, for example the CUDA11 one, yet many DLLs have to be included, especially if we support CUDA with AVX in the future.

Dividing into backend packages that each contain a single file, as previously discussed in other issues, appears to be a solution. However, if the user has to choose a specific backend, what is the purpose of our backend selection strategy? Furthermore, this approach may lead to an excessive number of backend packages, causing potential difficulties.

Is it possible to select the native library based on the configuration and system information, download only the selected one, and avoid having too many backend packages? That is the point of this proposal.

Brief Description

My idea is to put all the native library files on HuggingFace, then download the selected one according to the configuration and system information at runtime. That's all!

APIs

The following APIs will be exposed for users to get this feature.

// Use along with other strategies such as `WithCuda`.
NativeLibraryConfig NativeLibraryConfig::WithAutoDownload(bool enable = true, string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);

// Explicitly download the library with filename and version.
void NativeLibraryConfig::DownloadNativeLibrary(string filename, string? version = null, string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);

// Explicitly download the library with specified configurations.
void NativeLibraryConfig::DownloadNativeLibrary(bool useCuda, AvxLevel avxLevel, string os = "auto", string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);

// Explicitly download the best library (for efficiency) selected by LLamaSharp according to detected system info.
void NativeLibraryConfig::DownloadBestNativeLibrary(string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);

p.s. To be honest, I don't think it's good to put the methods for downloading in NativeLibraryConfig, but I haven't come up with a better idea yet.
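
To make the intended usage concrete, here is a sketch of how the proposed methods might be called alongside the existing strategies (NativeLibraryConfig.Instance and WithCuda already exist today; the auto-download methods are only the proposal):

using LLama.Native;

// Combine the proposed auto-download option with the existing selection strategies.
NativeLibraryConfig.Instance
    .WithCuda()                                           // prefer a CUDA build if available
    .WithAutoDownload(true, cacheDir: "~/.llama-sharp");  // proposed: fetch the selected DLL on demand

// Or explicitly download the best matching library before any native call is made.
NativeLibraryConfig.Instance.DownloadBestNativeLibrary();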

Behaviors

Priorities

The most important question is: what should the behavior be when this feature is used while a backend is also installed?

My answer would be that we'll follow the priorities below (a rough code sketch follows the list).

  1. If a local file is specified via WithLibrary, just load it.
  2. If a backend has been installed, try to load a library matching the configuration. If no matching file can be found, fall back to 3.
  3. Search the default native library cache directory first. If no matching file can be found, try to download it.
  4. If there is still no matching file, throw an exception.
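
A rough sketch of that order in code; every helper and property name here is illustrative, not an existing API:

// Illustrative pseudocode for the priority order above. UserLibraryPath,
// TryFindInInstalledBackend, TryFindInCache and TryDownload are hypothetical names.
string ResolveNativeLibrary(NativeLibraryConfig config)
{
    // 1. A file explicitly specified via WithLibrary always wins.
    if (config.UserLibraryPath is not null)
        return config.UserLibraryPath;

    // 2. A locally installed backend package comes next.
    if (TryFindInInstalledBackend(config, out var path))
        return path;

    // 3. Then the cache directory, falling back to a download.
    if (TryFindInCache(config, out path) || TryDownload(config, out path))
        return path;

    // 4. Nothing matched.
    throw new FileNotFoundException("No matching native library was found, and downloading failed.");
}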

Directory structure

We will cache the files in a default directory (probably ~/.llama-sharp) or one specified by the user. In this directory, we will create subdirectories named by version, which hold the downloaded files.

In this way, there are two possible directory structures, listed below.

The first one flattens all the files:

Root
  |------v0.11.2
            |------llama-cuda11-win-x64.dll
            |------libllama-avx512-linux-x64.so
  |------v0.12.0
            |------llama-cuda12-win-x64.dll
            |------libllama-metal-osx-x64.dylib

The second one keeps the current structure:

Root
  |------v0.11.2
            |------cuda11
                   |------llama.dll
                   |------libllama.so
            |------cpu
                   |------llama.dll
                   |------libllama.so
                   |------libllama.dylib
  |------v0.12.0
           ... ...

I'm open on this and will leave the decision until later, depending on the discussion.
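
Whichever layout wins, resolving a cached file is just a path computation. A minimal sketch under the flattened layout, assuming the default ~/.llama-sharp location proposed above:

using System;
using System.IO;

static string GetCachedLibraryPath(string version, string fileName, string? cacheDir = null)
{
    // Default to ~/.llama-sharp unless the user supplied a cache directory.
    string root = cacheDir ?? Path.Combine(
        Environment.GetFolderPath(Environment.SpecialFolder.UserProfile), ".llama-sharp");

    // e.g. ~/.llama-sharp/v0.11.2/llama-cuda11-win-x64.dll
    return Path.Combine(root, $"v{version}", fileName);
}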

How to implement

Downloading files from Huggingface

The downloading itself will not be implemented in LLamaSharp. I'll create a repo named HuggingfaceHub, which I'm already working on. I'm fairly sure the downloading can be implemented without too much difficulty.

As evidence, llama.cpp already has an example function to download model files from Huggingface. In this proposal the downloading will be more complex, because we are making a library API rather than an example, but I think I can handle it.

After the completion of this library, we could depend on it in LLamaSharp to download files. The reason why I won't put it in LLamaSharp is because:

Pushing files to Huggingface

I'll do this in our CI. We only need to push files when we are going to publish a new release. I'll add a secret key to the GitHub Actions secrets and use huggingface-cli to push the files.

Advantages

I believe this feature will bring the following advantages:

Potential risks


I would appreciate any suggestions on this proposal!

martindevans commented 7 months ago

This is a pretty interesting idea; it solves the explosion of backend DLLs we have while still keeping the advantage of automatic feature selection for end-users.

My main concern (to add to the potential risks) is security - there's obviously a huge security risk in downloading DLLs and executing them as part of a program (rather than just as part of the install step). I think we should include this as a separate backend (e.g. LLamaSharp.Backend.Automatic), which can be installed if people want to use this. In my own use, for example, I would avoid using it because I'm deploying to one single server, so I know exactly which backend I need. However, if I were distributing an app to end users (who might have any combination of CPU features and GPU types) I'd definitely want to use auto downloading!

Some comments on specific implementation details:

HuggingFace

When we build the release, could we embed the commit ID directly into the source code and release that? That way you can just download from GitHub, e.g. https://github.com/SciSharp/LLamaSharp/blob/a8ba9f05b3f44cdd3368310b32b211245eda17bc/LLama/runtimes/deps/avx512/llama.dll.

This has two advantages: it reduces security exposure slightly, and it means zero extra work on deployments!

APIs

We'd probably want to integrate it into NativeLibraryConfig somehow, but we'd also need to offer an async API to prevent hangs. So you could call something like: await LLamaNative.AsyncLoad().
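
One possible shape for that async entry point; the class and method names below are hypothetical, not an existing LLamaSharp API:

using System;
using System.Threading;
using System.Threading.Tasks;
using LLama.Native;

public static class LLamaNativeLoader
{
    // Hypothetical: resolve/download the library chosen by NativeLibraryConfig off the
    // calling thread, then load it before any llama.cpp function is invoked.
    public static async Task LoadAsync(CancellationToken ct = default)
    {
        string path = await DownloadSelectedLibraryAsync(NativeLibraryConfig.Instance, ct);
        System.Runtime.InteropServices.NativeLibrary.Load(path);
    }

    // Placeholder for the download/caching logic described in this proposal.
    private static Task<string> DownloadSelectedLibraryAsync(NativeLibraryConfig config, CancellationToken ct)
        => throw new NotImplementedException();
}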

Flattening

I experimented with flattening and it doesn't work :(

We can flatten our DLLs, but as soon as one of them depends on another DLL which we don't load directly (e.g. clblast.dll) it fails because the name is wrong. Afaik there's no way to fix that.

Downloading weights

This seems like a separate feature to loading DLLs and I think it'd be really cool! There's a lot less inherent risk downloading model weights instead of DLLs.

The proposed API (new LLamaWeights("Facebook/LLaMA", "llama2.gguf")) doesn't work: with such a huge download it has to be async, and a constructor can't be async. Easy fix though, we can add a new factory method like:

public static LLamaWeights LoadFromFile(IModelParams @params);
public static async Task<LLamaWeights> LoadFromHuggingFace(string name, IModelParams @params);
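
Hypothetical usage of that factory (LoadFromFile already exists; LoadFromHuggingFace would be the new addition, and the repo/file names here are purely illustrative):

var @params = new ModelParams("llama2.gguf");
using var weights = await LLamaWeights.LoadFromHuggingFace("Facebook/LLaMA", @params);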

vvdb-architecture commented 7 months ago

I concur with @martindevans: downloading on the fly is a security risk, and in most enterprise environments this will be blocked outright for back-ends used with LLamaSharp. So the alternative of manually associating a back-end should still be an option.

SignalRT commented 7 months ago

I think it could be a good fit for distributing an application to different desktop environments, but it doesn't fit server environments, where you will be using containers to deploy the code. From the security side, and after the XZ backdoor, I don't see this as a step towards being more secure.

AsakusaRinne commented 7 months ago

I think we should include this as a separate backend (e.g. LLamaSharp.Backend.Automatic), which can be installed if people want to use this.

My idea is a bit different: I prefer to add the auto-downloading as a configuration in the main package, for example NativeLibraryConfig.WithAutoDownload(). That's because if the user neither installs a backend nor specifies a self-managed native library path, it seems that he/she doesn't know which native library to use, or it just doesn't matter to them. In this case, auto-downloading should be an available option, instead of asking the user to install yet another backend just for auto-downloading. As illustrated in the proposal, if a backend has been installed, it will have higher priority than auto-downloading.

When we build the release could we embed the commit ID directly into the source code and release that?

I agree with that. We could add a static variable to LLamaSharp to mark the default commit_id for the current version.

That way you can just download from GitHub e.g. https://github.com/SciSharp/LLamaSharp/blob/a8ba9f05b3f44cdd3368310b32b211245eda17bc/LLama/runtimes/deps/avx512/llama.dll

I'm not sure if this could be done without manually clicking the download button on the website. Afaik, we can't use wget to download a file directly from a GitHub repo. Please correct me if I'm wrong; I would certainly like to use GitHub instead of huggingface as our blob storage.


The security issue is absolutely one of the most important things. I don't mean to drop the existing backends, but only to provide one more option for users, especially for those who are new to programming/LLMs and those who want to distribute desktop apps. It could be disabled by default, with the user deciding whether to enable it. :)

zsogitbe commented 7 months ago

I am sorry, but I do not think that this is the right solution, for several reasons:

  - it does not provide debug versions
  - it cannot provide all possible setups
  - security problems with dlls made by someone else

What I am doing now takes 10 minutes to make the backend: I use the CMake GUI to generate a VS solution and compile llama.dll and llava_shared.dll, and I use these directly in the C# code.

But even this can be automated with the CMake script I have started in https://github.com/SciSharp/LLamaSharp/tree/experimental_cpp. You could finalize that and have any backend set up automatically (Release/Debug) for any user.

martindevans commented 7 months ago

it does not provide debug versions

It could, if we wanted to (just add a WithLLamaCppDebug(true) method). However, there have been issues before with debug builds not working with LLamaSharp (we never got to the bottom of it; it looked like a DEBUG-only bug in llama.cpp), so we'd have to be careful with testing when introducing that.

it cannot provide all possible setups

It can provide more setups if done right. Our current backend packages can package anything of course, but because they're packaged together we do have to make a decision on what's worth including (e.g. CUDA with every AVX variant would massively bloat things). With auto download we could just have every variant precompiled and sitting in the repo, ready for auto download. Less bloat, more hardware support!

One proposal we've discussed before is to provide every individual backend as a different nuget package with exactly one binary in it (e.g. LLamaSharp.Backend.CUDA11_AVX512), then we would provide our current backend packages which simply depend on all the relevant backends (e.g. LLamaSharp.Backend.CUDA11 depends on LLamaSharp.Backend.CUDA11_AVX512 and all other variants). This allows you to select exactly one specific backend if you want (just depend on what you need) or to be a bit more general (all CUDA11). I think that'd be an improvement for server-deployed apps over how we currently have things set up.

security problems with dlls made by someone else

Agreed, definitely a problem with providing native dependencies. I've moved all the build work over to GH actions, so it's fairly public/auditable, but still not perfect.

CMake etc...

This doesn't solve half of the problem. There are two types of applications with very different needs from backends.

Server apps are compiled and deployed to a specific platform and in practice only need one specific backend. CMake works for that - you just compile the DLL you need. That's fully supported at the moment, since you can install no backend package and just drop the DLL in the appropriate place. I do think your work with cmake/submodules to make this use-case easier is valuable!

However applications deployed to an end-user (e.g. a game using LLamaSharp) cannot just ship one single backend. You absolutely need feature detection to select the best backend for whatever the user has. Obviously you have no idea what that might be when shipping the app and if feature detection wasn't built into LLamaSharp everyone would need to build it themselves.

The current backend situation isn't ideal for the app use case, since you need to ship a load of DLLs to every user even though they only need one - you just don't know which one. This is where auto downloading would be great.

AsakusaRinne commented 7 months ago

About the security problem: I think we could allow the user to specify a huggingface repo, instead of using the official one provided by us. Going one step further, we could let the user decide the full downloading behavior, for example by providing an API like the one below.

NativeLibraryConfig.SetDownloadHandler(IDownloadHandler handler);

public interface IDownloadHandler
{
    // We pass the best configuration we detected, but the user could certainly ignore it.
    // The `Path` in `recommendedConfiguration` is a URL pointing to a file in our official repo.
    // The returned value should contain the real path to the local file, and the selected library type.
    NativeLibraryInfo Download(NativeLibraryInfo recommendedConfiguration);
}

// Path: the path of the library file.
// IsCuda: whether it's compiled with CUDA.
// AvxLevel: which AVX level the library uses.
// CommitHash: optional, the commit id of the llama.cpp repo it was compiled from.
public record class NativeLibraryInfo(string Path, bool IsCuda, AvxLevel AvxLevel, string? CommitHash = null);

Though it's still not perfect, it could help reduce the security risks.
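
For example, a team in a locked-down environment could point the handler at an internal mirror instead of our repo. A sketch assuming the interface above; the mirror URL and the blocking download are purely illustrative:

using System.IO;
using System.Net.Http;

public class SelfHostedDownloadHandler : IDownloadHandler
{
    private static readonly HttpClient Http = new();

    public NativeLibraryInfo Download(NativeLibraryInfo recommendedConfiguration)
    {
        // Ignore the official URL; fetch the same file name from an internal mirror instead.
        var fileName = Path.GetFileName(recommendedConfiguration.Path);
        var url = $"https://mirror.example.com/llama-sharp/{fileName}";

        var localPath = Path.Combine(Path.GetTempPath(), fileName);
        File.WriteAllBytes(localPath, Http.GetByteArrayAsync(url).GetAwaiter().GetResult());

        // Keep the detected configuration; only the path changes.
        return recommendedConfiguration with { Path = localPath };
    }
}

// NativeLibraryConfig.SetDownloadHandler(new SelfHostedDownloadHandler());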

zsogitbe commented 7 months ago

My personal opinion is that you are making this much more complicated than necessary. The game developer you mentioned above would just run CMake with 2-3 configurations (each takes 5-6 min) and get a different VS solution for each setup. Or you could just make one VS solution with a fine-tuned CMake script where you simply select the configuration in VS... I do not think that a professional software company would distribute DLLs from someone else if they can just compile them easily...

AsakusaRinne commented 7 months ago

I've made a prototype of the library I mentioned above -- HuggingfaceHub. It can download files from huggingface now, though it needs more testing. At least, it proves that it's absolutely possible to implement this proposal, and possibly to support automatically downloading models in the future.

I do not think that a professional software company would distribute DLLs from someone else if they can just compile it easily.

@zsogitbe I agree that a professional software company won't use the auto-downloading in their distributions. However, it's just as important to provide convenience for developers/users who are not so experienced. As you can see, most of the issues and PRs are opened by individuals, rather than employees whose company uses LLamaSharp. I started programming in 2016 with VB.NET but had never used CMake until 2020. I believe there are many .NET developers who don't have experience with C++/CMake. Thus it's necessary to provide a way for them to use LLamaSharp as easily as possible.

Your work in experimental_cpp, if I'm not misunderstanding it, is mainly to solve the problem that the user could only compile llama.cpp from the CLI instead of a GUI. Would you like to add some docs about how to use it with the GUI (maybe some screenshots?) so that we could merge it into the branch and let users know about this option?

martindevans commented 7 months ago

A game developer you mentioned above would just run CMake with 2-3 configurations

But which 2-3 configurations would they choose (there are a lot more than 2-3 possible configurations)? Once they had those 3 DLLs, how would the software select which one to use on the end-user machine? This is back to the problem we're trying to solve by distributing all the builds and using auto selection, except now the developer has to implement it all instead of us!

CMake and self-built DLLs are great for server software, but they're orthogonal to the other issue of selecting which backend to use for applications.

I do not think that a professional software company would distribute DLLs from someone

Agreed, but they will probably still need auto selection (unless they rebuild it themselves as part of the installer) and can maybe even use auto downloading if we allow it to be configured with self-hosted URLs.

zsogitbe commented 7 months ago

@zsogitbe I agree that a professional software company won't use the auto-downloading in their distributions. However, it's just as important to provide convenience for developers/users who are not so experienced. As you can see, most of the issues and PRs are opened by individuals, rather than employees whose company uses LLamaSharp. I started programming in 2016 with VB.NET but had never used CMake until 2020. I believe there are many .NET developers who don't have experience with C++/CMake. Thus it's necessary to provide a way for them to use LLamaSharp as easily as possible.

Your work in experimental_cpp, if I'm not misunderstanding it, is mainly to solve the problem that the user could only compile llama.cpp from the CLI instead of a GUI. Would you like to add some docs about how to use it with the GUI (maybe some screenshots?) so that we could merge it into the branch and let users know about this option?

Maybe we need to think more about it to find the best solution. The CMake GUI (Graphical User Interface) provides an interactive way to configure CMake projects: you define some parameters (cuda, avx2, ...) and it generates the VS solution automatically. After this you add the C++ projects you need to the C# projects like this (this is how I use the library - clean and easy to understand):

(screenshot: Visual Studio solution with the C++ projects added alongside the C# projects)

But I do not know how to generate multiplatform DLLs on Windows (i.e. for Mac and Linux).

My initial work on experimental_cpp attempts to do the above automatically (with manually changing some parameters in CMakeLists.txt to choose cuda, avx2, etc.).

zsogitbe commented 7 months ago

Added a PR with the automatic solution generator (work-in-progress code): https://github.com/SciSharp/LLamaSharp/pull/674/commits/9c91fac20f3ebde5d1f1bc6a9feacaaa61c4d087

dluc commented 6 months ago

While the idea of downloading might work for local boxes, IMO it's a no-no for production use. E.g. downloading remote code at startup means having a backdoor open for code injection, plus it's a bootstrap performance killer for lambdas. Even with some caching (which would complicate things), the security concern is pretty big. Packages should have a strong signature with a trusted cert (who would decide what to trust?) and the client downloading assemblies would have to verify these signatures. Anyway, it seems like a lot of additional complexity and trust :-)

Without going too much off topic: .NET has a pretty robust dependency injection framework - why not allow detecting the platform hardware and then simply injecting the right backend, without overriding assembly names?

AsakusaRinne commented 6 months ago

@dluc Would it sound reasonable if we leave it up to the user to decide whether or not to use this feature? As you can see here, developers who use LLamaSharp could insert a downloading process before the library loading. What that PR does is add an official implementation of this feature.

I have a question here. Many applications have a plugin system. When the user downloads a plugin, they're actually downloading some remote code. What's the difference between plugin downloading and this proposal? Is plugin downloading safer for some reason?

why not allow to detect the plat hw, and then simply inject the right backend, without overriding assembly names?

We already support selecting the right native library according to the system information (code, doc). However, it's triggered by setting the DllImportResolver rather than through dependency injection. Could you please say more about how to implement the dynamic loading via dependency injection?
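
For reference, a stripped-down sketch of that existing mechanism: a DllImportResolver maps the logical library name used by [DllImport] onto whichever concrete file was selected (the selection logic itself is omitted here):

using System;
using System.Reflection;
using System.Runtime.InteropServices;

static class NativeResolverSketch
{
    public static void Install(string selectedLibraryPath)
    {
        // Hook the resolver for this assembly; "llama" is the logical name used in [DllImport].
        NativeLibrary.SetDllImportResolver(typeof(NativeResolverSketch).Assembly,
            (libraryName, assembly, searchPath) =>
            {
                if (libraryName == "llama")
                    return NativeLibrary.Load(selectedLibraryPath);

                return IntPtr.Zero; // fall back to the default resolution for everything else
            });
    }
}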