SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp

Add a best practice example for RAG #648

Open AsakusaRinne opened 5 months ago

AsakusaRinne commented 5 months ago

A better example with a guide is needed for RAG. The following aspects should be considered:

  1. Which model should be used to generate embeddings?
  2. Which tool should be used to achieve a best practice?
  3. Should LLamaSharp provide any extra APIs to make RAG easier to use? (A hypothetical sketch follows this list.)
  4. Is it possible for users to combine parallel inference (batched inference) with RAG?
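
On point 3, a purely hypothetical sketch of what a convenience API might look like (none of these types exist in LLamaSharp; every name here is invented for discussion):

using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical: a minimal RAG abstraction LLamaSharp could expose.
// "IRagPipeline", "RetrievedChunk" and all members are invented names.
public record RetrievedChunk(string Text, string SourceId, float Score);

public interface IRagPipeline
{
    // Embed and store a document, chunking it internally.
    Task IngestAsync(string documentText, string sourceId);

    // Embed the query and return the top-k most similar chunks.
    Task<IReadOnlyList<RetrievedChunk>> RetrieveAsync(string query, int topK = 3);

    // Retrieve, splice the chunks into a prompt, and stream the answer.
    IAsyncEnumerable<string> AskAsync(string question);
}
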
WesselvanGils commented 4 months ago

I'm trying to build basically exactly this right now. I recently figured out the BatchedExecutor, but integrating RAG into that pipeline is proving difficult. I wouldn't mind turning my final result into an example.

I'd like to verify an assumption as well: "When combining text generation and RAG in one application, 3 model instances are needed: one for generating embeddings, one for retrieval generation, and one for text generation". I feel like those last two instances could be one, but I don't know whether this is possible, because creating a KernelMemory instantiates a separate model.

I currently have this:

using LLama.Native;
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;
using Microsoft.KernelMemory.FileSystem.DevTools;
using Microsoft.KernelMemory.MemoryStorage.DevTools;

// Point LLamaSharp at the native llama.cpp library before any model is loaded
string nativePath = "<path to native llama>";
NativeLibraryConfig.Instance.WithLibrary(nativePath, null);

string generationModelPath = "<path to any LLM in GGUF format>";
string embeddingModelPath = "<path to any embedding model in GGUF format>";
string storageFolder = "<path to storage folder>";

var llamaGenerationConfig = new LLamaSharpConfig(generationModelPath);
var llamaEmbeddingConfig = new LLamaSharpConfig(embeddingModelPath);
var vectorDbConfig = new SimpleVectorDbConfig() { Directory = storageFolder, StorageType = FileSystemTypes.Disk };

// Wire the generation model, the embedding model and the vector store into Kernel Memory
var memory = new KernelMemoryBuilder()
    .WithLLamaSharpTextGeneration(llamaGenerationConfig)
    .WithLLamaSharpTextEmbeddingGeneration(llamaEmbeddingConfig)
    .WithSimpleVectorDb(vectorDbConfig)
    .Build();

Console.WriteLine("\n================== INGESTION ==================\n");

Console.WriteLine("Uploading text about E=mc^2");
await memory.ImportTextAsync("""
    In physics, mass–energy equivalence is the relationship between mass and energy 
    in a system's rest frame, where the two quantities differ only by a multiplicative
    constant and the units of measurement. The principle is described by the physicist
    Albert Einstein's formula: E = m*c^2
""");

Console.WriteLine("Uploading article file about Carbon");
await memory.ImportDocumentAsync("wikipedia.txt");

Console.WriteLine("\n================== RETRIEVAL ==================\n");

var question = "What's E = m*c^2?";
Console.WriteLine($"Question: {question}");

var answer = await memory.AskAsync(question);
Console.WriteLine($"\nAnswer: {answer.Result}\n\n  Sources:\n");

// Show sources / citations
foreach (var x in answer.RelevantSources)
{
    Console.WriteLine(x.SourceUrl != null
        ? $"  - {x.SourceUrl} [{x.Partitions.First().LastUpdate:D}]"
        : $"  - {x.SourceName}  - {x.Link} [{x.Partitions.First().LastUpdate:D}]");
}

I adapted this from an example in Microsoft's KernelMemory repository, but its current answer to everything is:

warn: Microsoft.KernelMemory.Search.SearchClient[0]
      No memories available

Answer: INFO NOT FOUND

  Sources:

Edit: I fixed this by removing the minRelevance parameter from AskAsync().
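
For reference, minRelevance is an optional argument of Kernel Memory's AskAsync, and a non-zero threshold can filter out every result. A sketch of the kind of call that triggers the warning above (the 0.7 value is only an illustrative guess; the actual value isn't shown in this thread):

// Hypothetical reconstruction: a relevance threshold filters out all
// memories scoring below it, which produces the "No memories available"
// warning and the "INFO NOT FOUND" answer.
var answer = await memory.AskAsync(question, minRelevance: 0.7);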

AsakusaRinne commented 4 months ago

I'd like to verify an assumption as well: "When combining text generation and RAG in one application, 3 model instances are needed: one for generating embeddings, one for retrieval generation, and one for text generation". I feel like those last two instances could be one, but I don't know whether this is possible, because creating a KernelMemory instantiates a separate model.

I agree that 3 model instances are needed; however, I think the second one doesn't actually need to be an LLM. It could be an algorithm that finds the similarity between embeddings. Therefore, the last two are unlikely to be merged into one.
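
To make that concrete, the "retrieval model" can be as simple as cosine similarity between the query embedding and each stored chunk embedding; no second LLM is involved. A minimal sketch in plain C#:

// Cosine similarity between two embedding vectors: 1.0 means the vectors
// point in the same direction, 0.0 means they are orthogonal (unrelated).
static float CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Embeddings must have the same dimension.");

    float dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

Retrieval is then just: embed the query, score it against every stored chunk with this function, and keep the top-k chunks.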

TBH I'm not an expert on RAG either. I think you'll get a much better answer if you ask this question in the kernel-memory issues. :)

Thanks a lot for looking into this issue!

WesselvanGils commented 4 months ago

I did actually manage to figure this out with Semantic Memory. I'll put a proper example of that version together tomorrow. Its advantage over the solution above is that it just returns context using cosine similarity on embeddings, so you can use any executor simply by adding the context to the prompt.

WesselvanGils commented 4 months ago

using LLama;
using LLama.Common;
using LLama.Native;
using LLamaSharp.SemanticKernel.TextEmbedding;
using Microsoft.SemanticKernel.Connectors.Sqlite;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Text;
using System.Text;

// Initialize native library before anything else
string llamaPath = Path.GetFullPath("<path to local lib>/libllama.so");
NativeLibraryConfig.Instance.WithLibrary(llamaPath, null);

// Download a document and create embeddings for it
#pragma warning disable SKEXP0050, SKEXP0001, SKEXP0020

var embeddingModelPath = Path.GetFullPath("<path to embed model>/nomic-embed.gguf");
var embeddingParameters = new ModelParams(embeddingModelPath) { ContextSize = 4096, GpuLayerCount = 13, Embeddings = true };
var embeddingWeights = LLamaWeights.LoadFromFile(embeddingParameters);
var embedder = new LLamaEmbedder(embeddingWeights, embeddingParameters);

var service = new LLamaSharpEmbeddingGeneration(embedder);

ISemanticTextMemory memory = new MemoryBuilder()
    .WithMemoryStore(await SqliteMemoryStore.ConnectAsync("mydata.db"))
    .WithTextEmbeddingGeneration(service)
    .Build();

Console.WriteLine("===== INGESTING =====");

IList<string> collections = await memory.GetCollectionsAsync();

string folderPath = Path.GetFullPath("<path to folder>/Embeddings");
string[] files = Directory.GetFiles(folderPath);

string collectionName = "TestCollection";

if (collections.Contains(collectionName))
{
    Console.WriteLine("Found database");
}
else
{
    foreach (var item in files.Select((path, index) => new { path, index }))
    {
        Console.WriteLine($"Ingesting file #{item.index}");
        string text = File.ReadAllText(item.path);
        // Split into lines of at most 128 tokens, then group them into paragraphs of at most 512 tokens
        var paragraphs = TextChunker.SplitPlainTextParagraphs(TextChunker.SplitPlainTextLines(text, 128), 512);

        foreach (var para in paragraphs.Select((text, index) => new { text, index } ))
            await memory.SaveInformationAsync(collectionName, para.text, $"Document {item.path}, Paragraph {para.index}");
    }

    Console.WriteLine("Generated database");
}
Console.WriteLine("===== DONE INGESTING =====");

StringBuilder builder = new();

Console.Write("Question: ");
string question = Console.ReadLine()!;

Console.WriteLine("===== RETRIEVING =====");

List<string> sources = [];
await foreach (var result in memory.SearchAsync(collectionName, question, limit: 1, minRelevanceScore: 0))
{
    builder.AppendLine(result.Metadata.Text);
    sources.Add(result.Metadata.Id);
}

builder.AppendLine("""

Sources:
""");

foreach (string source in sources)
{
    builder.AppendLine($"    {source}");
}
Console.WriteLine("===== DONE RETRIEVING =====");

Console.WriteLine(builder.ToString());

#pragma warning restore SKEXP0001, SKEXP0050, SKEXP0020

We have to suppress some warnings here because Semantic Memory is technically still considered experimental. This just uses LLamaSharp to generate embeddings and lets us search anything compatible with Semantic Memory using those embeddings, returning the most relevant text chunks. It doesn't do any generation, so you have to add the context to the prompt manually.
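
To close the loop, the retrieved context can be handed to any LLamaSharp executor. A minimal sketch reusing the builder and question variables from the example above; StatelessExecutor and this prompt template are just one reasonable choice, not the only way to do it:

// Load a separate generation model (any instruction-tuned GGUF model).
string generationModelPath = Path.GetFullPath("<path to generation model>/model.gguf");
var generationParameters = new ModelParams(generationModelPath) { ContextSize = 4096 };
using var generationWeights = LLamaWeights.LoadFromFile(generationParameters);

// Any executor works; StatelessExecutor is the simplest for one-shot Q&A.
var executor = new StatelessExecutor(generationWeights, generationParameters);

// Splice the retrieved chunks into the prompt by hand.
string prompt = $"""
    Answer the question using only the context below.

    Context:
    {builder}

    Question: {question}
    Answer:
    """;

await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
    Console.Write(token);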

Something to consider is that this is generally just the first step of RAG; there are a lot of steps you can add between retrieval and adding the context to the prompt, such as returning multiple sources and reranking them, summarization, and so on (see the sketch below). I'll leave some helpful resources as well:

https://github.com/pchunduri6/rag-demystified
https://medium.com/@thakermadhav/build-your-own-rag-with-mistral-7b-and-langchain-97d0c92fa146
https://medium.com/@talon8080/mastering-rag-chatbots-building-advanced-rag-as-a-conversational-ai-tool-with-langchain-d740493ff328
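
On returning multiple sources and reranking them: with Semantic Memory, a first approximation is to raise the limit and order by the relevance score SearchAsync already returns. A real reranker would use a cross-encoder model; the limit, threshold, and top-3 values below are arbitrary:

// Retrieve several candidate chunks instead of one, then keep only the
// highest-scoring ones. Sorting by the built-in relevance score stands
// in for a proper reranking model here.
var candidates = new List<MemoryQueryResult>();
await foreach (var result in memory.SearchAsync(collectionName, question, limit: 5, minRelevanceScore: 0.4))
    candidates.Add(result);

foreach (var result in candidates.OrderByDescending(r => r.Relevance).Take(3))
{
    builder.AppendLine(result.Metadata.Text);
    sources.Add(result.Metadata.Id);
}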

AsakusaRinne commented 4 months ago

The example looks good. @xbotter Do you have any ideas about how to improve it further?