
Llama 2 Llama Index API

An experimental self-hosted instance of Llama 2 using LlamaIndex for providing cyber security insights and context.


NOTE

Discovered that using LlamaIndex's persist-to-disk feature requires a local embedding model, but it still needs internet access, an OpenAI API token, and authentication to Hugging Face.

Embeddings are used in LlamaIndex to represent your documents using a sophisticated numerical representation. Embedding models take text as input, and return a long list of numbers used to capture the semantics of the text.
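
As a concrete sketch of the local-embedding setup mentioned in the note above (the import path assumes the llama-index 0.9-era API, and the model name is just an example):

# Sketch: using a local Hugging Face embedding model with LlamaIndex
# (import path assumes llama-index ~0.9; model name is an example)
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector = embed_model.get_text_embedding("Failed password for root from 10.0.0.1")
print(len(vector))  # bge-small-en-v1.5 produces 384-dimensional vectors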


Architecture

Llama 2

Why Llama 2?

Llama 2 is Meta's open source large language model (LLM). It's basically the Facebook parent company's response to OpenAI's GPT models and Google's AI models like PaLM 2—but with one key difference: it's freely available for almost anyone to use for research and commercial purposes.

Meta has a great Getting Started page as well as a Getting to Know Llama Jupyter notebook.

LlamaIndex

LLMs offer a natural language interface between humans and data. Widely available models come pre-trained on huge amounts of publicly available data like Wikipedia, mailing lists, textbooks, source code and more.

However, while LLMs are trained on a great deal of data, they are not trained on your data, which may be private or specific to the problem you’re trying to solve. It’s behind APIs, in SQL databases, or trapped in PDFs and slide decks.

LlamaIndex solves this problem by connecting to these data sources and adding your data to the data LLMs already have. This is often called Retrieval-Augmented Generation (RAG). RAG enables you to use LLMs to query your data, transform it, and generate new insights. You can ask questions about your data, create chatbots, build semi-autonomous agents, and more. To learn more, see the Use Cases section of the LlamaIndex documentation.

Read about the high-level concepts of LlamaIndex here.

How can LlamaIndex help?

LlamaIndex provides the following tools:

Data connectors ingest your existing data from its native source and format (APIs, PDFs, SQL, and more).
Data indexes structure your data in intermediate representations that are easy for LLMs to consume.
Engines provide natural language access to your data, for example query engines and chat engines.
Data agents are LLM-powered knowledge workers augmented by tools.
Application integrations tie LlamaIndex back into the rest of your ecosystem (LangChain, Flask, Docker, and more).

Retrieval Augmented Generation (RAG)

LLMs are trained on enormous bodies of data but they aren’t trained on your data. Retrieval-Augmented Generation (RAG) solves this problem by adding your data to the data LLMs already have access to. You will see references to RAG frequently in this documentation.

In RAG, your data is loaded and prepared for queries or “indexed”. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.

Even if what you’re building is a chatbot or an agent, you’ll want to know RAG techniques for getting data into your application.
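
A minimal sketch of that load-index-query loop with LlamaIndex (names assume the llama-index 0.9-era API; by default it calls OpenAI unless a local LLM and embedding model are configured):

# Minimal RAG sketch: load, index, query
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # load files from ./data
index = VectorStoreIndex.from_documents(documents)     # embed and index them
query_engine = index.as_query_engine()                 # wrap the index for queries

response = query_engine.query("How many failed password attempts were there?")
print(response)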


Stages within RAG

There are five key stages within RAG, which in turn will be a part of any larger application you build. These are:

Loading: getting your data from where it lives (text files, PDFs, a database, an API) into your pipeline.
Indexing: creating a data structure that allows querying the data, typically via vector embeddings.
Storing: persisting the index and its metadata so you do not have to re-index on every run.
Querying: retrieving relevant context and passing it, along with your question, to the LLM.
Evaluation: objectively measuring how accurate, faithful, and fast your responses are.

Read more here.
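
The Storing stage is the persist-to-disk path mentioned in the note above. A minimal sketch, again assuming the llama-index 0.9-era API:

# Persist the index to disk, then reload it instead of re-embedding
from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("data").load_data())
index.storage_context.persist(persist_dir="./storage")  # write index + embeddings

# On later runs, load the stored index instead of rebuilding it
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)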

FastAPI

Why FastAPI?

FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.8+ based on standard Python type hints.

The key features are:

Fast: very high performance, on par with NodeJS and Go.
Fast to code: type hints mean fewer bugs and less time spent on boilerplate.
Automatic interactive API documentation via OpenAPI (Swagger UI and ReDoc).
Standards-based: built on, and fully compatible with, OpenAPI and JSON Schema.
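
As a hypothetical sketch of the shape of such an API (the actual app/main.py in this repo may differ; the stub function stands in for the LlamaIndex query engine):

# Hypothetical /ask endpoint sketch; not the repo's actual main.py
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Ask(BaseModel):
    question: str

def answer_question(question: str) -> str:
    # Stand-in for a LlamaIndex query engine built at startup
    return f"(stub) you asked: {question}"

@app.post("/ask")
def ask(body: Ask):
    return {"answer": answer_question(body.question)}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)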

Getting Started with the API

Step 1: Install Pre-reqs on Windows

Installation will fail if a C++ compiler cannot be located (one is typically needed to build llama-cpp-python). To get one on Windows, install the Visual Studio Build Tools with the "Desktop development with C++" workload, then install the Python dependencies:

# Install required libraries
pip install -r requirements.txt
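
The repository's requirements.txt is not reproduced here; as a rough, hypothetical sketch, a stack like this one needs at least:

# Hypothetical minimal requirements for this stack; the repo's actual
# requirements.txt may pin versions or include more packages
fastapi
uvicorn
llama-index
llama-cpp-python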

Step 2: Dump the data you want to query (for example, log files) into the data folder

Step 3: Load API

cd .\app
python .\main.py

Step 4: Call the API

# Use POST: unlike GET, the request body is not subject to URL length limits

import requests

url = "http://localhost:8000/ask"

# requests serializes the dict and sets the Content-Type: application/json header
payload = {"question": "Q: How many failed password attempts were there?"}

response = requests.post(url, json=payload)

print(response.text)

The response was:

There were 10 failed password attempts in the log file.

Snippet of JSON response:

{"answer":{"response":" There were 10 failed password attempts in the log file.","source_nodes":[{"node":{"id_":"b5223a47-5b13-462c-8755-3b9430d3d984","embedding":null,"metadata":{"file_path":"..\\storage\\OpenSSH_2k.log_structured.csv","file_name":"OpenSSH_2k.log_structured.csv","file_type":"text/csv","file_size":357677,"creation_date":"2023-11-24","last_modified_date":"2023-11-22","last_accessed_date":"2023-11-24"},"excluded_embed_metadata_keys":["file_name","file_type","file_size","creation_date","last_modified_date","last_accessed_date"],"excluded_llm_metadata_keys":["file_name","file_type","file_size","creation_date","last_modified_date","last_accessed_date"],"relationships":{"1":{"node_id":"8b2e22fe-2544-4d5d-99a8-eecdc86e31a4","node_type":"4","metadata":{"file_path":"..\\storage\\OpenSSH_2k.log_structured.csv","file_name":"OpenSSH_2k.log_structured.csv","file_type":"text/csv","file_size":357677,"creation_date":"2023-11-24","last_modified_date":"2023-11-22","last_accessed_date":"2023-11-24"},"hash":"a6256618596266c0bff3b825e045300cfbca908b591d5bbb65a484f3eb98284f"},"2":{"node_id":"77838434-3f21-460f-99ae-5971b79cf67f","node_type":"1","metadata":{"file_path":"..\\storage\\OpenSSH_2k.log_structured.csv","file_name":"OpenSSH_2k.log_structured.csv","file_type":"text/csv",

Response times:

llama_print_timings:        load time =   10008.56 ms
llama_print_timings:      sample time =       7.26 ms /    14 runs   (    0.52 ms per token,  1927.84 tokens per second)
llama_print_timings: prompt eval time =  874970.77 ms /  1546 tokens (  565.96 ms per token,     1.77 tokens per second)
llama_print_timings:        eval time =   11232.27 ms /    13 runs   (  864.02 ms per token,     1.16 tokens per second)
llama_print_timings:       total time =  887042.66 ms

CPU vs GPU Comparisons

With a decent CPU but without any GPU assistance, expect output on the order of 1 token per second, and excruciatingly slow prompt ingestion. Any decent Nvidia GPU will dramatically speed up ingestion, but for fast generation, you need 48GB VRAM to fit the entire model. That means 2x RTX 3090 or better.
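
If the model is served through llama-cpp-python, layers can be offloaded to the GPU. A hedged sketch (the model path is hypothetical, and a CUDA-enabled llama-cpp-python build is required):

# Offload all model layers to the GPU via llama-cpp-python
# (model path is hypothetical; requires a CUDA build of llama-cpp-python)
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # example local GGUF file
    model_kwargs={"n_gpu_layers": -1},  # -1 = offload every layer to the GPU
)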

Time to Load Model

| Azure Virtual Machine Size | CPU | GPU | Time to Load in ms |
| --- | --- | --- | --- |
| Standard_D4s_v4 | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | None | 600000 ms |
| NC4as_T4_v3 | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 16GB Nvidia Tesla T4 GPU | xxxx ms |

Response Times to Question

| Azure Virtual Machine Size | CPU | GPU | Time to Response in ms |
| --- | --- | --- | --- |
| Standard_D4s_v4 | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | None | 887042 ms |
| NC4as_T4_v3 | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 16GB Nvidia Tesla T4 GPU | xxxx ms |

Associated Projects