An experimental self-hosted instance of Llama 2 that uses LlamaIndex to provide cyber security insights and context.
NOTE
Discovered that LlamaIndex's persist-to-disk feature requires a local embedding model, but even then it still needs internet access, an OpenAI API token, and authentication to Hugging Face.
Embeddings are used in LlamaIndex to represent your documents using a sophisticated numerical representation. Embedding models take text as input, and return a long list of numbers used to capture the semantics of the text.
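For example, a local Hugging Face embedding model can be swapped in for the default OpenAI embeddings. A minimal sketch, assuming llama-index 0.9.x; the model name and data directory are illustrative rather than taken from this repo:

```python
# Minimal sketch, assuming llama-index 0.9.x. Model name and paths are
# illustrative, not this repo's actual configuration.
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

# A small local embedding model; the first run still downloads the weights
# from Hugging Face, which is why internet access is required.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# llm=None disables the default OpenAI LLM, so indexing needs no API key.
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)

documents = SimpleDirectoryReader("../storage").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```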
Llama 2 is Meta's open source large language model (LLM). It's basically the Facebook parent company's response to OpenAI's GPT models and Google's AI models like PaLM 2—but with one key difference: it's freely available for almost anyone to use for research and commercial purposes.
Meta has a great Getting Started page as well as a Getting to Know Llama Jupyter notebook.
LLMs offer a natural language interface between humans and data. Widely available models come pre-trained on huge amounts of publicly available data like Wikipedia, mailing lists, textbooks, source code and more.
However, while LLMs are trained on a great deal of data, they are not trained on your data, which may be private or specific to the problem you’re trying to solve. It’s behind APIs, in SQL databases, or trapped in PDFs and slide decks.
LlamaIndex solves this problem by connecting to these data sources and adding your data to the data LLMs already have. This is often called Retrieval-Augmented Generation (RAG). RAG enables you to use LLMs to query your data, transform it, and generate new insights. You can ask questions about your data, create chatbots, build semi-autonomous agents, and more. To learn more, check out the LlamaIndex Use Cases documentation.
Read about the high-level concepts of LlamaIndex here.
LlamaIndex provides the following tools:

- Data connectors ingest your existing data from their native source and format. These could be APIs, PDFs, SQL, and (much) more.
- Data indexes structure your data in intermediate representations that are easy and performant for LLMs to consume.
- Engines provide natural language access to your data (see the sketch after this list). For example:
  - Query engines are powerful retrieval interfaces for knowledge-augmented output.
  - Chat engines are conversational interfaces for multi-message, "back and forth" interactions with your data.
- Data agents are LLM-powered knowledge workers augmented by tools, from simple helper functions to API integrations and more.
- Application integrations tie LlamaIndex back into the rest of your ecosystem. This could be LangChain, Flask, Docker, ChatGPT, or… anything else!
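Assuming an `index` built as in the embedding sketch above, the two engine types look roughly like this (a sketch, not this repo's exact code):

```python
# Sketch only: both engines wrap the same vector index built earlier.
query_engine = index.as_query_engine()  # one-shot, retrieval-augmented Q&A
print(query_engine.query("How many failed password attempts were there?"))

chat_engine = index.as_chat_engine()    # multi-turn, keeps conversation state
print(chat_engine.chat("Which source IPs were responsible?"))
```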
LLMs are trained on enormous bodies of data but they aren’t trained on your data. Retrieval-Augmented Generation (RAG) solves this problem by adding your data to the data LLMs already have access to. You will see references to RAG frequently in this documentation.
In RAG, your data is loaded and prepared for queries or “indexed”. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.
Even if what you’re building is a chatbot or an agent, you’ll want to know RAG techniques for getting data into your application.
Stages within RAG
There are five key stages within RAG, which in turn will be a part of any larger application you build. These are:

- Loading: getting your data from where it lives (files, APIs, a database) into your pipeline.
- Indexing: creating a data structure, typically vector embeddings, that allows querying the data.
- Storing: persisting the index and other metadata so it does not have to be rebuilt.
- Querying: retrieving relevant context and passing it, along with a prompt, to the LLM.
- Evaluation: objectively measuring how accurate, faithful, and fast the responses are.

Read more here.
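A minimal end-to-end sketch of the first four stages with llama-index 0.9.x (evaluation omitted; paths are illustrative):

```python
from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# Loading: ingest raw files (e.g. a structured SSH log CSV) into Documents.
documents = SimpleDirectoryReader("../storage").load_data()

# Indexing: embed the documents into a vector index.
index = VectorStoreIndex.from_documents(documents)

# Storing: persist the index so it is not rebuilt on every run.
index.storage_context.persist(persist_dir="./index_store")

# ...on later runs, reload instead of re-indexing:
storage_context = StorageContext.from_defaults(persist_dir="./index_store")
index = load_index_from_storage(storage_context)

# Querying: retrieve the most relevant context and send it to the LLM.
query_engine = index.as_query_engine()
print(query_engine.query("How many failed password attempts were there?"))
```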
Why FastAPI?
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.8+ based on standard Python type hints.
The key features are:

- Fast: very high performance, on par with NodeJS and Go, thanks to Starlette and Pydantic.
- Fast to code, with fewer bugs: type hints drive editor completion and request validation.
- Intuitive, easy, and short: designed to be easy to learn and use, minimizing code duplication.
- Robust and standards-based: production-ready code with automatic interactive documentation, built on the OpenAPI and JSON Schema standards.
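A hedged sketch of how an endpoint like this app's /ask route might be wired up; the names and the persist directory are illustrative, not necessarily identical to app/main.py:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from llama_index import StorageContext, load_index_from_storage

app = FastAPI()

# Load the persisted index once at startup; the path is illustrative.
storage_context = StorageContext.from_defaults(persist_dir="../storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()

class Question(BaseModel):
    question: str

@app.post("/ask")  # POST: request bodies avoid GET's URL length limits
def ask(q: Question):
    answer = query_engine.query(q.question)
    return {"answer": str(answer)}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```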
To run this project:

1. Install the latest version of Python from python.org or your favourite package manager.
2. Create a virtual environment: `python -m venv .venv`
3. Activate the virtual environment: `.venv/Scripts/activate`
Installation will fail if a C++ compiler cannot be located. To get one on Windows, install the Visual Studio Build Tools with the "Desktop development with C++" workload; on Linux, install gcc/g++ through your package manager.
```shell
# Install required libraries
pip install -r requirements.txt
```

Then run the app:

```shell
cd .\app
python .\main.py
```
Browse to http://localhost:8000/docs to see the Swagger OpenAPI docs.

An example request:

```python
# Use POST, as request bodies have no character limit like GET URLs do
import requests
import json

url = "http://localhost:8000/ask"

payload = json.dumps({
    "question": "Q: How many failed password attempts were there?"
})
headers = {
    "Content-Type": "application/json"
}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
```
The response was:
```
There were 10 failed password attempts in the log file.
```
Snippet of JSON response:
{"answer":{"response":" There were 10 failed password attempts in the log file.","source_nodes":[{"node":{"id_":"b5223a47-5b13-462c-8755-3b9430d3d984","embedding":null,"metadata":{"file_path":"..\\storage\\OpenSSH_2k.log_structured.csv","file_name":"OpenSSH_2k.log_structured.csv","file_type":"text/csv","file_size":357677,"creation_date":"2023-11-24","last_modified_date":"2023-11-22","last_accessed_date":"2023-11-24"},"excluded_embed_metadata_keys":["file_name","file_type","file_size","creation_date","last_modified_date","last_accessed_date"],"excluded_llm_metadata_keys":["file_name","file_type","file_size","creation_date","last_modified_date","last_accessed_date"],"relationships":{"1":{"node_id":"8b2e22fe-2544-4d5d-99a8-eecdc86e31a4","node_type":"4","metadata":{"file_path":"..\\storage\\OpenSSH_2k.log_structured.csv","file_name":"OpenSSH_2k.log_structured.csv","file_type":"text/csv","file_size":357677,"creation_date":"2023-11-24","last_modified_date":"2023-11-22","last_accessed_date":"2023-11-24"},"hash":"a6256618596266c0bff3b825e045300cfbca908b591d5bbb65a484f3eb98284f"},"2":{"node_id":"77838434-3f21-460f-99ae-5971b79cf67f","node_type":"1","metadata":{"file_path":"..\\storage\\OpenSSH_2k.log_structured.csv","file_name":"OpenSSH_2k.log_structured.csv","file_type":"text/csv",
Response times:
```
llama_print_timings:        load time =  10008.56 ms
llama_print_timings:      sample time =      7.26 ms /    14 runs   (    0.52 ms per token,  1927.84 tokens per second)
llama_print_timings: prompt eval time = 874970.77 ms /  1546 tokens (  565.96 ms per token,     1.77 tokens per second)
llama_print_timings:        eval time =  11232.27 ms /    13 runs   (  864.02 ms per token,     1.16 tokens per second)
llama_print_timings:       total time = 887042.66 ms
```
With a decent CPU but without any GPU assistance, expect output on the order of 1 token per second, and excruciatingly slow prompt ingestion. Any decent Nvidia GPU will dramatically speed up ingestion, but for fast generation, you need 48GB VRAM to fit the entire model. That means 2x RTX 3090 or better.
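If a GPU is present, llama-cpp-python can offload model layers to it. A sketch using LlamaIndex's LlamaCPP wrapper (llama-index 0.9.x); the model path and parameter values are illustrative:

```python
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # illustrative path
    temperature=0.1,
    context_window=4096,
    # n_gpu_layers=-1 offloads as many layers as fit in VRAM; 0 = CPU only.
    model_kwargs={"n_gpu_layers": -1},
)
```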
| Azure Virtual Machine Size | CPU | GPU | Time to Response (ms) |
|---|---|---|---|
| Standard_D4s_v4 | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | None | 887042 |
| NC4as_T4_v3 | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 16GB Nvidia Tesla T4 GPU | xxxx |