ChatRTX is a demo app that lets you personalize a GPT large language model (LLM) connected to your own content: docs, notes, photos. Leveraging retrieval-augmented generation (RAG), TensorRT-LLM, and RTX acceleration, you can query a custom chatbot to quickly get contextually relevant answers. The app also accepts voice queries and can retrieve images that match your voice or text input. And because it all runs locally on your Windows RTX PC or workstation, you'll get fast and secure results. ChatRTX supports various file formats, including text, pdf, doc/docx, xml, png, jpg, and bmp. Simply point the application at the folder containing your files and it'll load them into the library in a matter of seconds.
This app supports the following AI models:
The pipeline incorporates the above AI models together with TensorRT-LLM, LlamaIndex, and the FAISS vector search library. The sample application here ships with a dataset consisting of recent articles sourced from NVIDIA GeForce News.
Retrieval-augmented generation (RAG) for large language models (LLMs) seeks to enhance prediction accuracy by connecting the LLM to your data during inference. This approach constructs a comprehensive prompt enriched with context, historical data, and recent or relevant knowledge.
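To make the idea concrete, the core RAG loop can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the ChatRTX pipeline itself: it assumes llama-index 0.10-style imports and the default in-memory vector store, and the './dataset' path is a placeholder. ChatRTX pairs the same pattern with TensorRT-LLM for inference and FAISS for vector search.

# Minimal RAG sketch (illustrative only; paths and defaults are assumptions)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./dataset").load_data()  # parse local files
index = VectorStoreIndex.from_documents(documents)          # embed and index the chunks
query_engine = index.as_query_engine()                      # retrieval + LLM answer
print(query_engine.query("What was announced in the latest GeForce News article?"))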
If you are using the ChatRTX installer, the installer sets up the models selected during installation. You can skip the installation steps below, launch the installed 'NVIDIA ChatRTX' desktop icon, and refer to the Use additional model section to add more models.
Install Python 3.10.11 and create a virtual environment:
python3.10 -m venv ChatRTX
ChatRTX\Scripts\activate
You can also use conda to create your virtual environment (optional)
conda create -n chatrtx_env python=3.10
conda activate chatrtx_env
Clone the ChatRTX code repo into a local directory (%ChatRTX Folder%) using Git for Windows and install the necessary dependencies. This directory will be the root directory for this guide.
git clone https://github.com/NVIDIA/trt-llm-rag-windows.git
cd trt-llm-rag-windows # root dir
# install dependencies
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cu121
Install the TensorRT-LLM wheel; it is already included in the wheel directory.
cd wheel
pip install tensorrt_llm-0.9.0-cp310-cp310-win_amd64.whl --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
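After the wheel installs, a quick import check confirms the package is usable; the printed version should match the wheel (0.9.0 here):

python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"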
Download 'ngcsdk-3.41.2-py3-none-any.whl' from here and install it using the command below. This enables downloads from NGC:
pip install .\ngcsdk-3.41.2-py3-none-any.whl
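For reference, the SDK can also be driven from Python. The snippet below is a hypothetical sketch: the Client methods and the model target string are assumptions modeled on the 'ngc registry model download-version' CLI command, so verify them against the NGC SDK documentation before relying on them.

# Hypothetical NGC SDK sketch; method names and target string are assumptions
from ngcsdk import Client

clt = Client()  # expects NGC credentials from your environment/config
clt.registry.model.download_version("nvidia/llama/mistral-7b-int4-chat:1.2")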
Download and install Microsoft MPI. You will be prompted to choose between an exe, which installs the MPI executable, and an msi, which installs the MPI SDK. Download and install both.
In this project, we use AWQ int4 quantized models for the LLMs. Before using a model, you'll need to build a TensorRT engine specific to your GPU. The steps to build the engine are below.
Create a model directory for the Mistral models
cd model
mkdir mistral_model
cd mistral_model
# Create the relevant directories
mkdir engine model_checkpoints tokenizer
Download the tokenizer files into the model/mistral_model/tokenizer directory
cd model/mistral_model/tokenizer
# Use curl to download the tokenizer files
"C:\Windows\System32\curl.exe" -L -o config.json "https://api.ngc.nvidia.com/v2/models/org/nvidia/team/llama/mistral-7b-int4-chat/1.2/files?redirect=true&path=mistral7b_hf_tokenizer/config.json"
"C:\Windows\System32\curl.exe" -L -o tokenizer.json "https://api.ngc.nvidia.com/v2/models/org/nvidia/team/llama/mistral-7b-int4-chat/1.2/files?redirect=true&path=mistral7b_hf_tokenizer/tokenizer.json"
"C:\Windows\System32\curl.exe" -L -o tokenizer.model "https://api.ngc.nvidia.com/v2/models/org/nvidia/team/llama/mistral-7b-int4-chat/1.2/files?redirect=true&path=mistral7b_hf_tokenizer/tokenizer.model"
"C:\Windows\System32\curl.exe" -L -o tokenizer_config.json "https://api.ngc.nvidia.com/v2/models/org/nvidia/team/llama/mistral-7b-int4-chat/1.2/files?redirect=true&path=mistral7b_hf_tokenizer/tokenizer_config.json"
Download the Mistral AWQ int4 TRT-LLM checkpoints into the model/mistral_model/model_checkpoints folder
cd model/mistral_model/model_checkpoints
# Use curl to download the model checkpoint files
"C:\Windows\System32\curl.exe" -L -o config.json "https://api.ngc.nvidia.com/v2/models/org/nvidia/team/llama/mistral-7b-int4-chat/1.2/files?redirect=true&path=config.json"
"C:\Windows\System32\curl.exe" -L -o license.txt "https://api.ngc.nvidia.com/v2/models/org/nvidia/team/llama/mistral-7b-int4-chat/1.2/files?redirect=true&path=license.txt"
"C:\Windows\System32\curl.exe" -L -o rank0.safetensors "https://api.ngc.nvidia.com/v2/models/org/nvidia/team/llama/mistral-7b-int4-chat/1.2/files?redirect=true&path=rank0.safetensors"
"C:\Windows\System32\curl.exe" -L -o README.txt "https://api.ngc.nvidia.com/v2/models/org/nvidia/team/llama/mistral-7b-int4-chat/1.2/files?redirect=true&path=README.txt"
Build the Mistral TRT-LLM int4 AWQ Engine
#inside the root directory
trtllm-build --checkpoint_dir .\model\mistral_model\model_checkpoints --output_dir .\model\mistral_model\engine --gpt_attention_plugin float16 --gemm_plugin float16 --max_batch_size 1 --max_input_len 7168 --max_output_len 1024 --context_fmha=enable --paged_kv_cache=disable --remove_input_padding=disable
We use the following directories, created in the earlier steps, in the build command:

| Name | Details |
|---|---|
| --checkpoint_dir | TRT-LLM checkpoints directory |
| --output_dir | TRT-LLM engine directory |
Refer to the TRT-LLM repository to learn more about the various commands and parameters.
Create the directories to store the Whisper model
cd model
mkdir whisper
cd whisper
# Create the relevant directories
mkdir whisper_assets whisper_medium_int8_engine
Download model weights and tokenizer
cd model/whisper/whisper_assets
# Use curl to download the tokenizer and model weight files
"C:\Windows\System32\curl.exe" -L -o mel_filters.npz "https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/mel_filters.npz"
"C:\Windows\System32\curl.exe" -L -o multilingual.tiktoken "https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/multilingual.tiktoken"
"C:\Windows\System32\curl.exe" -L -o medium.pt "https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt"
Build command
# call command from root_dir
python .\whisper\build_files\build.py --output_dir .\model\whisper\whisper_medium_int8_engine --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --enable_context_fmha --max_batch_size 1 --max_beam_width 1 --model_name medium --use_weight_only --model_dir .\model\whisper\whisper_assets
We use the following directories, created in the earlier steps, in the build command (note that this build script takes --model_dir rather than --checkpoint_dir):

| Name | Details |
|---|---|
| --model_dir | Whisper model weights directory |
| --output_dir | TRT-LLM engine directory |
Refer to the TRT-LLM repository to learn more about the various commands and parameters.
Create the below directory structure in the model folder
cd model
mkdir multilingual-e5-base
Download the following 'multilingual-e5-base' embedding model files from here:
files to download: 1_Pooling/config.json, commit.txt, config.json, model.safetensors, modules.json, README.md, sentence_bert_config.json, sentencepiece.bpe.model, special_tokens_map.json, tokenizer.json, tokenizer_config.json
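Once the files are in model/multilingual-e5-base, the embedder can be smoke-tested with sentence-transformers. This is a hedged sketch assuming sentence-transformers is installed; note that E5 models expect 'query: ' and 'passage: ' prefixes on their inputs:

# Optional smoke test for the embedding model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("model/multilingual-e5-base")
emb = model.encode(["query: what is RAG?", "passage: RAG adds retrieved context to prompts."])
print(emb.shape)  # (2, 768) for e5-base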
Building the above two models is sufficient to run the app. Other models can be downloaded and built after running the app.
Running the commands below will launch the app UI in your browser:
# call commands from root_dir
python verify_install.py
python app.py
You can refer to User Guide for additional information on using the app.
If a model is not needed, it can be removed by:
The following known issues exist in the current version:
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.