
An attempt at the iKAT (Interactive Knowledge Assistance Track) track of NIST's 2023 TREC run.

TREC-iKAT

Official Page

Resources

Notes:

install pip dependencies: pip install -r requirements.txt (on macOS, see the macOS installation instructions for llama.cpp python)

run bash scripts/install-data.sh to download the Llama model (13B-Chat) from my server (~8 GB)
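
To sanity-check the downloaded model outside the full pipeline, a minimal llama-cpp-python sketch like the one below should work. The model filename is an assumption here; point it at whatever scripts/install-data.sh actually saves.

```python
# Minimal sketch (not part of this repo): load the downloaded 13B-Chat model
# with llama-cpp-python and run a short completion.
from llama_cpp import Llama

# Hypothetical path -- adjust to the file produced by scripts/install-data.sh.
llm = Llama(model_path="data/llama-2-13b-chat.bin", n_ctx=2048)

output = llm("Q: What does the iKAT track study? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])
```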

run python3 -m fastchat.serve.cli --model-path lmsys/fastchat-t5-3b-v1.0 to download the fastchat-t5 model (~7 GB)
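
If you prefer to call fastchat-t5 from Python rather than the interactive CLI, it is an ordinary seq2seq model on the Hugging Face Hub, so a transformers-based sketch like this should work (this assumes the transformers package; use_fast=False is a precaution for this tokenizer, not a repo requirement).

```python
# Minimal sketch (not from this repo): load fastchat-t5 directly with transformers.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0", use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

inputs = tokenizer("What is conversational search?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```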

running ptkb_similarity or rank_passage_sentances (both are run in full-run) will download several BERT models (a few GB total)
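For reference, the kind of PTKB-to-utterance scoring a step like ptkb_similarity performs can be sketched with sentence-transformers. The model name and exact scoring below are illustrative assumptions, not necessarily what this repo uses.

```python
# Illustrative sketch: rank PTKB statements by cosine similarity to an utterance.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, not the repo's

utterance = "Can you recommend a laptop for machine learning?"
ptkb_statements = [
    "I am a graduate student studying computer science.",
    "I am allergic to peanuts.",
    "I prefer lightweight laptops with long battery life.",
]

scores = util.cos_sim(model.encode(utterance), model.encode(ptkb_statements))[0]
for statement, score in sorted(zip(ptkb_statements, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {statement}")
```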

set the PYTHONPATH variable: export PYTHONPATH=$PWD:$PYTHONPATH

install wikitext-103 from Cloudflare Einstein (~600 MB uncompressed), unzip it, and place it in data/text-classification (used to train the passage classifier)

install the news articles corpus from Kaggle (~2 GB) and place it in data/news-articles (used to train the 'less strict' passage classifier)
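
The README does not spell out how these two corpora are turned into training labels, so purely as an illustration, training a binary passage classifier has roughly the shape below (a TF-IDF + logistic regression stand-in; the repo's actual classifier and labeling scheme will differ).

```python
# Purely illustrative, not this repo's classifier: a TF-IDF + logistic
# regression baseline for scoring passages. The label assignment here is an
# assumption, not how the repo necessarily uses wikitext-103 and the news corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for passages loaded from data/text-classification and data/news-articles.
positive_passages = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]
negative_passages = [
    "Breaking: local team wins the championship in a dramatic overtime finish.",
    "Stocks rallied today after the central bank held interest rates steady.",
]

texts = positive_passages + negative_passages
labels = [1] * len(positive_passages) + [0] * len(negative_passages)

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(texts, labels)
print(classifier.predict_proba(["A candidate passage to score."])[:, 1])
```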

if you're using ChatGPT instead of Llama, generate an OpenAI API key, create a .env file in the main directory, and put your key in it: OPENAI_API_KEY=<KEY GOES HERE>. Please note that a full run using ChatGPT may use ~$1-3 of credit
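
A minimal sketch of how the key in .env can be picked up from Python, assuming the python-dotenv and (pre-1.0) openai packages; the repo's actual ChatGPT call site may look different.

```python
# Illustrative sketch: load the API key from .env and make one ChatCompletion call.
import os

import openai
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from the .env file in the main directory
openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response["choices"][0]["message"]["content"])
```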

System Requirements

This was developed, tested, and run on a computer running Ubuntu 22.04 with an RTX 3080 (10 GB version) and an Intel i7-11700K (16 CPU threads). Runs took between 3 and 28 hours, depending mostly on which LLM was used to generate the final responses. You may need to make some tweaks if you have different hardware.

Running Pyserini with clueweb22

  1. Place the iKAT collection files (named ikat_collection_2023_0n.json) into data/clueweb/
  2. Format by running bash scripts/format_ikat_collection.sh (this will take a long time)
  3. Generate the index: python -m pyserini.index.lucene --collection JsonCollection --input ~/TREC-iKAT/data/clueweb --index indexes/ikat_collection_2023 --generator DefaultLuceneDocumentGenerator --threads 1 --storePositions --storeDocvectors --storeRaw
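
Once the index is built, a quick way to check it from Python is Pyserini's LuceneSearcher (the query below is just an example).

```python
# Illustrative sketch: query the index built in step 3.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/ikat_collection_2023")
hits = searcher.search("what laptops are good for machine learning", k=10)

for rank, hit in enumerate(hits, start=1):
    print(f"{rank:2d} {hit.docid} {hit.score:.4f}")
```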

Helpful Commands

Format large output JSON files with cat output/SEP1_RUN_1-2.json | python -m json.tool > output/STAR3.json

Runs We're Submitting (SUBJECT TO CHANGE) (IN ORDER)

  1. (Automatic) 2 shot approach
  2. (Automatic) 1 shot approach
  3. (Automatic) 1 shot approach without TF-IDF Reliability Model
  4. (Manual, using provided PTKBs and resolved_utterance) 1 shot approach, same as (1)
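
The TF-IDF Reliability Model mentioned in run 3 is not described here; purely as an illustration of the general idea, a TF-IDF-based reliability check might score a generated response against its supporting passages as sketched below (the repo's actual model may compute something quite different).

```python
# Purely illustrative guess, not this repo's model: score a generated response
# by its maximum TF-IDF cosine similarity to the passages it was grounded in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def reliability_score(response: str, passages: list[str]) -> float:
    vectorizer = TfidfVectorizer().fit(passages + [response])
    response_vec = vectorizer.transform([response])
    passage_vecs = vectorizer.transform(passages)
    return float(cosine_similarity(response_vec, passage_vecs).max())


passages = [
    "The 2023 iKAT track gives each user a personal text knowledge base (PTKB).",
    "Runs are automatic or manual depending on whether resolved utterances are used.",
]
print(reliability_score("iKAT provides a personal knowledge base per user.", passages))
```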