install pip dependencies: pip install -r requirements.txt
on macOS, follow the llama-cpp-python installation instructions
run bash scripts/install-data.sh
to download the Llama model (13B-Chat) from my server (~8 GB)
run python3 -m fastchat.serve.cli --model-path lmsys/fastchat-t5-3b-v1.0
to download the fastchat-t5 model (~7 GB)
running ptkb_similarity or rank_passage_sentences (both run during a full run) will download several BERT models (a few GB total)
set the PYTHONPATH variable: export PYTHONPATH=$PWD:$PYTHONPATH
download wikitext-103 from Cloudflare Einstein (~600 MB uncompressed), unzip it, and drag it into data/text-classification (used to train the passage classifier)
download the news-articles corpus from Kaggle (~2 GB) and place it in data/news-articles (used to train the 'less strict' passage classifier)
if you're using ChatGPT instead of Llama, generate an OpenAI API key, create a .env file in the repository root, and put your key in it: OPENAI_API_KEY=<KEY GOES HERE>
please note that a full-run using ChatGPT may use ~$1-3 of credit
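Reading the key out of .env can be sketched without extra dependencies; the `load_env` helper below is illustrative (the project may well use a library such as python-dotenv instead):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: copy KEY=VALUE lines into os.environ.
    (Illustrative helper, not part of the repo.)"""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks and comments
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# demo: write a sample .env, then load it
with open(".env", "w") as f:
    f.write("OPENAI_API_KEY=<KEY GOES HERE>\n")
load_env()
```

After this runs, code elsewhere can pick the key up with os.environ.get("OPENAI_API_KEY").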
This was tested, developed, and run on a computer running Ubuntu 22.04 with an RTX 3080 (10 GB version) and an Intel i7-11700K (16 CPU threads). Runs took between 3 and 28 hours, depending mostly on which LLM was used to generate the final responses. There are some tweaks you may need to make if you have different hardware:
in utils/llama2.py, set the thread count to however many CPU threads you have (if you're running on CPU)
change "device=cuda" in utils/ptkb_similarity.py and utils/rank_passage_sentences.py if you don't have a CUDA-capable GPU
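The tweaks above amount to picking a device string and a thread count from the available hardware. A hedged sketch (the function name `pick_settings` is illustrative; torch is only probed if it is installed):

```python
import os

def pick_settings():
    """Return (device, n_threads): "cuda" when torch reports a visible GPU,
    otherwise fall back to CPU using every available thread."""
    try:
        import torch  # part of the project's deep-learning stack
        has_gpu = torch.cuda.is_available()
    except ImportError:
        has_gpu = False
    device = "cuda" if has_gpu else "cpu"
    n_threads = os.cpu_count() or 1  # e.g. 16 on an i7-11700K
    return device, n_threads
```

The returned values would then be plugged into the files listed above in place of the hard-coded "device=cuda" and thread count.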
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
place the collection files (ikat_collection_2023_0n.json) into /data/clueweb/
run bash scripts/format_ikat_collection.sh
build the Lucene index (this will take a long time):
python -m pyserini.index.lucene --collection JsonCollection --input ~/TREC-iKAT/data/clueweb --index indexes/ikat_collection_2023 --generator DefaultLuceneDocumentGenerator --threads 1 --storePositions --storeDocvectors --storeRaw
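The formatting step has to leave the files in Pyserini's JsonCollection layout: JSON documents carrying an "id" (the docid stored in the index) and a "contents" field (the text that gets indexed). A minimal sketch of one such file, with a made-up docid and passage:

```python
import json

# Pyserini's JsonCollection expects documents with "id" and "contents";
# the docid and text below are placeholders, not real ClueWeb data.
docs = [
    {"id": "example-doc-0001", "contents": "example passage text"},
]
with open("example_collection.json", "w") as f:
    json.dump(docs, f, indent=2)
```

Every *.json file dropped into data/clueweb/ should follow this shape before the indexing command is run.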
format large output JSON files with: cat output/SEP1_RUN_1-2.json | python -m json.tool > output/STAR3.json