This is the official repository for our preprint: Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When Not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-Parametric Memories. 2022.
In this work, we conduct a large-scale knowledge-probing study of 10 language models (GPT-Neo series, OPT series, and GPT-3 series) under 4 settings: no augmentation (vanilla) and three augmentation approaches (BM25, Contriever, and GenRead), using our new open-domain QA dataset, PopQA.
We further introduce a simple yet effective method, Adaptive Retrieval, which retrieves and incorporates non-parametric memories only when necessary. Our experimental results show that Adaptive Retrieval not only improves performance but also reduces inference-time latency and GPT-3 API cost.
For any questions about the paper or the code, please contact the first authors or open an issue. If you find our code or paper useful, please cite the paper:
@article{mallen2023llm_memorization,
  title={When Not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-Parametric Memories},
  author={Mallen, Alex and Asai, Akari and Zhong, Victor and Das, Rajarshi and Hajishirzi, Hannaneh and Khashabi, Daniel},
  journal={arXiv preprint},
  year={2022}
}
Install the dependencies with:
pip install -r requirements.txt
We construct PopQA, an entity-centric open-domain QA dataset consisting of 14k QA pairs annotated with fine-grained Wikidata entity IDs, Wikipedia page views, and relationship types. Each instance looks like this:
{'id': 4222362, 'subj': 'George Rankin', 'prop': 'occupation', 'obj': 'politician', 'subj_id': 1850297, 'prop_id': 22, 'obj_id': 2834605, 's_aliases': '["George James Rankin"]', 'o_aliases': '["political leader","political figure","polit.","pol"]', 's_uri': 'http://www.wikidata.org/entity/Q5543720', 'o_uri': 'http://www.wikidata.org/entity/Q82955', 's_wiki_title': 'George Rankin', 'o_wiki_title': 'Politician', 's_pop': 142, 'o_pop': 25692, 'question': "What is George Rankin's occupation?", 'possible_answers': '["politician", "political leader", "political figure", "polit.", "pol"]'}
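List-valued fields such as possible_answers and s_aliases are stored as JSON-encoded strings, so they need to be decoded before use. The sketch below does that for a subset of the example instance's fields and applies a lenient match of the kind commonly used in open-domain QA: a prediction counts as correct if any gold answer appears in it. The helper names are hypothetical, and the exact evaluation rule in run_model.py may differ (for example in how it handles case).

```python
import json

# A subset of the example instance above; list fields are JSON-encoded strings.
instance = {
    "question": "What is George Rankin's occupation?",
    "obj": "politician",
    "s_pop": 142,
    "possible_answers": '["politician", "political leader", "political figure", "polit.", "pol"]',
}

def gold_answers(row):
    """Decode the JSON-encoded possible_answers string into a Python list."""
    return json.loads(row["possible_answers"])

def is_correct(prediction, row):
    """Hypothetical lenient match: correct if any gold answer occurs in the
    prediction as a case-insensitive substring."""
    pred = prediction.lower()
    return any(answer.lower() in pred for answer in gold_answers(row))

print(gold_answers(instance))
print(is_correct("He was a Canadian politician.", instance))
print(is_correct("He was a farmer.", instance))
```

Short aliases make this match lenient: because "pol" is a listed answer, a generation such as "He was a police officer." would also count as correct.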
The data is available in the data directory (data/popQA.tsv).
PopQA is also available on Hugging Face Datasets as akariasai/PopQA:
import datasets
popqa = datasets.load_dataset("akariasai/PopQA")["test"]
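Because PopQA records each subject's popularity (s_pop), a natural first look at the data is how questions are distributed across popularity levels. A minimal sketch, assuming the Hugging Face copy exposes the same integer s_pop field as the example instance above; the order-of-magnitude binning and the function names are illustrative choices, not necessarily the paper's.

```python
from collections import Counter

def popularity_bucket(s_pop):
    """Order-of-magnitude bin for a non-negative page-view count:
    0 for 0-9 views, 1 for 10-99, 2 for 100-999, and so on (via the digit count)."""
    return len(str(int(s_pop))) - 1

def bucket_counts(rows):
    """Count questions per popularity bin. rows is any iterable of dicts with
    an integer s_pop field, such as the popqa split loaded above."""
    return Counter(popularity_bucket(row["s_pop"]) for row in rows)
```

Iterating a datasets.Dataset yields one dict per row, so bucket_counts(popqa) gives the number of questions in each popularity bin.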
You can reproduce our zero-shot prompting experiments by running the command below:
python run_model.py \
--model_name MODEL_NAME \
--input_file data/popQA.tsv \
--eval_method vanilla
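To sweep several models without retyping the command, a small helper can assemble the argument list. This sketch uses only the flags that appear in this README's commands (--model_name, --input_file, --eval_method, --ret_file, --int8bit); the function name run_model_cmd is hypothetical, and the full set of options is defined in run_model.py itself.

```python
import sys

def run_model_cmd(model_name, eval_method="vanilla", input_file="data/popQA.tsv",
                  ret_file=None, int8bit=False):
    """Assemble a run_model.py command line from the flags shown in this README."""
    cmd = [sys.executable, "run_model.py",
           "--model_name", model_name,
           "--input_file", input_file,
           "--eval_method", eval_method]
    if ret_file is not None:
        # Retrieval-augmented runs (e.g. contriever) read precomputed retrieval results.
        cmd += ["--ret_file", ret_file]
    if int8bit:
        # 8-bit loading, used for the largest models.
        cmd.append("--int8bit")
    return cmd
```

For example, subprocess.run(run_model_cmd("EleutherAI/gpt-neox-20b", int8bit=True), check=True) reproduces the 8-bit GPT-NeoX-20B command.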
We use 8-bit (int8) quantization to run GPT-NeoX-20B and OPT-13B in our environment (a single V100 GPU with 32 GB of memory). To enable it, add the --int8bit flag:
python run_model.py \
--model_name EleutherAI/gpt-neox-20b \
--input_file data/popQA.tsv \
--eval_method vanilla \
--int8bit
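The need for 8-bit loading follows from simple arithmetic on weight storage: 16-bit weights take 2 bytes per parameter and int8 weights take 1. A back-of-the-envelope check, with parameter counts rounded to 20B and 13B and ignoring activations and framework overhead:

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate memory for model weights only (no activations or overhead)."""
    return n_params * bytes_per_param / 2**30

for name, n_params in [("GPT-NeoX-20B", 20e9), ("OPT-13B", 13e9)]:
    fp16 = weight_memory_gib(n_params, 2)  # 16-bit floats: 2 bytes per parameter
    int8 = weight_memory_gib(n_params, 1)  # int8: 1 byte per parameter
    print(f"{name}: fp16 ~{fp16:.1f} GiB, int8 ~{int8:.1f} GiB")
```

This prints about 37.3 GiB (fp16) versus 18.6 GiB (int8) for GPT-NeoX-20B, and 24.2 versus 12.1 GiB for OPT-13B: the 20B model's half-precision weights alone exceed a 32 GB card, while its 8-bit weights fit.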
To run retrieval-augmented LMs using BM25 or Contriever, please download the retrieval results here.
Then, you can run the retrieval-augmented baselines as follows:
python run_model.py \
--model_name MODEL_NAME \
--input_file data/popQA.tsv \
--eval_method contriever \
--ret_file PATH_TO_RETRIEVED_DOCUMENTS.jsonl
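Conceptually, a retrieval-augmented run conditions the LM on retrieved text placed before the question. A minimal sketch, assuming the retrieved passages for a question are already available as a list of strings ranked best-first; the prompt template, the function name, and the default of one passage (top_k=1) are illustrative assumptions, and run_model.py defines the actual format.

```python
def build_augmented_prompt(question, passages, top_k=1):
    """Hypothetical prompt format: prepend the top-k retrieved passages
    (ranked best-first) to the question."""
    context = "\n".join(passages[:top_k])
    return f"{context}\n\nQ: {question} A:"
```

For example, build_augmented_prompt("What is George Rankin's occupation?", ranked_passages) keeps only the highest-ranked passage as context.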
To run GenRead, you don't need to specify the retrieval file path.
python run_model.py \
--model_name MODEL_NAME \
--input_file data/popQA.tsv \
--eval_method genread
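GenRead replaces the retriever with the LM itself: the model first generates a background document for the question and then answers conditioned on that document, which is why no retrieval file is needed. A sketch of that two-step flow, where generate is a hypothetical prompt-to-completion function and both prompt templates are illustrative rather than the ones used in run_model.py:

```python
from typing import Callable

def genread_answer(question: str, generate: Callable[[str], str]) -> str:
    """Two-step GenRead-style sketch: generate a context, then read it to answer."""
    doc_prompt = f"Generate a background document to answer the given question.\n{question}\n"
    document = generate(doc_prompt)  # step 1: context comes from parametric memory
    answer_prompt = f"{document}\n\nQ: {question} A:"
    return generate(answer_prompt)  # step 2: answer conditioned on the generated context
```

Any LM wrapper that maps a prompt string to a completion string can be passed as generate.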
See the adaptive-retrieval.ipynb notebook: point it to the results files produced by run_model.py for a parametric evaluation (vanilla or GenRead) and a non-parametric evaluation (BM25 or Contriever), and it uses them to compute the Adaptive Retrieval results.
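A minimal sketch of one way to combine the two result sets, assuming a per-relation popularity threshold: questions whose subject popularity falls below the threshold use the non-parametric (retrieval) result, and the rest use the parametric result. The row format (prop, s_pop, and per-question correctness flags param_ok and nonparam_ok) and all function names are hypothetical, and the notebook's actual procedure may differ.

```python
from collections import defaultdict

def _num_correct(group, threshold):
    """Correct answers when rows with s_pop < threshold use the non-parametric result."""
    return sum(r["nonparam_ok"] if r["s_pop"] < threshold else r["param_ok"] for r in group)

def fit_thresholds(rows):
    """For each relation (prop), pick the popularity threshold that maximizes accuracy.
    Each row is a dict with keys prop, s_pop, param_ok (bool), nonparam_ok (bool)."""
    by_prop = defaultdict(list)
    for r in rows:
        by_prop[r["prop"]].append(r)
    thresholds = {}
    for prop, group in by_prop.items():
        # 0 = never retrieve, inf = always retrieve, observed values = every split in between.
        candidates = [0] + sorted({r["s_pop"] for r in group}) + [float("inf")]
        # max() keeps the first best candidate, so ties favor the least retrieval.
        thresholds[prop] = max(candidates, key=lambda t: _num_correct(group, t))
    return thresholds

def adaptive_correct(row, thresholds):
    """Retrieve only for subjects less popular than the relation's threshold."""
    t = thresholds.get(row["prop"], float("inf"))  # unseen relation: always retrieve (a design choice)
    return row["nonparam_ok"] if row["s_pop"] < t else row["param_ok"]
```

Fitting thresholds on the same questions used for scoring gives an optimistic estimate; tuning them on a held-out split is the safer choice.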