Open Tooba-ts1700550 opened 3 years ago
Are you sure that your BatchRetrieve is returning metadata? Cf.
DPH_br = pt.BatchRetrieve(index, controls={"wmodel" : "DPH"}, verbose=True, metadata=["docno", "body"])
I tried removing metadata=["docno", "body"], and I got this error on the same line:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'body'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
6 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:
KeyError: 'body'
I also tried metadata=["docno", "text"] and still got the same error.
Can you look at the output of batchretrieve - does it have a text column when you do a search?
See also this doc for how to index and retrieve with text: https://pyterrier.readthedocs.io/en/latest/text.html
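One quick way to answer that question is to list which of the expected columns are absent from the retrieval output. The helper below is only a sketch: missing_columns is a hypothetical name, not part of PyTerrier, and it assumes the retrieval stage returns a pandas DataFrame (or anything else with a .columns attribute).

```python
def missing_columns(results, required=("docno", "text")):
    """Return the required column names that are absent from `results`.

    `results` is assumed to expose a `.columns` attribute, as a pandas
    DataFrame does. `missing_columns` is a hypothetical helper name.
    """
    present = set(results.columns)
    return [name for name in required if name not in present]
```

For example, calling missing_columns(DPH_br.search("some query")) would return ['text'] if the retriever was built without the text metadata, and an empty list once metadata=["docno", "text"] is passed.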
Actually, I used the same code as before, and it worked in the other experiments I was doing; I only get this error with CEDR.
def msmarco_generate():
    dataset = pt.get_dataset("trec-deep-learning-passages")
    with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
        for l in corpusfile:
            docno, passage = l.split("\t")
            yield {'docno' : docno, 'text' : passage}
iter_indexer = pt.IterDictIndexer("./passage_index")
indexref = iter_indexer.index(msmarco_generate(), meta={'docno' : 20, 'text': 4096})
index = pt.IndexFactory.of(indexref)
trecdl = pt.get_dataset("trec-deep-learning-passages")
indexloc= "./passage_index/data.properties"
qrelsTest = trecdl.get_qrels("test-2020")
qrelsTrain = trecdl.get_qrels("train")
#take 1000 topics for training
topicsTrain = trecdl.get_topics("train").head(1000)
#take 50 topics for validation (rows 1000-1049, directly after the training topics)
topicsValid = trecdl.get_topics("train").iloc[1000:1050]
#this one-liner removes topics from the test set that do not have relevant documents
topicsTest = trecdl.get_topics("test-2020").merge(qrelsTest[qrelsTest["label"] > 0][["qid"]].drop_duplicates())
from pyterrier_bert.pyt_cedr import CEDRPipeline
DPH_br = pt.BatchRetrieve(index, controls={"wmodel" : "DPH"}, verbose=True)
DPH_br_qe = pt.BatchRetrieve(index, controls={"wmodel" : "DPH", "qe" : "on"}, verbose=True)
cedrpipe = DPH_br >> CEDRPipeline(max_valid_rank=20)
# training, this uses validation set to apply early stopping
cedrpipe.fit(topicsTrain, qrelsTrain, topicsValid, qrelsTrain)
# testing performance
pt.pipelines.Experiment(topicsTest,
                        [DPH_br, DPH_br_qe, cedrpipe],
                        ['map', 'ndcg'],
                        qrelsTest,
                        names=["DPH", "DPH + QE", "DPH + CEDR BERT"])
I think this is a mismatch in attribute names: CEDRPipeline expects 'body' rather than 'text'. You can change it in the constructor. See https://github.com/cmacdonald/pyterrier_bert/blob/master/pyterrier_bert/pyt_cedr.py#L10
OK, I changed "body" to "text" on line 10. Do I need to change it anywhere else as well? Now the KeyError is on 'text'.
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'text'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
6 frames
<ipython-input-5-58684445261e> in <module>()
6 cedrpipe = DPH_br >> CEDRPipeline(max_valid_rank=20)
7 # training, this uses validation set to apply early stopping
----> 8 cedrpipe.fit(topicsTrain, qrelsTrain, topicsValid, qrelsTrain)
9
10 # testing performance
/usr/local/lib/python3.7/dist-packages/pyterrier/transformer.py in fit(self, topics_or_res_tr, qrels_tr, topics_or_res_va, qrels_va)
816 for m in self.models:
817 if isinstance(m, EstimatorBase):
--> 818 m.fit(topics_or_res_tr, qrels_tr, topics_or_res_va, qrels_va)
819 else:
820 topics_or_res_tr = m.transform(topics_or_res_tr)
/usr/local/lib/python3.7/dist-packages/pyterrier_bert/pyt_cedr.py in fit(self, tr, qrelsTrain, va, qrelsValid)
54 train_run = self._make_cedr_run(tr, qrelsTrain)
55 valid_run = self._make_cedr_run(va, qrelsValid)
---> 56 dataset = self._make_cedr_dataset(tr.append(va))
57
58 import torch
/usr/local/lib/python3.7/dist-packages/pyterrier_bert/pyt_cedr.py in _make_cedr_dataset(self, table)
20 for index, row in table.iterrows():
21 queries[row['qid']] = row['query']
---> 22 docs[row['docno']] = row[self.doc_attr]
23 dataset=(queries, docs)
24 return dataset
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in __getitem__(self, key)
880
881 elif key_is_scalar:
--> 882 return self._get_value(key)
883
884 if is_hashable(key):
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
988
989 # Similar to Index.get_value, but we do not fall back to positional
--> 990 loc = self.index.get_loc(label)
991 return self.index._get_values_for_loc(self, loc, label)
992
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:
KeyError: 'text'
I asked earlier:
Can you look at the output of batchretrieve - does it have a text column when you do a search?
You will need metadata=["docno", "text"]
Thank you very much for your help! It works now. How can I save the trained model cedrpipe and load it later to evaluate?
CEDRPipeline has save() and load() methods: https://github.com/cmacdonald/pyterrier_bert/blob/master/pyterrier_bert/pyt_cedr.py#L34
You can reconstruct the pipeline and then call load() on the CEDRPipeline object.
Pull requests to improve the documentation are gratefully received.
I tried cedrpipe.save(filename) and got the following error, although I can see the save() function in pyt_cedr.py:
AttributeError Traceback (most recent call last)
<ipython-input-11-98c9aab0f0a0> in <module>()
----> 1 cedrpipe.save(cedr_dph)
AttributeError: 'ComposedPipeline' object has no attribute 'save'
You have to use the actual CEDR object, i.e. in DPH_br >> CEDRPipeline(max_valid_rank=20), it's the second object that you want to invoke save() on.
If you don't have a reference to it, you should be able to recover it using cedrpipe[1] (assuming your pipeline has only 2 stages; with multi-stage pipelines things are more difficult).
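A simple way to avoid indexing into the pipeline at all is to keep a reference to the trainable stage before composing it, e.g. cedr = CEDRPipeline(max_valid_rank=20) followed by cedrpipe = DPH_br >> cedr, and later cedr.save(filename). The sketch below illustrates the pattern with toy stand-in classes so that it runs without PyTerrier: ToyReranker, ToyComposed and the filename are hypothetical and are not the library's implementation.

```python
class ToyReranker:
    """Hypothetical stand-in for a trainable stage that offers save()."""

    def __init__(self):
        self.saved_to = None

    def save(self, filename):
        # A real stage would write model weights; here we only record the path.
        self.saved_to = filename


class ToyComposed:
    """Hypothetical stand-in for a composed pipeline: holds stages, has no save()."""

    def __init__(self, *stages):
        self.models = list(stages)

    def __getitem__(self, i):
        return self.models[i]


first_stage = object()        # placeholder for the retrieval stage
reranker = ToyReranker()      # keep this reference before composing
pipeline = ToyComposed(first_stage, reranker)

# The composed object has no save(), which mirrors the AttributeError above.
assert not hasattr(pipeline, "save")

# Save through the stage itself, either via the kept reference or by index.
reranker.save("cedr_dph.model")
assert pipeline[1] is reranker
```

Holding on to the stage object works regardless of how many stages the pipeline has, whereas indexing with [1] relies on knowing the stage's position.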
I am trying to follow the commands in the CEDR tutorial on the trec-deep-learning-passages dataset. Indexing completed successfully, following the PyTerrier docs. At the following line I get an error:
cedrpipe.fit(topicsTrain, qrelsTrain, topicsValid, qrelsTrain)
Error stacktrace: