cmacdonald / pyterrier_bert

Error in CEDR.fit using trec-deep-learning-passages dataset #5

Open Tooba-ts1700550 opened 3 years ago

Tooba-ts1700550 commented 3 years ago

I am trying to follow the commands in the tutorial for CEDR on the trec-deep-learning-passages dataset. Indexing completed successfully, following the PyTerrier docs. At the following line I get an error:

cedrpipe.fit(topicsTrain, qrelsTrain, topicsValid, qrelsTrain)

Error stacktrace:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'body'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
6 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 
   2902         if tolerance is not None:

KeyError: 'body'
cmacdonald commented 3 years ago

Are you sure that your BatchRetrieve is returning metadata, c.f. DPH_br = pt.BatchRetrieve(index, controls={"wmodel" : "DPH"}, verbose=True, metadata=["docno", "body"]) ?

Tooba-ts1700550 commented 3 years ago

I tried removing metadata=["docno", "body"], and I got the same error on the same line. The traceback is identical to the one in my first comment and again ends with:

KeyError: 'body'

I also tried metadata=["docno", "text"], and still got the same error.

cmacdonald commented 3 years ago

Can you look at the output of BatchRetrieve - does it have a text column when you do a search?

See also this doc for how to index and retrieve with text: https://pyterrier.readthedocs.io/en/latest/text.html
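One way to run that check, as a minimal sketch: the helper below is hypothetical (it is not part of PyTerrier) and only compares column names. It assumes CEDR needs qid, query, docno and a text column, and that the text column is named text, as in the index built in this thread. The sample column list is illustrative of a retriever that was not asked for extra metadata.

```python
def missing_columns(columns, required=("qid", "query", "docno", "text")):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [c for c in required if c not in present]

# Hypothetical usage with the retriever from this thread:
#   res = DPH_br.search("what is a lobster roll")
#   print(missing_columns(res.columns))

# Illustrative column list for a result frame that carries no stored text.
cols_without_text = ["qid", "docid", "docno", "rank", "score", "query"]
print(missing_columns(cols_without_text))  # ['text']
```

An empty list means every column the later stage reads is present; anything else names the attribute that will raise a KeyError downstream.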

Tooba-ts1700550 commented 3 years ago

Actually, I used the same code as before and it was working with the other experiments I was doing, only with CEDR I get this error.

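import pyterrier as pt
if not pt.started():
    pt.init()

def msmarco_generate():
    dataset = pt.get_dataset("trec-deep-learning-passages")
    with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
        for l in corpusfile:
            # each line is "<docno>\t<passage>"; drop the trailing newline
            docno, passage = l.rstrip("\n").split("\t")
            yield {'docno' : docno, 'text' : passage}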

iter_indexer = pt.IterDictIndexer("./passage_index")
indexref = iter_indexer.index(msmarco_generate(), meta={'docno' : 20, 'text': 4096})

index = pt.IndexFactory.of(indexref)
trecdl = pt.get_dataset("trec-deep-learning-passages")

indexloc= "./passage_index/data.properties"
qrelsTest = trecdl.get_qrels("test-2020")
qrelsTrain = trecdl.get_qrels("train")
#take 1000 topics for training
topicsTrain = trecdl.get_topics("train").head(1000)
#take 50 topics for validation
topicsValid = trecdl.get_topics("train").iloc[1000:1050]
#this one-liner removes topics from the test set that do not have relevant documents
topicsTest = trecdl.get_topics("test-2020").merge(qrelsTest[qrelsTest["label"] > 0][["qid"]].drop_duplicates())

from pyterrier_bert.pyt_cedr import CEDRPipeline

DPH_br = pt.BatchRetrieve(index, controls={"wmodel" : "DPH"}, verbose=True)
DPH_br_qe = pt.BatchRetrieve(index, controls={"wmodel" : "DPH", "qe" : "on"}, verbose=True)

cedrpipe = DPH_br >> CEDRPipeline(max_valid_rank=20)
# training, this uses validation set to apply early stopping
cedrpipe.fit(topicsTrain, qrelsTrain, topicsValid, qrelsTrain)

# testing performance
pt.pipelines.Experiment(topicsTest, 
                        [DPH_br, DPH_br_qe, cedrpipe],
                        ['map', 'ndcg'], 
                        qrelsTest, 
                        names=["DPH", "DPH + QE", "DPH + CEDR BERT"])
cmacdonald commented 3 years ago

I think this is a mismatch in attributes - CEDRPipeline expects 'body' rather than 'text'. You can change it in the constructor. See https://github.com/cmacdonald/pyterrier_bert/blob/master/pyterrier_bert/pyt_cedr.py#L10
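To illustrate the mismatch, here is a simplified stand-in for the per-row lookup that appears in the traceback later in this thread (docs[row['docno']] = row[self.doc_attr]). It is a sketch, not the library's code: make_docs and the sample row are made up, and plain dicts stand in for DataFrame rows.

```python
def make_docs(rows, doc_attr="body"):
    """Map docno -> document text, reading the text from the `doc_attr` field."""
    docs = {}
    for row in rows:
        docs[row["docno"]] = row[doc_attr]  # raises KeyError if the field is absent
    return docs

# One retrieved row whose text is stored under 'text', as in the index above.
rows = [{"qid": "1", "query": "lobster roll", "docno": "d1",
         "text": "A lobster roll is a sandwich."}]

try:
    make_docs(rows)                      # default attribute 'body' is not in the row
except KeyError as err:
    print("missing attribute:", err)     # missing attribute: 'body'

print(make_docs(rows, doc_attr="text"))  # {'d1': 'A lobster roll is a sandwich.'}
```

The lookup only succeeds when the attribute name the reranker reads matches a column that the retriever actually returns.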

Tooba-ts1700550 commented 3 years ago

OK, I changed "body" to "text" on line 10. Do I need to change it anywhere else as well? The KeyError is now on 'text'.

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'text'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
6 frames
<ipython-input-5-58684445261e> in <module>()
      6 cedrpipe = DPH_br >> CEDRPipeline(max_valid_rank=20)
      7 # training, this uses validation set to apply early stopping
----> 8 cedrpipe.fit(topicsTrain, qrelsTrain, topicsValid, qrelsTrain)
      9 
     10 # testing performance

/usr/local/lib/python3.7/dist-packages/pyterrier/transformer.py in fit(self, topics_or_res_tr, qrels_tr, topics_or_res_va, qrels_va)
    816         for m in self.models:
    817             if isinstance(m, EstimatorBase):
--> 818                 m.fit(topics_or_res_tr, qrels_tr, topics_or_res_va, qrels_va)
    819             else:
    820                 topics_or_res_tr = m.transform(topics_or_res_tr)

/usr/local/lib/python3.7/dist-packages/pyterrier_bert/pyt_cedr.py in fit(self, tr, qrelsTrain, va, qrelsValid)
     54         train_run = self._make_cedr_run(tr, qrelsTrain)
     55         valid_run = self._make_cedr_run(va, qrelsValid)
---> 56         dataset = self._make_cedr_dataset(tr.append(va))
     57 
     58         import torch

/usr/local/lib/python3.7/dist-packages/pyterrier_bert/pyt_cedr.py in _make_cedr_dataset(self, table)
     20         for index, row in table.iterrows():
     21             queries[row['qid']] = row['query']
---> 22             docs[row['docno']] = row[self.doc_attr]
     23         dataset=(queries, docs)
     24         return dataset

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in __getitem__(self, key)
    880 
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883 
    884         if is_hashable(key):

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
    988 
    989         # Similar to Index.get_value, but we do not fall back to positional
--> 990         loc = self.index.get_loc(label)
    991         return self.index._get_values_for_loc(self, loc, label)
    992 

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 
   2902         if tolerance is not None:

KeyError: 'text'
cmacdonald commented 3 years ago

I asked earlier:

Can you look at the output of batchretrieve - does it have a text column when you do a search?

You will need metadata=["docno", "text"]
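Putting the two pieces together, a sketch of a pipeline whose stages agree on the attribute name. The BatchRetrieve call mirrors the one quoted earlier in the thread. The doc_attr keyword is an assumption inferred from self.doc_attr in the traceback, so check it against the constructor linked above. build_cedr_pipeline is a hypothetical helper, and the imports sit inside it so that the definition can be pasted before PyTerrier is initialised.

```python
def build_cedr_pipeline(index, text_attr="text"):
    """Return (retriever, cedr, pipeline) that agree on the text attribute name."""
    import pyterrier as pt
    from pyterrier_bert.pyt_cedr import CEDRPipeline

    # Ask the retriever to return the stored text alongside each docno.
    retriever = pt.BatchRetrieve(index, controls={"wmodel": "DPH"},
                                 verbose=True, metadata=["docno", text_attr])
    # Assumption: the constructor keyword is `doc_attr` (inferred, not confirmed).
    cedr = CEDRPipeline(max_valid_rank=20, doc_attr=text_attr)
    return retriever, cedr, retriever >> cedr

# Hypothetical usage with the objects defined earlier in this thread:
#   DPH_br, cedr, cedrpipe = build_cedr_pipeline(index)
#   cedrpipe.fit(topicsTrain, qrelsTrain, topicsValid, qrelsTrain)
```

Returning the CEDR stage separately keeps a direct reference to it, which is convenient for anything that has to be called on that stage rather than on the composed pipeline.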

Tooba-ts1700550 commented 3 years ago

Thank you very much for your help! It works now. How can I save the trained model cedrpipe and load it later to evaluate?

cmacdonald commented 3 years ago

CEDRPipeline has save() and load() methods: https://github.com/cmacdonald/pyterrier_bert/blob/master/pyterrier_bert/pyt_cedr.py#L34

You can reconstruct the pipeline and then apply load() on the CEDRPipeline object.

Pull Requests to improve the documentation are gratefully received.
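A rough sketch of that reconstruct-then-load step. It assumes load() takes a single file name, as the save(filename) attempt further down suggests, and relies only on the >> operator used throughout this thread. restore_cedr_pipeline and the file name are hypothetical.

```python
def restore_cedr_pipeline(retriever, cedr, filename):
    """Load saved weights into a freshly constructed CEDR stage, then compose."""
    cedr.load(filename)       # load() is called on the CEDR stage itself
    return retriever >> cedr  # the composed pipeline is built afterwards

# Hypothetical usage, mirroring the names used in this thread:
#   cedr = CEDRPipeline(max_valid_rank=20)  # plus any other arguments used for training
#   (DPH_br >> cedr).fit(topicsTrain, qrelsTrain, topicsValid, qrelsTrain)
#   cedr.save("cedr_dph.model")             # file name is made up
#   ... later, in a new session ...
#   cedrpipe = restore_cedr_pipeline(DPH_br, CEDRPipeline(max_valid_rank=20), "cedr_dph.model")
```

The freshly constructed stage should be given the same constructor arguments as the one that was trained, since load() restores the saved model into that object.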

Tooba-ts1700550 commented 3 years ago

I tried cedrpipe.save(filename) and got the following error, although I can see the save() function in pyt_cedr.py:

AttributeError                            Traceback (most recent call last)
<ipython-input-11-98c9aab0f0a0> in <module>()
----> 1 cedrpipe.save(cedr_dph)

AttributeError: 'ComposedPipeline' object has no attribute 'save'
cmacdonald commented 3 years ago

You have to use the actual CEDR object, i.e. in DPH_br >> CEDRPipeline(max_valid_rank=20), it's the second object that you want to invoke save() on.

If you don't have a reference to it, you should be able to recover it using cedrpipe[1] (assuming your pipeline has only 2 stages; with multi-stage pipelines things are more difficult).
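For longer pipelines, one possible approach is to search the stages for the first one that exposes save() and load(). The sketch below assumes only that a composed pipeline supports len() and integer indexing, as the cedrpipe[1] suggestion implies, and that stages may themselves be composed pipelines. find_saveable_stage and the stand-in classes are hypothetical and exist only to make the example runnable without PyTerrier.

```python
def find_saveable_stage(pipeline):
    """Depth-first search for the first stage exposing both save() and load()."""
    if callable(getattr(pipeline, "save", None)) and callable(getattr(pipeline, "load", None)):
        return pipeline
    try:
        n = len(pipeline)          # leaf stages are assumed not to define len()
    except TypeError:
        return None
    for i in range(n):
        found = find_saveable_stage(pipeline[i])
        if found is not None:
            return found
    return None

# Stand-ins for real stages, used only to exercise the helper.
class FakeStage:
    pass

class FakeCEDR:
    def save(self, filename): pass
    def load(self, filename): pass

class FakeComposed:
    def __init__(self, *stages): self.stages = list(stages)
    def __len__(self): return len(self.stages)
    def __getitem__(self, i): return self.stages[i]

cedr = FakeCEDR()
nested = FakeComposed(FakeComposed(FakeStage(), FakeStage()), cedr)
print(find_saveable_stage(nested) is cedr)  # True
```

With a real pipeline the call would be find_saveable_stage(cedrpipe).save(filename). Keeping a reference to the CEDR object at construction time remains the simpler option.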