capreolus-ir / capreolus

A toolkit for end-to-end neural ad hoc retrieval
https://capreolus.ai
Apache License 2.0

[Proposal] Adding 'metrics' config to reranker task #142

Closed ali-abz closed 3 years ago

ali-abz commented 3 years ago

Hi there, I noticed that when defining a rank task, the user may define a list of metrics to be evaluated and reported, but there is no such config for rerank tasks. I am assuming this feature is not missing on purpose and that there is no harm in adding it. If that is the case, please consider adding this config to rerankers.

ali-abz commented 3 years ago

I believe the only use of such a config in the rerank task would be here:

https://github.com/capreolus-ir/capreolus/blob/0121f6e7efa3c1f19cc4704ac6f69747e1baa028/capreolus/task/rerank.py#L223

So the config parsing should happen around there. I would be glad if you let me submit a pull request so you can check whether it is OK.
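
To be concrete, the sketch below is roughly what I have in mind, modeled on the rank task's existing metrics option (the exact default value and keyword arguments are my guess and may change in the PR):

from capreolus import ConfigOption

class RerankTask(Task):  # existing class in capreolus/task/rerank.py
    config_spec = [
        # ... existing options such as "optimize" stay unchanged ...
        # sketch: mirror the rank task's declaration; default and kwargs are assumptions
        ConfigOption("metrics", "default", "metrics reported for evaluation", value_type="strlist"),
    ]

The evaluate step would then pass self.config["metrics"] to the call linked above instead of a hard-coded list.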

andrewyates commented 3 years ago

Hi Ali, this sounds good to me.

ali-abz commented 3 years ago

Hi Andrew. I made a small set of changes to rerank.py and it is almost fine :) There is one small problem: when the rerank task is bound to a list of metrics, the trainer (e.g., the PyTorch trainer) still generates reports that use its default metrics. Please see these logs:


2021-03-29 23:26:45,332 - INFO - capreolus.trainer.pytorch.train - A single iteration takes 0.06088113784790039
2021-03-29 23:26:45,332 - INFO - capreolus.trainer.pytorch.train - iter = 2 loss = 0.000000
2021-03-29 23:26:47,417 - INFO - capreolus.trainer.pytorch.train - dev metrics: P_1=0.359 P_10=0.248 P_20=0.124 P_5=0.295 judged_10=0.292 judged_20=0.292 judged_200=0.292 map=0.085 ndcg_cut_10=0.292 ndcg_cut_20=0.224 ndcg_cut_5=0.304 recall_100=0.138 recall_1000=0.138 recip_rank=0.464
2021-03-29 23:26:47,417 - INFO - capreolus.trainer.pytorch.train - new best dev metric: 0.0850
2021-03-29 23:26:53,718 - INFO - capreolus.trainer.pytorch.train - A single iteration takes 0.059545278549194336
2021-03-29 23:26:53,718 - INFO - capreolus.trainer.pytorch.train - iter = 3 loss = 0.000000
2021-03-29 23:26:55,814 - INFO - capreolus.trainer.pytorch.train - dev metrics: P_1=0.352 P_10=0.248 P_20=0.124 P_5=0.295 judged_10=0.292 judged_20=0.292 judged_200=0.292 map=0.085 ndcg_cut_10=0.292 ndcg_cut_20=0.223 ndcg_cut_5=0.303 recall_100=0.138 recall_1000=0.138 recip_rank=0.462
2021-03-29 23:27:02,110 - INFO - capreolus.trainer.pytorch.train - A single iteration takes 0.05727958679199219
2021-03-29 23:27:02,110 - INFO - capreolus.trainer.pytorch.train - iter = 4 loss = 0.000000
2021-03-29 23:27:04,206 - INFO - capreolus.trainer.pytorch.train - dev metrics: P_1=0.349 P_10=0.248 P_20=0.124 P_5=0.291 judged_10=0.292 judged_20=0.292 judged_200=0.292 map=0.085 ndcg_cut_10=0.292 ndcg_cut_20=0.223 ndcg_cut_5=0.300 recall_100=0.138 recall_1000=0.138 recip_rank=0.461
2021-03-29 23:27:04,207 - INFO - capreolus.trainer.pytorch.train - new best dev metric: 0.0852
2021-03-29 23:27:10,503 - INFO - capreolus.trainer.pytorch.train - A single iteration takes 0.05933666229248047
2021-03-29 23:27:10,503 - INFO - capreolus.trainer.pytorch.train - iter = 5 loss = 0.000000
2021-03-29 23:27:12,603 - INFO - capreolus.trainer.pytorch.train - dev metrics: P_1=0.342 P_10=0.248 P_20=0.124 P_5=0.286 judged_10=0.292 judged_20=0.292 judged_200=0.292 map=0.085 ndcg_cut_10=0.291 ndcg_cut_20=0.223 ndcg_cut_5=0.297 recall_100=0.138 recall_1000=0.138 recip_rank=0.457
2021-03-29 23:27:18,841 - INFO - capreolus.trainer.pytorch.train - training loss: [0.6238575577735901, 0.0, 0.0, 0.0, 0.0]
2021-03-29 23:27:18,841 - INFO - capreolus.trainer.pytorch.train - Training took 43.153306007385254
2021-03-29 23:27:25,144 - INFO - capreolus.task.rank.evaluate - rank: fold=s1 best run: /home/aliabedzadeh/.capreolus/results/collection-nf/benchmark-nf_fields-all_titles_labelrange-0-2/collection-nf/index-anserini_indexstops-False_stemmer-porter/searcher-BM25_b-0.8_fields-title_hits-1000_k1-0.9/task-rank_filter-False/searcher
2021-03-29 23:27:25,144 - INFO - capreolus.task.rank.evaluate - rank: cross-validated results when optimizing for 'map':
2021-03-29 23:27:25,144 - INFO - capreolus.task.rank.evaluate -                      P_13: 0.2065
2021-03-29 23:27:25,145 - INFO - capreolus.task.rank.evaluate -                       map: 0.1520
2021-03-29 23:27:25,913 - INFO - capreolus.task.rerank.evaluate - rerank: fold=s1 dev metrics: P_13=0.191 map=0.085
2021-03-29 23:27:25,987 - INFO - capreolus.task.rerank.evaluate - rerank: fold=s1 test metrics: P_13=0.188 map=0.117
2021-03-29 23:27:25,988 - INFO - capreolus.task.rerank.evaluate - rerank: average cross-validated metrics when choosing iteration base on 'map':
2021-03-29 23:27:30,672 - INFO - capreolus.task.rerank.evaluate -                      P_13: 0.1883
2021-03-29 23:27:30,673 - INFO - capreolus.task.rerank.evaluate -                       map: 0.1174
2021-03-29 23:27:30,673 - INFO - capreolus.task.rerank.evaluate - interpolated with alphas = [1.0]
2021-03-29 23:27:30,673 - INFO - capreolus.task.rerank.evaluate -             P_13 [interp]: 0.2065
2021-03-29 23:27:30,673 - INFO - capreolus.task.rerank.evaluate -              map [interp]: 0.1520

You can see reports like:

INFO - capreolus.trainer.pytorch.train - dev metrics: P_1=0.342 P_10=0.248 P_20=0.124 P_5=0.286 judged_10=0.292 judged_20=0.292 judged_200=0.292 map=0.085 ndcg_cut_10=0.291 ndcg_cut_20=0.223 ndcg_cut_5=0.297 recall_100=0.138 recall_1000=0.138 recip_rank=0.457

These reports can be altered by passing the user-defined metrics to the trainer, but doing so causes errors:

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    pre = task.train()
  File "/home/aliabedzadeh/miniconda3/envs/capreolus/lib/python3.8/site-packages/capreolus/task/rerank.py", line 48, in train
    return self.rerank_run(best_search_run, self.get_results_path())
  File "/home/aliabedzadeh/miniconda3/envs/capreolus/lib/python3.8/site-packages/capreolus/task/rerank.py", line 86, in rerank_run
    dev_preds = self.reranker.trainer.train(
  File "/home/aliabedzadeh/miniconda3/envs/capreolus/lib/python3.8/site-packages/capreolus/trainer/pytorch.py", line 261, in train
    summary_writer.add_scalar("ndcg_cut_20", metrics["ndcg_cut_20"], niter)
KeyError: 'ndcg_cut_20'

summary_writer is trying to log a metric that was not requested by the user and hence was never calculated.

What would be a good approach here? Should we leave the trainer's logs as they are, or is there a way to generalize summary_writer?
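
One way I can imagine generalizing it (a sketch only; summary_writer, metrics, and niter come from the enclosing train() scope in capreolus/trainer/pytorch.py):

# sketch: log whatever was computed instead of hard-coding metric names
for metric_name, metric_value in metrics.items():
    summary_writer.add_scalar(metric_name, metric_value, niter)

That way the trainer would only ever log metrics that actually exist in the dict, whatever list the user configured.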

andrewyates commented 3 years ago

I think the issue here is that the trainers specify their own metrics separately. The trainers' metrics are used for measuring performance on the dev set in order to pick an epoch, whereas the rerank task's metrics are on the test set after cross-validation.

If you don't need to change the trainers' metrics for your use case, I think it's fine to leave them as they are. Does this work for you, though?

ali-abz commented 3 years ago

That works for me. I just needed a final report on some metrics from my reranker. The metric to optimize will still be passed to the trainer, so there won't be any problem there either. I believe this brings the issue to a close. Thanks.
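
For completeness, the usage I had in mind looks roughly like this (a hypothetical sketch; the exact config keys and constructor form may differ in the final PR):

# hypothetical usage sketch -- config keys here are assumptions, not the merged API
from capreolus.task.rerank import RerankTask

task = RerankTask({"metrics": ["map", "P_13"], "optimize": "map"})
task.train()  # "optimize" still picks the best iteration; "metrics" only controls the final report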