Can you briefly describe what MatchZoo is and why the "integration" is needed?
MatchZoo is a toolkit for sentence pair modeling (see https://github.com/faneshion/MatchZoo). It provides various NN models to predict the relevance score of two sentences. Since NNs are becoming more and more popular in the IR area, I want to combine Anserini and MatchZoo to build a ranking-reranking pipeline: ranking in Anserini provides BM25, QL, etc. to retrieve documents from a large corpus, and reranking in MatchZoo takes the resulting small, well-labeled set of sentence pairs and predicts new relevance scores.
Since the setup is different, there are some issues to consider. For example, there is no training set in most TREC tracks. I am not familiar with all tracks, so I will start with CORE17 to see how it works.
Since you said you are going to use a shell script to run the process, I assume you are not targeting end-to-end? Then I think what you need is probably a more powerful Python interface for Anserini?
Yes. It is only possible to implement the end-to-end framework once we have a well-trained (offline) reranking model. My current plan is to use Anserini to generate the training data for the reranking model.
Yes. I'll try my best to do all of these in Python.
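For the Python side, a minimal sketch of driving Anserini from Python via pyjnius might look like the one below. The jar path and index directory are placeholders, and the assumption that `io.anserini.search.SimpleSearcher` exposes `search(query, k)` with `docid` and `score` fields on each hit should be checked against the Anserini version in use.

```python
# Minimal sketch: calling Anserini's SimpleSearcher from Python via pyjnius.
# The jar path and index directory below are hypothetical placeholders.
import jnius_config
jnius_config.set_classpath('anserini/target/anserini-0.0.1-SNAPSHOT-fatjar.jar')

from jnius import autoclass

JSimpleSearcher = autoclass('io.anserini.search.SimpleSearcher')
searcher = JSimpleSearcher('lucene-index.robust04')  # hypothetical index directory

hits = searcher.search('international organized crime', 1000)
for rank, hit in enumerate(hits, start=1):
    # One line per query-document pair, TREC-run style (assumes .docid / .score fields).
    print(f'{rank:4d} {hit.docid:15s} {hit.score:.5f}')
```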
I train the NN models using the gold-labeled data from the qrels files in MB2011, MB2012 and MB2013, and test them on the run file generated by QL on MB2014. The reranking results are shown below:
Models | MAP | MRR | P@30 |
---|---|---|---|
QL | 0.4184 | 0.8408 | 0.6424 |
DSSM | 0.2030 | 0.5977 | 0.3224 |
CDSSM | 0.1599 | 0.3481 | 0.2182 |
DUET | 0.2606 | 0.4551 | 0.3782 |
KNRM | 0.2433 | 0.4702 | 0.3521 |
DRMM | 0.3204 | 0.6642 | 0.4739 |
After preprocessing and fine-tuning:
Models | MAP | MRR | P@30 |
---|---|---|---|
QL | 0.4184 | 0.8408 | 0.6424 |
DSSM | 0.2481 | 0.5950 | 0.3921 |
CDSSM | 0.1914 | 0.4192 | 0.2655 |
DUET | 0.2658 | 0.4730 | 0.3879 |
KNRM | 0.2984 | 0.7005 | 0.4570 |
DRMM | 0.3778 | 0.7386 | 0.5588 |
Conv-KNRM | 0.3014 | 0.5955 | 0.4194 |
Compared to the baseline paper:
- Better: DRMM
- Worse: KNRM
- Almost the same: DSSM, CDSSM, DUET
So none of these reranking approaches improve over QL?
@arjenpdevries For my experiment on MB corpus, yes.
More papers on ad-hoc retrieval:
- https://arxiv.org/abs/1606.04648 (Robust04)
- https://arxiv.org/abs/1711.08611 (Robust04)
- https://arxiv.org/abs/1805.05737 (MQ2007, MQ2008)
- https://arxiv.org/abs/1706.06613 (search logs of Sogou.com)
- http://aclweb.org/anthology/P18-1223 (search logs of Sogou.com)
- http://www.cs.wayne.edu/kotov/docs/balaneshinkordan-cikm18.pdf (GOV2, DBpedia-v2, HomeDepot)
- http://ls3.rnet.ryerson.ca/wiki/images/2/29/NeuralEmbeddings_IPM.pdf (ClueWeb'09B and ClueWeb'12B)
I split the 250 topics in Robust04 into train/dev/test sets (200 for training, 25 for dev and 25 for test). Embeddings are pretrained on the Robust04 corpus using word2vec. For DRMM I use the same parameters described in the DRMM paper.
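As a minimal sketch of the embedding pretraining step, assuming gensim (≥ 4.0) and a whitespace-tokenized corpus file; the file name and hyperparameters here are illustrative, not the exact settings used for the runs below:

```python
# Minimal sketch: pretrain word2vec embeddings on the Robust04 corpus with gensim.
# Assumes 'robust04_tokenized.txt' holds one whitespace-tokenized document per line.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(
    LineSentence('robust04_tokenized.txt'),
    vector_size=300,   # called 'size' in gensim < 4.0
    window=5,
    min_count=5,
    workers=4,
)
model.wv.save_word2vec_format('robust04.word2vec.txt')
```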
I train the NN models using the gold-labeled data from the qrels files and test them on the run file generated by SimpleSearcher on the test topics. The reranking results are shown below:
Models | MAP | MRR | P@30 |
---|---|---|---|
SimpleSearcher | 0.2671 | 0.7699 | 0.3813 |
DSSM | 0.0768 | 0.3315 | 0.0864 |
CDSSM | 0.0618 | 0.3419 | 0.0814 |
DUET | 0.0891 | 0.4103 | 0.1305 |
KNRM | 0.1517 | 0.6439 | 0.2498 |
DRMM (the same parameters in the paper) | 0.2458 | 0.7600 | 0.3453 |
DRMM (fine tuned) | 0.2731 | 0.7717 | 0.3832 |
So the results basically match the numbers in the DRMM paper. I think the NNs other than DRMM fail to leverage the matching information of the whole passage (all of them truncate the document at the first 500 words).
Hi @arjenpdevries!
BTW, http://desires.dei.unipd.it/papers/paper10.pdf reports some nice numbers for Robust04.
After interpolation:
Models | MAP | MRR | P@30 |
---|---|---|---|
QL | 0.4184 | 0.8408 | 0.6424 |
DSSM+ (lambda=0) | 0.4184 | 0.8408 | 0.6424 |
CDSSM+ (lambda=0.1) | 0.4229 | 0.8531 | 0.6442 |
DUET+ (lambda=0.2) | 0.4476 | 0.8364 | 0.6606 |
KNRM+ (lambda=0.1) | 0.4389 | 0.8456 | 0.6564 |
DRMM+ (lambda=0.3) | 0.4475 | 0.8708 | 0.6448 |
All results are higher than in the baseline paper, because the NN reranking models take the better QL (SimpleSearcher) output as input.
Models | MAP | MRR | P@30 |
---|---|---|---|
SimpleSearcher | 0.2671 | 0.7699 | 0.3813 |
DSSM+ (lambda = 0) | 0.2671 | 0.7699 | 0.3813 |
CDSSM+ (lambda = 0) | 0.2671 | 0.7699 | 0.3813 |
DUET+ (lambda = 0.05) | 0.2735 | 0.7724 | 0.3954 |
KNRM+ (lambda = 0.15) | 0.2713 | 0.7719 | 0.3908 |
DRMM+ (the same parameters in the paper) (lambda = 0.3) | 0.2793 | 0.7784 | 0.3981 |
DRMM+ (fine tuned) (lambda = 0.35) | 0.2914 | 0.7859 | 0.4015 |
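The lambda = 0 rows reduce exactly to the QL / SimpleSearcher baselines, which suggests the interpolation weights the NN score by lambda and the retrieval score by (1 - lambda). A minimal sketch under that reading, additionally assuming per-query min-max normalization of both score lists (the normalization actually used for the runs above is not stated):

```python
# Minimal sketch of per-query linear score interpolation.
# The min-max normalization here is an assumption, for illustration only.
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def interpolate(ql_scores, nn_scores, lam):
    """ql_scores, nn_scores: dict of docid -> score for a single query."""
    ql, nn = minmax(ql_scores), minmax(nn_scores)
    # lam = 0 recovers the original QL ranking, matching the tables above.
    return {d: (1 - lam) * ql[d] + lam * nn.get(d, 0.0) for d in ql}
```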
I added the connection between Anserini and PACRR (https://github.com/khui/copacrr).
Models | MAP | MRR | P@30 |
---|---|---|---|
QL | 0.4184 | 0.8408 | 0.6424 |
PACRR | 0.3980 | 0.7827 | 0.5576 |
PACRR+ | | | |
It is pretty close to the QL baseline and might be better after more tuning.
Hi @Victor0118 and @lintool, thank you for providing the integration with MatchZoo. I am looking at the document describing this work (https://github.com/castorini/anserini/blob/master/docs/document-matchzoo.md). However, I am confused about where the respective training scripts can be found.
In particular, I cannot find the following scripts:
- `prepare_mz_data.py`
- `matchzoo/main.py`

I also checked the castor repository, but in vain. Is it in some branch? Could you guide me?
Many thanks!
Hi, @searchivarius. Thanks for your interest in this work.
This integration is between Anserini and MatchZoo v1.0, so you can find the scripts in MatchZoo v1.0 here: https://github.com/NTMC-Community/MatchZoo/tree/1.0
I would suggest you use my repo, since it has some updates on top of the code above that let us apply MatchZoo to the Robust04 and Tweet datasets: https://github.com/Victor0118/MatchZoo/tree/rerank/data/robust04 and https://github.com/Victor0118/MatchZoo/tree/rerank/data/tweets
Hi @Victor0118 thanks a lot for the quick reply!
The basic idea is to transform the `run.*` file into the mz format required by MatchZoo. The target is to run the ranking and reranking models in a shell script. I hope they are all written in Python. There are several points I am not clear on:

1. How to generate the `run.*` file? One option is pyjnius + Anserini.
2. How to define the train/test set? There should be no train/test splits in most TREC tracks, since they are IR tasks rather than ML tasks. My plan is to let users DIY their own train/test split. For example, we can train on MB11 and test on MB13, or train on Robust04 and test on CORE17.
3. How to select negative samples? Some tracks provide negative samples, like CORE17, but some only provide the relevant documents and treat all other documents as irrelevant, like CAR17. We need to sample them. Can we use the query-doc pairs from `run.*` instead of the `qrels.*` file as the training and test data for reranking? In CAR17 I did both.
4. How to select sentences? For most tracks, the documents are long texts, which will cause a big efficiency problem in neural network reranking. One basic approach is to select some representative sentences for each document by their tf-idf matching score (a minimal sketch is shown after this list).
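For the sentence-selection point, a minimal sketch using scikit-learn's TfidfVectorizer and a naive period-based sentence split; both choices are illustrative stand-ins, not what MatchZoo itself does:

```python
# Minimal sketch: keep the k sentences of a document most similar to the query
# under a tf-idf bag-of-words model. Splitter and vectorizer are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_sentences(query, document, k=5):
    sentences = [s.strip() for s in document.split('.') if s.strip()]
    if len(sentences) <= k:
        return sentences
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(sentences + [query])
    # Similarity of the query (last row) to every sentence.
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = sims.argsort()[::-1][:k]
    # Keep the selected sentences in their original document order.
    return [sentences[i] for i in sorted(top)]
```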