castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Integration between Anserini and MatchZoo #420

Closed Victor0118 closed 5 years ago

Victor0118 commented 6 years ago

The basic idea is to transform the run.* file into the mz format required by MatchZoo. The goal is to run the ranking and reranking models from a shell script, with everything written in Python. There are several points I am not clear about:

1. How to get the raw documents from the document IDs in the run.* file? One option is pyjnius + Anserini, for example:

```python
import os

# The classpath must point at the Anserini fat jar before pyjnius is imported.
os.environ['CLASSPATH'] = "/home/larumuga/Anserini/target/anserini-0.0.1-SNAPSHOT.jar"

from jnius import autoclass

JString = autoclass('java.lang.String')
IndexUtils = autoclass('io.anserini.index.IndexUtils')

# Open the index and fetch a raw document by its docid.
index_utils = IndexUtils(JString('/home/w85yang/Anserini/lucene-index-all.car18'))
print(index_utils.getRawDocument(JString('7250e1b901bb59853deb38a452f9009999e790ae')))
```
2. How to define the train/test split? Most TREC tracks have no train/test split, since they are IR tasks, not ML tasks. My plan is to let users define their own split: for example, train on MB11 and test on MB13, or train on Robust04 and test on CORE17.

3. How to select negative samples? Some tracks, like CORE17, provide negative examples, but others, like CAR17, only provide the relevant documents and treat everything else as irrelevant, so we need to sample negatives ourselves. Can we use the query-doc pairs from the run.* file instead of the qrels.* file as training and test data for reranking? For CAR17 I did both.

4. How to select sentences? For most tracks the documents are long texts, which causes a serious efficiency problem for neural reranking. One basic approach is to select a few representative sentences per document by their tf-idf matching score against the query (see the sketch below).
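
For point 4, here is a minimal sketch of tf-idf sentence selection. The helper name, the naive sentence splitter, and the cutoff k are illustrative assumptions, not part of Anserini or MatchZoo:

```python
# Illustrative sketch only: pick the k sentences of a long document that best
# match the query under tf-idf cosine similarity. The helper name, the naive
# sentence splitter, and the choice of k are assumptions for illustration.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_sentences(query, document, k=3):
    # Naive sentence splitter; a proper tokenizer (e.g., NLTK) works too.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', document) if s]
    if len(sentences) <= k:
        return sentences
    vectorizer = TfidfVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)  # one row per sentence
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, sentence_vectors).ravel()
    # Keep the top-k sentences, preserving their original order in the document.
    top = sorted(scores.argsort()[-k:])
    return [sentences[i] for i in top]
```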

Peilin-Yang commented 6 years ago

Can you briefly describe what MatchZoo is and why the "integration" is needed?

Victor0118 commented 6 years ago

MatchZoo is a toolkit for sentence pair modeling (see https://github.com/faneshion/MatchZoo). It provides various NN models that predict a relevance score for a pair of sentences. Since neural networks are becoming more and more popular in IR, I want to combine Anserini and MatchZoo into a ranking + reranking pipeline: ranking in Anserini uses BM25, QL, etc. to retrieve documents from a large corpus, and reranking in MatchZoo takes the resulting small, well-labeled set of sentence pairs and predicts new relevance scores.

Since the setup is different, there are some issues to consider. For example, there is no training set in most TREC tracks. I am not familiar with all the tracks, so I will start with CORE17 to see how it works.

Peilin-Yang commented 6 years ago

Since you said you are going to use a shell script to run the process, I assume you are not targeting end-to-end? Then I think what you need is probably a more powerful Python interface for Anserini?

Victor0118 commented 6 years ago

Yes. Implementing the end-to-end framework is only possible once we have a well-trained (offline) reranking model. My current plan is to use Anserini to generate the training data for the reranking model.

Yes. I'll try my best to do all of these in Python.
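
For concreteness, a minimal sketch of that data-generation step, assuming standard TREC run and qrels file formats. The function names are hypothetical, and MatchZoo v1.0's actual input layout is not reproduced here:

```python
# Illustrative sketch only: pair a TREC run file with qrels to produce labeled
# (query_id, doc_id, label) training examples. Retrieved documents that are
# unjudged or judged non-relevant act as sampled negatives (point 3 above).
from collections import defaultdict

def load_qrels(path):
    """qrels format: qid 0 docid relevance"""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels[qid][docid] = int(rel)
    return qrels

def labeled_pairs(run_path, qrels):
    """run format: qid Q0 docid rank score tag"""
    with open(run_path) as f:
        for line in f:
            qid, _, docid, _, _, _ = line.split()
            label = 1 if qrels[qid].get(docid, 0) > 0 else 0
            yield qid, docid, label
```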

Victor0118 commented 5 years ago

Results on the MB 2011-2014 corpus

I train the NN models on the gold labels from the qrels files for MB2011, MB2012, and MB2013, and test them on the run file generated by QL on MB2014. The reranking results are shown below:

| Models | MAP    | MRR    | P@30   |
|--------|--------|--------|--------|
| QL     | 0.4184 | 0.8408 | 0.6424 |
| DSSM   | 0.2030 | 0.5977 | 0.3224 |
| CDSSM  | 0.1599 | 0.3481 | 0.2182 |
| DUET   | 0.2606 | 0.4551 | 0.3782 |
| KNRM   | 0.2433 | 0.4702 | 0.3521 |
| DRMM   | 0.3204 | 0.6642 | 0.4739 |

Victor0118 commented 5 years ago

After preprocessing and fine-tuning:

| Models    | MAP    | MRR    | P@30   |
|-----------|--------|--------|--------|
| QL        | 0.4184 | 0.8408 | 0.6424 |
| DSSM      | 0.2481 | 0.5950 | 0.3921 |
| CDSSM     | 0.1914 | 0.4192 | 0.2655 |
| DUET      | 0.2658 | 0.4730 | 0.3879 |
| KNRM      | 0.2984 | 0.7005 | 0.4570 |
| DRMM      | 0.3778 | 0.7386 | 0.5588 |
| Conv-KNRM | 0.3014 | 0.5955 | 0.4194 |

Compared to the baseline paper:

- Better: DRMM
- Worse: KNRM
- Almost the same: DSSM, CDSSM, DUET

arjenpdevries commented 5 years ago

So none of these reranking approaches improve over QL?

Victor0118 commented 5 years ago

@arjenpdevries For my experiments on the MB corpus, yes.

More papers on ad-hoc retrieval:

- https://arxiv.org/abs/1606.04648 (Robust04)
- https://arxiv.org/abs/1711.08611 (Robust04)
- https://arxiv.org/abs/1805.05737 (MQ2007, MQ2008)
- https://arxiv.org/abs/1706.06613 (search logs of Sogou.com)
- http://aclweb.org/anthology/P18-1223 (search logs of Sogou.com)
- http://www.cs.wayne.edu/kotov/docs/balaneshinkordan-cikm18.pdf (GOV2, DBpedia-v2, HomeDepot)
- http://ls3.rnet.ryerson.ca/wiki/images/2/29/NeuralEmbeddings_IPM.pdf (ClueWeb’09B and ClueWeb’12B)

Victor0118 commented 5 years ago

Results on the Robust04 corpus

I split the 250 Robust04 topics into train/dev/test sets (200 for training, 25 for dev, 25 for test). Embeddings are pretrained on the Robust04 corpus using word2vec. For DRMM I use the same parameters described in the DRMM paper.

I train the NN models on the gold labels from the qrels files and test them on the run file generated by SimpleSearcher on the test topics. The reranking results are shown below:

| Models                               | MAP    | MRR    | P@30   |
|--------------------------------------|--------|--------|--------|
| SimpleSearcher                       | 0.2671 | 0.7699 | 0.3813 |
| DSSM                                 | 0.0768 | 0.3315 | 0.0864 |
| CDSSM                                | 0.0618 | 0.3419 | 0.0814 |
| DUET                                 | 0.0891 | 0.4103 | 0.1305 |
| KNRM                                 | 0.1517 | 0.6439 | 0.2498 |
| DRMM (same parameters as the paper)  | 0.2458 | 0.7600 | 0.3453 |
| DRMM (fine-tuned)                    | 0.2731 | 0.7717 | 0.3832 |

So the results basically match the numbers in the DRMM paper. I think the NNs other than DRMM fail to leverage matching information from the whole document (all of them truncate documents to the first 500 words).

lintool commented 5 years ago

Hi @arjenpdevries !

BTW, http://desires.dei.unipd.it/papers/paper10.pdf reports some nice numbers for Robust04.

Victor0118 commented 5 years ago

Results on the MB 2011-2014 corpus

After interpolation:

| Models                 | MAP    | MRR    | P@30   |
|------------------------|--------|--------|--------|
| QL                     | 0.4184 | 0.8408 | 0.6424 |
| DSSM+ (lambda = 0)     | 0.4184 | 0.8408 | 0.6424 |
| CDSSM+ (lambda = 0.1)  | 0.4229 | 0.8531 | 0.6442 |
| DUET+ (lambda = 0.2)   | 0.4476 | 0.8364 | 0.6606 |
| KNRM+ (lambda = 0.1)   | 0.4389 | 0.8456 | 0.6564 |
| DRMM+ (lambda = 0.3)   | 0.4475 | 0.8708 | 0.6448 |

All results are higher than in the baseline paper, because the NN reranking models start from the stronger QL (SimpleSearcher) output.
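
For reference, the lambda = 0 rows reproduce the QL baseline exactly, which is consistent with a linear interpolation in which lambda weights the reranker score. A minimal sketch under that assumption (the per-query min-max normalization is also an assumption, since raw QL and NN scores live on different scales):

```python
# Illustrative sketch only: linearly interpolate first-stage QL scores with
# reranker scores per query. lambda = 0 recovers the QL ranking, matching
# the tables above; min-max normalization is an assumption.
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0
            for d, s in scores.items()}

def interpolate(ql_scores, nn_scores, lam):
    """scores: dict mapping doc_id -> score for a single query."""
    ql, nn = minmax(ql_scores), minmax(nn_scores)
    return {d: (1 - lam) * ql[d] + lam * nn.get(d, 0.0) for d in ql}
```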

Victor0118 commented 5 years ago

Results on the Robust04 corpus

| Models                                                | MAP    | MRR    | P@30   |
|-------------------------------------------------------|--------|--------|--------|
| SimpleSearcher                                        | 0.2671 | 0.7699 | 0.3813 |
| DSSM+ (lambda = 0)                                    | 0.2671 | 0.7699 | 0.3813 |
| CDSSM+ (lambda = 0)                                   | 0.2671 | 0.7699 | 0.3813 |
| DUET+ (lambda = 0.05)                                 | 0.2735 | 0.7724 | 0.3954 |
| KNRM+ (lambda = 0.15)                                 | 0.2713 | 0.7719 | 0.3908 |
| DRMM+ (same parameters as the paper) (lambda = 0.3)   | 0.2793 | 0.7784 | 0.3981 |
| DRMM+ (fine-tuned) (lambda = 0.35)                    | 0.2914 | 0.7859 | 0.4015 |

Victor0118 commented 5 years ago

I added a connection between Anserini and PACRR (https://github.com/khui/copacrr).

| Models | MAP    | MRR    | P@30   |
|--------|--------|--------|--------|
| QL     | 0.4184 | 0.8408 | 0.6424 |
| PACRR  | 0.3980 | 0.7827 | 0.5576 |
| PACRR+ |        |        |        |

It is pretty close to the QL baseline and might do better with more tuning.

searchivarius commented 5 years ago

Hi @Victor0118 and @lintool, thank you for providing the integration with MatchZoo. I am looking at the document describing this work: https://github.com/castorini/anserini/blob/master/docs/document-matchzoo.md. However, I am confused about where the respective training scripts can be found.

In particular, I cannot find the following scripts:

- `prepare_mz_data.py`
- `matchzoo/main.py`

I also checked the castor repository, but in vain. Is it on some branch?

Could you guide me?

Many thanks!

Victor0118 commented 5 years ago

Hi, @searchivarius. Thanks for your interest in this work.

This integration is between Anserini and MatchZoo v1.0, so you can find the scripts in MatchZoo v1.0 here: https://github.com/NTMC-Community/MatchZoo/tree/1.0

I would suggest using my fork, since it has some updates on top of that code so that MatchZoo can be applied to the Robust04 and Tweet datasets: https://github.com/Victor0118/MatchZoo/tree/rerank/data/robust04 and https://github.com/Victor0118/MatchZoo/tree/rerank/data/tweets

searchivarius commented 5 years ago

Hi @Victor0118 thanks a lot for the quick reply!