castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Cord-19 index and result #1153

Closed by zkt12 4 years ago

zkt12 commented 4 years ago

I followed the instructions exactly to build the CORD-19 indexes and run retrieval. The ndcg@10 (query-udel, round 1, 04-10) on the full-text index is 0.4996, whereas the reported value is 0.5407. Why? Also, when I build the paragraph index (05-01), only 1.72m documents are indexed, whereas the reported count is 1.76m.

lintool commented 4 years ago

As a first step to debugging, why don't you start with the pre-built indexes first? We also provide the 05-01 index pre-built.

After that, we need more details... what OS, Java version, etc. To be clear, it'd be helpful if you copied and pasted the exact commands.

zkt12 commented 4 years ago

I tried to get the pre-built indexes, but it seems Dropbox is not accessible from mainland China, even with a VPN.

So I built the indexes like this:

DATE=2020-05-01
DATA_DIR=./cord19-"${DATE}"
mkdir "${DATA_DIR}"

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/comm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/noncomm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/custom_license.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/biorxiv_medrxiv.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/arxiv.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/metadata.csv -P "${DATA_DIR}"
ls "${DATA_DIR}"/*.tar.gz | xargs -I {} tar -zxvf {} -C "${DATA_DIR}"

sh target/appassembler/bin/IndexCollection \
  -collection Cord19ParagraphCollection -generator Cord19Generator \
  -threads 8 -input "${DATA_DIR}" \
  -index "${DATA_DIR}"/lucene-index-cord19-paragraph-"${DATE}" \
  -storePositions -storeDocvectors -storeContents -storeRaw -optimize > log.cord19-paragraph.${DATE}.txt

and retrieval like this:

target/appassembler/bin/SearchCollection -index lucene-index-covid-paragraph-2020-05-01 \
  -topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round1-udel.xml -topicfield query -removedups -strip_segment_id \
  -bm25 -output runs/run.covid-r1.paragraph.query-udel.bm25.txt

My OS is Ubuntu 16.04, and my Java version is 11.0.6.

lintool commented 4 years ago

Well, you're trying to evaluate retrieval against the 5/1 corpus using qrels from the 4/10 corpus, so of course your numbers are going to be lower... TREC-COVID round 1 is against the 4/10 corpus, so to replicate results you'll have to use that.
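One way to spot this kind of corpus/qrels mismatch is to measure how many of the top-ranked documents in a run have any judgment at all. The sketch below assumes the standard TREC formats (qrels: topic, iteration, docid, relevance; run: topic, Q0, docid, rank, score, tag). The helper `judged_at_k` and the sample docids are hypothetical and not part of Anserini.

```python
from collections import defaultdict


def judged_at_k(run_lines, qrels_lines, k=10):
    """Fraction of the top-k retrieved docs (over all topics) that appear in the qrels."""
    # Collect the set of judged docids per topic.
    judged = defaultdict(set)
    for line in qrels_lines:
        parts = line.split()
        if len(parts) == 4:
            topic, _iteration, docid, _rel = parts
            judged[topic].add(docid)

    # Collect (rank, docid) pairs per topic from the run.
    ranked = defaultdict(list)
    for line in run_lines:
        parts = line.split()
        if len(parts) == 6:
            topic, _q0, docid, rank, _score, _tag = parts
            ranked[topic].append((int(rank), docid))

    hits = total = 0
    for topic, docs in ranked.items():
        for _rank, docid in sorted(docs)[:k]:
            total += 1
            hits += docid in judged[topic]
    return hits / total if total else 0.0


# Hypothetical data: "doc_new" exists only in the newer corpus, so it has no judgment.
sample_qrels = ["1 0.5 doc_a 2", "1 0.5 doc_b 0", "2 1 doc_c 1"]
sample_run = [
    "1 Q0 doc_a 1 9.1 demo",
    "1 Q0 doc_new 2 8.7 demo",
    "2 Q0 doc_c 1 7.5 demo",
    "2 Q0 doc_new 2 7.0 demo",
]

print(judged_at_k(sample_run, sample_qrels, k=2))
```

A low value suggests that many retrieved documents were never judged, and unjudged documents are normally treated as non-relevant during evaluation.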

zkt12 commented 4 years ago

Sorry, I didn't make it clear. I've built the abstract, full-text, and paragraph indexes for both the 04-10 and 05-01 corpora. There are two differences from what you reported:

  1. When I run retrieval on the 04-10 full-text index, the ndcg@10 is 0.4996.
  2. When I build the 05-01 paragraph index, only 1.72m documents are indexed.

lintool commented 4 years ago

Our indexes are mirrored here: https://git.uwaterloo.ca/jimmylin/cord19-indexes

Can you try with pre-built indexes?

lintool commented 4 years ago

I think I understand the issue now. After the construction of the pre-built indexes, I manually did some data cleaning to blacklist a few outlier documents, see: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/Cord19Generator.java#L95

For example: https://github.com/castorini/anserini/commit/37491d114ce09b3de16e93027846b532059d95a7

Note that this was done independently of search results; in fact you can see from the commit id that my manual cleaning predates the release of the round 1 results.

Thus, if you use the latest HEAD to go back and index the corpus from 4/10, you'd get slightly different document counts. This changes term and document statistics slightly... apparently enough to have an impact on the effectiveness. Small changes though.
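To illustrate why a different document count shifts scores, here is a minimal sketch of the IDF component in the form used by Lucene's BM25Similarity, ln(1 + (N - n + 0.5) / (n + 0.5)). The document frequency below is made up and held fixed for simplicity; in a real index the document frequencies and the average document length would change as well.

```python
import math


def bm25_idf(num_docs, doc_freq):
    """IDF as ln(1 + (N - n + 0.5) / (n + 0.5)), N = docs in index, n = docs containing the term."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Index sizes taken from the counts mentioned in this thread; doc_freq is hypothetical.
before = bm25_idf(num_docs=1_760_000, doc_freq=5_000)
after = bm25_idf(num_docs=1_720_000, doc_freq=5_000)

print(f"idf with 1.76m docs: {before:.4f}")
print(f"idf with 1.72m docs: {after:.4f}")
```

The per-term difference is small, but it applies to every query term and every scored document, so rankings near the top-10 cutoff can change.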

Hope this clarifies things.

To be clear, this explains:

Also, when I build the paragraph index (05-01), only 1.72m documents are indexed, whereas the reported count is 1.76m.