castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Build consistent set of Solr/ES integration test corpora #927

Closed lintool closed 4 years ago

lintool commented 4 years ago

Ref: #811

Currently, the Solr integration test script works only on Core18. The ES integration test script works on Robust04 and MS MARCO passage. Let's bring these into alignment - make both Solr/ES work on {Core18, Robust04, and MS MARCO passage}.

lintool commented 4 years ago

Originally planned to be part of v0.7.0 release, punting.

x389liu commented 4 years ago

I've added the test case for Solr on robust04 and it works as expected. But there are some problems for Solr on passage and ES on core18.

  1. For Solr retrieval on passage, when running
    sh target/appassembler/bin/SearchSolr -topicreader TsvString -solr.index msmarco-passage -solr.zkUrl localhost:9983 \
    -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
    -output run.solr.msmarco-passage.txt

    I got an exception org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.36.40.249:8983/solr/msmarco-passage: undefined field define, which is caused by the Solr query at io.anserini.search.SearchSolr.search(SearchSolr.java:211): QueryResponse response = client.query(args.solrIndex, solrq);

Can you confirm that the above command looks okay? If it does, I will dig into the error.

  2. For ES indexing, there is no config file for core18. I created one similar to this, but I'm not sure about its k1 and b values. I tried the default values k1=0.9, b=0.4, but that gives me [FAILED] 0.2401 MAP, expected 0.2495 MAP. Do you have any suggestions on this?

lintool commented 4 years ago

@r-clancy can you help @x389liu out here?

ryan-clancy commented 4 years ago

@x389liu

For the first error, it's because Solr queries are of the form <field>:<query_terms> and several of the queries from passage have define: in them, with the colon being the culprit - Solr thinks define is a field (which isn't in our schema).

ryan@thinkpad ~/sync/git/anserini [master] $ grep "define:" src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
129491  define: dog's life
129517  define: entity's
129641  define: morbid obesity
129684  define: portrait
129792  define: systemic
129837  define: wrongful prosecution
999921  define: shore up
1000083 define: precipitous delivery
1001397 define: barrage
1001903 define: (cancelling)
129565  define: homologate

I think some additional query cleaning here would do the trick.
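As a rough illustration of what such query cleaning could look like, here is a minimal Python sketch, not Anserini's actual code. The function names `escape_query` and `strip_query` are hypothetical, and the character set is assumed from the Lucene/Solr standard query syntax. The first variant backslash-escapes the special characters so that `define:` is no longer parsed as a field name; the second simply replaces them with spaces.

```python
import re

# Characters assumed to be special in the Lucene/Solr standard query syntax.
# (Assumption: this list is for illustration and may not be exhaustive.)
_SPECIAL = r'[+\-&|!(){}\[\]^"~*?:\\/]'


def escape_query(text: str) -> str:
    """Backslash-escape each special character (hypothetical helper)."""
    return re.sub('(' + _SPECIAL + ')', r'\\\1', text)


def strip_query(text: str) -> str:
    """Replace special characters with spaces and collapse whitespace
    (hypothetical helper)."""
    cleaned = re.sub(_SPECIAL, ' ', text)
    return ' '.join(cleaned.split())


if __name__ == '__main__':
    for query in ['define: portrait', 'define: (cancelling)']:
        print(escape_query(query), '|', strip_query(query))
```

On the Java side, SolrJ's `ClientUtils.escapeQueryChars` does a similar kind of escaping and may be a better fit than a hand-rolled regex. Note that stripping also removes hyphens and slashes, which could change tokenization relative to the Lucene run, so whichever variant is used should be checked against the expected scores.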

For the second, nothing jumps out at me. I'd print out the queries sent to Elastic (in SearchElastic), compare them with the queries sent to Lucene (in SearchCollection), and make sure they are exactly the same; it could be a query sanitization issue. I'd also print out the document scores returned by each and compare. Relevant code is here and here.
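One way to do the score comparison, sketched in Python under the assumption that both runs are written in the standard six-column TREC run format (`qid Q0 docid rank score tag`). The helper names `parse_run` and `diff_runs` and the tolerance are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def parse_run(lines):
    """Parse TREC run lines into {qid: {docid: score}}.

    Assumes six whitespace-separated columns: qid Q0 docid rank score tag.
    Lines with a different number of columns are skipped.
    """
    run = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        qid, _, docid, _, score, _ = parts
        run[qid][docid] = float(score)
    return run


def diff_runs(run_a, run_b, tol=1e-4):
    """Return (qid, docid, score_a, score_b) for every disagreement.

    A score of None means the document is missing from that run.
    """
    diffs = []
    for qid in sorted(set(run_a) | set(run_b)):
        docs_a = run_a.get(qid, {})
        docs_b = run_b.get(qid, {})
        for docid in sorted(set(docs_a) | set(docs_b)):
            sa, sb = docs_a.get(docid), docs_b.get(docid)
            if sa is None or sb is None or abs(sa - sb) > tol:
                diffs.append((qid, docid, sa, sb))
    return diffs
```

Feeding it the lines of the Lucene run and the Elastic run for a handful of topics should show whether the gap comes from different documents being retrieved or from the same documents receiving different scores, which in turn hints at whether the query text or the BM25 parameters differ.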

lintool commented 4 years ago

Closed by #1030