castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Getting Started #704

Closed: Ricocotam closed this issue 5 years ago

Ricocotam commented 5 years ago

Hi, I'm new here and I'd like to understand how this tool works. I used ElasticSearch as my retrieval engine for the last few months before finding this. It's a great project, and I think using the same environment for our experiments leads to more trustworthy results.

I'm looking for some information about how to index documents. I'm also looking into how to reproduce papers like Task-Oriented Query Reformulation with Reinforcement Learning (Nogueira 2017), since I'm doing the same kind of work. That kind of work requires an engine that can handle a heavy stream of queries (up to 200 queries/s, or even more if possible).

Thanks for the help, Cheers

lintool commented 5 years ago

If you look at our README, it contains regressions and guides on how to get started with a bunch of document collections.

In particular, what corpus are you working with?

cc @rodrigonogueira4

Ricocotam commented 5 years ago

I'm currently working on MS-MARCO, so no problem there (I have some issues reproducing the regressions, but I haven't put much effort into it). I plan to use custom documents from private datasets, though that's not gonna happen super soon. I just like understanding what's going on :)

Actually, what I'd like to do is send queries to the engine in small batches (like 32 queries). With ES this was done through the HTTP REST API, but I don't understand how this could be done with Anserini. It might not be possible, or it might not be what the project is meant for.
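
To make the batching pattern concrete, here is a minimal, engine-agnostic sketch in Python. It is not Anserini's API: `search_one`, `fake_search`, `batched`, and `search_in_batches` are hypothetical names introduced only for illustration, and `fake_search` returns made-up document ids in place of a real retrieval call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def search_in_batches(
    queries: Sequence[str],
    search_one: Callable[[str], List[str]],
    batch_size: int = 32,
    workers: int = 8,
) -> List[List[str]]:
    """Run `search_one` on every query, one batch at a time, keeping order."""
    results: List[List[str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(queries, batch_size):
            # pool.map yields results in the same order as the input batch.
            results.extend(pool.map(search_one, batch))
    return results


def fake_search(query: str) -> List[str]:
    # Hypothetical stand-in for a real engine call; the doc ids are made up.
    return [f"{query}-doc{i}" for i in range(3)]


if __name__ == "__main__":
    queries = [f"q{i}" for i in range(70)]
    hits = search_in_batches(queries, fake_search)
    print(len(hits), hits[0])
```

Swapping `fake_search` for a function that calls the actual engine (an HTTP request to ES, or a local searcher) leaves the batching logic unchanged. Whether threads actually increase throughput depends on the engine and is an assumption here.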

Ricocotam commented 5 years ago

I read a bit more and in the end it was pretty clear; I just had to take more time reading. Thanks for the help :)