castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0
1.66k stars, 371 forks

[Question] Pyserini in production application? #1594

Closed: steve-marmalade closed this issue 1 year ago

steve-marmalade commented 1 year ago

Hello, thanks for this useful toolkit! My question is regarding the intended use of this tool: am I correct that the authors intend Pyserini only to be used for research / batch evaluations, and that a tool like Elasticsearch should be used for handling live searches in production?

As a data scientist who is using Pyserini for establishing BM25 baselines and evaluating hybrid search approaches, there's something appealing about putting a REST API directly in front of the tool that I'm already using. However, the lack of coverage on this topic makes me think this is not recommended.

lintool commented 1 year ago

Hi @steve-marmalade - apologies for the late response... but we've actually written an entire paper about this topic!

See: https://dl.acm.org/doi/10.1145/3488560.3502186

tl;dr - yup, if you want something quick-and-dirty, you can just throw up a REST API in front of Pyserini. It's convenient for rapid prototyping! But when you start moving your application into production... you'll soon discover that Pyserini's missing a feature (or two). Sure, you can implement it rather quickly since you've got access to Lucene and all the "kit of parts"... but after a few iterations... as you move towards production... you start replicating a "full featured" platform like Elasticsearch, OpenSearch, Solr, etc. You end up kinda re-inventing the wheel for all those "mundane" non-research-y features that aren't included in Pyserini...
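(For readers wondering what the quick-and-dirty prototype might look like, here is a minimal sketch, not something taken from the paper or from Pyserini itself. It assumes Pyserini's `LuceneSearcher` with the prebuilt `msmarco-v1-passage` index; the `/search` route, the `q` and `k` parameters, and the helper names `make_pyserini_search`, `make_handler`, and `serve` are all made up for illustration. It uses only the standard library's `http.server`, so it has none of the production features mentioned above: no auth, no caching, no concurrency tuning.)

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


def make_pyserini_search(index_name="msmarco-v1-passage"):
    """Return a search(query, k) function backed by Pyserini.

    Pyserini is imported lazily so the HTTP layer can be exercised without it.
    The index name is an assumption; substitute whichever index you use.
    """
    from pyserini.search.lucene import LuceneSearcher
    searcher = LuceneSearcher.from_prebuilt_index(index_name)

    def search(query, k):
        hits = searcher.search(query, k=k)
        return [{"docid": h.docid, "score": float(h.score)} for h in hits]

    return search


def make_handler(search_fn):
    """Build a request handler that answers GET /search?q=...&k=... with JSON."""

    class SearchHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            url = urlparse(self.path)
            params = parse_qs(url.query)
            if url.path != "/search" or "q" not in params:
                self.send_error(400, "expected /search?q=...&k=...")
                return
            try:
                k = int(params.get("k", ["10"])[0])
            except ValueError:
                self.send_error(400, "k must be an integer")
                return
            body = json.dumps(search_fn(params["q"][0], k)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            # Silence per-request logging for the prototype.
            pass

    return SearchHandler


def serve(search_fn, host="127.0.0.1", port=8080):
    """Block forever, serving search requests on host:port."""
    HTTPServer((host, port), make_handler(search_fn)).serve_forever()


# Hypothetical usage (requires Pyserini, a JVM, and the index download):
#   serve(make_pyserini_search())
#   curl 'http://127.0.0.1:8080/search?q=what+is+bm25&k=5'
```

(Keeping the search function separate from the HTTP handler means the server can be tested with a stub, and the Pyserini-backed function can later be swapped for a client to a full search platform without touching the callers.)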

Does this answer help?

steve-marmalade commented 1 year ago

Yes, super clear, thanks for the thoughtful response 🙏