beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Question on the BM25 implementation #69

Closed jordane95 closed 2 years ago

jordane95 commented 2 years ago

Hi, I read your BM25 code and see that it is backed by Elasticsearch, which I'm not very familiar with. May I ask a few questions about it?

  1. What tokenizer does Elasticsearch use? A BERT-like subword tokenizer, or a custom one?
  2. Can Elasticsearch, as used in your implementation, handle BERT-tokenized text? For example, suppose the sentence "I like beir" becomes "101 146 1176 1129 3161 102" after BERT tokenization (still as a string). If we feed that string of IDs to BM25, will it be split into ["101", "146", "1176", "1129", "3161", "102"], with each token ID indexed as a single word?
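
To make question 2 concrete, here is a minimal sketch of the behaviour I am hoping for. As far as I understand, Elasticsearch applies its built-in `standard` analyzer to text fields by default, and it also ships a built-in `whitespace` analyzer that splits only on whitespace. The index body below is my own assumption, not taken from the BEIR code: the field names `title` and `txt` are hypothetical, and `whitespace_tokens` is only a local stand-in that imitates the analyzer without contacting a server.

```python
from typing import List

# Hypothetical index body: the field names "title" and "txt" are assumptions
# for illustration. "whitespace" is a built-in Elasticsearch analyzer.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "whitespace"},
            "txt": {"type": "text", "analyzer": "whitespace"},
        }
    }
}


def whitespace_tokens(text: str) -> List[str]:
    """Local stand-in for a whitespace analyzer: split on runs of
    whitespace, with no lowercasing or other normalization."""
    return text.split()


# The string of BERT token IDs from the example above.
ids = "101 146 1176 1129 3161 102"
tokens = whitespace_tokens(ids)
print(tokens)
```

If the index behaves like this stand-in, each ID would be one term for BM25. Whether the special-token IDs (101 and 102 here) should be stripped before indexing is a separate choice that this sketch does not address.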