Hi, I read your code regarding BM25 and see that it's backed by Elasticsearch. I'm not so familiar with it, so may I ask you a few questions?
What tokenizer does Elasticsearch use? A BERT-like subword tokenizer, or a custom one?
Can Elasticsearch in your implementation handle BERT-tokenized text? For example, assume the sentence "I like beir" corresponds to "101 146 1176 1129 3161 102" after BERT tokenization (still in string format). If we feed the string of IDs to BM25, can it be split into ["101", "146", "1176", "1129", "3161", "102"], where each token ID is treated as a single word to be indexed?
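To make the question concrete, here is a small sketch of the behavior I'm asking about. The index settings dict and analyzer name below are hypothetical examples of a whitespace analyzer, not taken from your code:

```python
# Sketch: treating a space-separated string of BERT token IDs as one term per ID.
bert_ids = "101 146 1176 1129 3161 102"  # "I like beir" after BERT tokenization

# Plain whitespace splitting yields the per-ID terms I have in mind:
terms = bert_ids.split()
print(terms)  # ['101', '146', '1176', '1129', '3161', '102']

# A whitespace tokenizer in Elasticsearch should index the same terms.
# Hypothetical index settings (analyzer/field names are made up for illustration):
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "id_analyzer": {"type": "custom", "tokenizer": "whitespace"}
            }
        }
    },
    "mappings": {
        "properties": {"txt": {"type": "text", "analyzer": "id_analyzer"}}
    },
}
```

So the question is whether your BM25 setup can be configured this way, or whether the default analyzer would mangle the ID strings.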