Open aindlq opened 2 months ago
@aindlq Can you tell us how large your text corpus is, that is, how many word occurrences? And yes, as a very first step this should be configurable.
Right now it is:
INFO: Statistics for text index: #words = 10,714,805, #blocks = 42,960
That is only a subset of the data we are going to load. I guess the final corpus will be 5 to 10 times that size.
@aindlq Have you tried just lowering the value? For 10M words, even MIN_WORD_PREFIX_SIZE = 1 should be fine. For 100M words, MIN_WORD_PREFIX_SIZE = 2 might still be OK.
@hannahbast It seems to be working fine with MIN_WORD_PREFIX_SIZE = 3, with a minor change to:
// Note: despite its name, this version now yields three-letter prefixes.
static cppcoro::generator<std::string> fourLetterPrefixes() {
  // ....
  for (char a : chars()) {
    for (char b : chars()) {
      for (char c : chars()) {
        std::string s{a, b, c};
        co_yield s;
      }
    }
  }
}
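The nested loops above hard-code the prefix length, so each change to MIN_WORD_PREFIX_SIZE needs a matching code edit. Below is a minimal sketch of a length-parameterized alternative. The helper name `prefixesOfLength`, its signature, and the explicit `alphabet` argument are assumptions made for illustration and do not come from QLever. It returns a plain `std::vector` rather than the `cppcoro::generator` used in the snippet, to stay self-contained.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of QLever): enumerate every string of exactly
// `length` characters drawn from `alphabet`, ordered by alphabet position.
std::vector<std::string> prefixesOfLength(std::size_t length,
                                          const std::string& alphabet) {
  // Start with the single prefix of length 0 and extend it one character
  // at a time.
  std::vector<std::string> result{""};
  for (std::size_t i = 0; i < length; ++i) {
    std::vector<std::string> next;
    next.reserve(result.size() * alphabet.size());
    for (const std::string& prefix : result) {
      for (char c : alphabet) {
        next.push_back(prefix + c);
      }
    }
    result = std::move(next);
  }
  return result;
}
```

The result holds alphabet.size() to the power of `length` entries, so materializing it is only reasonable for small lengths such as 3 or 4. A generator-based version would avoid holding all prefixes in memory at once.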
The default value of MIN_WORD_PREFIX_SIZE is 4, which can be a bit too high for some searches. Is it possible to make it configurable? Would it work if I just changed it to 3? It says here https://github.com/ad-freiburg/qlever/blob/14d6e1cc26b53f2ff563e98fcee57135df3537bb/src/index/IndexImpl.Text.cpp#L482: