Text Index. Make MIN_WORD_PREFIX_SIZE configurable.

ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.

Apache License 2.0

372 stars, 44 forks

Text Index. Make MIN_WORD_PREFIX_SIZE configurable. #1391

Open. aindlq opened this issue 2 months ago

aindlq commented 2 months ago

The default value of MIN_WORD_PREFIX_SIZE is 4, which can be too high for some searches. Is it possible to make it configurable? Would it work if I just changed it to 3? The comment at https://github.com/ad-freiburg/qlever/blob/14d6e1cc26b53f2ff563e98fcee57135df3537bb/src/index/IndexImpl.Text.cpp#L482 says:

If you need this to be changed, please contact the developers

hannahbast commented 2 months ago

@aindlq Can you tell us how large your text corpus is, that is, how many word occurrences? And yes, as a very first step this should be configurable.

aindlq commented 2 months ago

Right now it is:

INFO: Statistics for text index: #words = 10,714,805, #blocks = 42,960

That is only a subset of the data that we are going to load. I expect the final corpus to be 5 to 10 times that size.

hannahbast commented 2 months ago

@aindlq Have you tried just lowering the value? For 10M words, even MIN_WORD_PREFIX_SIZE = 1 should be fine. For 100M words, MIN_WORD_PREFIX_SIZE = 2 might still be OK.
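To put the numbers in this thread side by side, here is a rough back-of-the-envelope sketch in C++. It assumes (the thread does not state this) that text index blocks are roughly delimited by word prefixes of length MIN_WORD_PREFIX_SIZE, so a smaller value means fewer and larger blocks. The helper names `maxPrefixes` and `averageWordsPerBlock` are hypothetical and not part of QLever, and the alphabet size is a free parameter because the thread does not say which characters the index uses.

```cpp
#include <cstdint>

// Hypothetical helper (not QLever code): upper bound on the number of distinct
// prefixes of length `prefixSize` over an alphabet of `alphabetSize` symbols,
// i.e. alphabetSize raised to the power prefixSize.
constexpr std::uint64_t maxPrefixes(std::uint64_t alphabetSize,
                                    unsigned prefixSize) {
  std::uint64_t n = 1;
  for (unsigned i = 0; i < prefixSize; ++i) {
    n *= alphabetSize;
  }
  return n;
}

// Hypothetical helper (not QLever code): average number of word occurrences
// per block, rounded down. Returns 0 if there are no blocks.
constexpr std::uint64_t averageWordsPerBlock(std::uint64_t numWords,
                                             std::uint64_t numBlocks) {
  return numBlocks == 0 ? 0 : numWords / numBlocks;
}
```

With the statistics reported above, `averageWordsPerBlock(10714805, 42960)` evaluates to 249. For an assumed 26-letter alphabet, `maxPrefixes(26, 4)` is 456,976 and `maxPrefixes(26, 3)` is 17,576, which gives a feel for how quickly the number of possible blocks shrinks as the prefix size is lowered.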

aindlq commented 2 months ago

@hannahbast It seems to work fine with MIN_WORD_PREFIX_SIZE = 3, after a minor change to fourLetterPrefixes():

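static cppcoro::generator<std::string> fourLetterPrefixes() {
....
  // Three nested loops, so each yielded prefix has three characters
  // (despite the unchanged function name).
  for (char a : chars()) {
    for (char b : chars()) {
      for (char c : chars()) {
        std::string s{a, b, c};
        co_yield s;
      }
    }
  }
}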
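
If the prefix size becomes configurable, the hard-coded nesting depth in the snippet above would have to follow it. A minimal sketch of one way to do that is shown below. It is not the QLever implementation: `prefixesOfLength` is a hypothetical helper, it returns a `std::vector` instead of a `cppcoro::generator` to stay within the standard library, and the alphabet is passed in explicitly because the body of `chars()` is elided in the thread.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper (not QLever code): returns every string of exactly
// `length` characters over `alphabet`. The order matches that of nested loops
// with the first character in the outermost loop.
std::vector<std::string> prefixesOfLength(std::string_view alphabet,
                                          std::size_t length) {
  // Start from the single empty prefix and extend it one character per round.
  std::vector<std::string> result{""};
  for (std::size_t i = 0; i < length; ++i) {
    std::vector<std::string> next;
    next.reserve(result.size() * alphabet.size());
    for (const std::string& prefix : result) {
      for (char c : alphabet) {
        next.push_back(prefix + c);
      }
    }
    result = std::move(next);
  }
  return result;
}
```

Calling `prefixesOfLength("ab", 3)` returns the eight strings from "aaa" to "bbb". The vector holds all prefixes in memory at once, which is fine for short prefixes but grows exponentially with `length`; a generator, as in the original function, avoids that.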