diplodoc-platform / diplodoc

Entrypoint to Diplodoc platform

Documentation search comparison #17

Open danonechik94 opened 6 months ago

danonechik94 commented 6 months ago

Hello! I recently had an idea to compare different libraries and services that can be used to search docs or any other kind of text. For this purpose, I built a search index in opensearch, flexsearch and lunr.js and ran the same set of search queries against all three.

The source index consists of 1136 entries, each with a title, a short description and the full content. The content is documentation on various topics, mostly cloud-related. All three libraries/services were set up to search within these three fields.

Here's the link to the document, where I provided all the data I ended up with.

Let me know if you need more info!

vsesh commented 6 months ago

@3y3

danonechik94 commented 6 months ago

Hey! I compared one more lib and compiled research results in the following table. Source data is the same as in the doc I posted above:

| metric | lunr.js | fuse.js | flexsearch | opensearch |
| --- | --- | --- | --- | --- |
| search result highlighting | yes | yes | no | yes |
| additional data storage | yes (not directly in results, but it is easy to implement data storage in a separate JS object) | yes | yes | yes |
| spellcheck | no | yes/no (fuzzy search works like spellcheck) | no | yes |
| ability for partial search of a sentence | yes | no | no | yes |
| support for additional languages | yes (14 languages, including Arabic, Russian and Hebrew) | no | yes (built-in Arabic and Cyrillic) | yes |
| configuration ability and flexibility | has some boost options for fields, but parts of the API are not documented properly; not much configuration ability | rich configuration ability: you can adjust the scoring threshold, give fields weights and provide custom functions for sorting and getters | pretty good configuration ability for memory/search complexity, no configuration for field weights (only if you use strict tokenizers and context search), a lot of useless configuration options | - |
| maturity/maintenance | project has been abandoned since 2020, no issues or PRs are being merged, 100 open issues out of 250 resolved. 2m downloads weekly | maintained irregularly with a pause in 2023, last update is recent, was actively maintained in 2021 and 2022. Issues are actively fixed. 2.5m downloads weekly | project looks abandoned since 2022. 55 issues not answered out of 250 resolved, 238k downloads | - |
| search relevance | In some cases lunr does not find the best results and sometimes some of the results are missing, but basic relevance stays fine. Due to the lack of spellchecking, some results are not found. lunr.js is also weak at partial string search: a good example is the "ipv" search, which was supposed to find ipv4 and ipv6 results but found nothing. The scoring of the results is also unclear, and lunr.js often places more relevant results further down the list. Compared to opensearch, lunr sometimes does not find the best results and sometimes misses some results, but as mentioned above, it finds satisfactory results. | Same as opensearch, the fuzzy search feature returns too many irrelevant results, especially if you set a high threshold. But if the threshold is really low, fuse produces too few results and often finds nothing or finds irrelevant results. For a small number of cases it produces much better results than the others. It also takes a lot of time to complete most searches, going up to 100ms for one search. | The strongest point of flexsearch is partial search. It does not have the ability to do partial sentence search. Sometimes results are irrelevant due to partial search, or flexsearch simply finds some irrelevant results when it seems that no search hits should have occurred. Flexsearch is also weak at ranking. Overall, a good portion of the results are unsatisfactory. | It is the benchmark. Most issues in opensearch come from spellcheck and sometimes bad ranking. |

search result highlighting - indices of the found entries in search results
additional data storage - the ability to get the underlying index data along with the search results
spellcheck - built-in spellcheck
ability for partial search of a sentence - the search query also finds entries that match only part of the query ("container registry credentials" searches the index with "registry credentials" or just "credentials")
support for additional languages - built-in or external support for other languages
configuration ability and flexibility - configurability of the index (score, sorting) and search
maturity/maintenance - whether the lib is actively maintained and how many downloads it has on npm
search relevance - subjective assessment of the search result tests
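
For the partial-search and spellcheck gaps noted for lunr.js in the table (for example the "ipv" query), one thing that may be worth trying is lunr's query syntax for trailing wildcards (`term*`) and fuzzy matches (`term~1`). The helper below is a minimal sketch, not part of lunr.js: the name `toTolerantQuery` and its options are hypothetical, and whether it actually recovers the ipv4/ipv6 entries depends on the index pipeline and was not part of the comparison above.

```javascript
// Hypothetical helper (not a lunr.js API): rewrites a plain user query into a
// lunr query string where each term may also match as a prefix or fuzzily.
function toTolerantQuery(query, options = {}) {
  const {prefix = true, editDistance = 0} = options;

  const terms = query
    .split(/\s+/)
    // Drop characters that lunr's query parser treats as operators.
    .map((term) => term.replace(/[*:~^]/g, '').replace(/^[+-]+/, ''))
    .filter(Boolean);

  return terms
    .flatMap((term) => {
      const variants = [term]; // keep the exact term as well
      if (prefix) variants.push(`${term}*`); // trailing wildcard
      if (editDistance > 0) variants.push(`${term}~${editDistance}`); // fuzzy match
      return variants;
    })
    .join(' ');
}
```

The resulting string can be passed to `index.search(...)` in place of the raw query. Keeping the exact term next to its wildcard and fuzzy variants is a design choice that lets exact hits still contribute to the score.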

vsesh commented 2 months ago

We decided to use the lunr.js library in the end. Here are a couple of suggestions on how to use it with YFM.

You can generate a search-index (JSON file) using the files.json file.

...
const fileListContent = readFileSync('files.json', 'utf-8');
const list = JSON.parse(fileListContent).files;

// Build one index entry per page, skipping pages that have no source file.
const indexItems = list.reduce((result, pagePath) => {
  const indexItem = processPage(pagePath);
  if (indexItem) {
    result.push(indexItem);
  }
  return result;
}, []);

writeFileSync('search-index.json', JSON.stringify(indexItems), 'utf-8');
...
function processPage(pagePath) {
  const localPagePath = join(docsRootPath, pagePath);
  const filePath = existsSync(`${localPagePath}.md`) ? `${localPagePath}.md` : join(localPagePath, `index.md`);

  if (!existsSync(filePath)) {
    return null;
  }

  const fileContent = readFileSync(filePath, 'utf-8');
  const transformResult = yfmTransform(fileContent, {
    path: filePath,
    root: docsRootPath,
    allowHTML: true,
  });
  const {html, title = ''} = transformResult.result;

  return {
    id: pagePath,
    title,
    content: stripHtml(html).result,
  };
}

After that, you can load the search index and initialize the lunrJS instance.

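// Declared up front so that the search method can reach them after loading.
let index;
const itemById = new Map();

fetch('search-index.json')
  .then((indexDataRes) => indexDataRes.json())
  .then((indexData) => {
    index = lunr(function () {
      // Matches in the title weigh slightly more than matches in the content.
      this.field('title', {boost: 5});
      this.field('content', {boost: 4});

      // Keep match positions in the results so that hits can be highlighted.
      this.metadataWhitelist = ['position'];

      indexData.forEach((indexItem) => {
        this.add(indexItem);
        // lunr returns only the item id (`ref`), so keep the full items nearby.
        itemById.set(indexItem.id, indexItem);
      });
    });
  })
  .catch(console.error);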
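
In the snippet above, `index` is assigned only after the fetch resolves, so a search issued before that would run against an undefined index. One way to avoid this, sketched below under assumptions, is to keep the loading promise and await it inside the search function. `createSearch` and `loadIndex` are hypothetical names; `loadIndex` stands for any function that returns a promise of an object with a `search(query)` method, such as the lunr index built above.

```javascript
// Hypothetical wrapper, not part of lunr.js or YFM.
function createSearch(loadIndex) {
  let indexPromise = null;

  return async function search(query) {
    // Start loading on the first call and reuse the same promise afterwards,
    // so concurrent searches do not trigger several downloads.
    if (!indexPromise) {
      indexPromise = loadIndex();
    }
    const loadedIndex = await indexPromise;
    return loadedIndex.search(query);
  };
}
```

With this shape, loading starts lazily on the first query. Calling `loadIndex()` eagerly at page load and passing in a function that returns the stored promise would work equally well.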

And the search method:

async function search(query) {
  // Each raw lunr result carries `ref`, the id of the matching index item.
  const rawResults = index.search(query);
  return rawResults.reduce((acc, item) => {
    const {ref} = item;
    const indexItem = itemById.get(ref);
    if (indexItem) {
      // Return a shallow copy so that callers cannot mutate the stored item.
      acc.push({...indexItem});
    }
    return acc;
  }, []);
}
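
Since the index is built with `metadataWhitelist = ['position']`, each lunr result also exposes match positions under `matchData.metadata[term][field].position` as `[start, length]` pairs. The search method above drops them, but they could be used for highlighting. The sketch below is one possible way to do that; `collectPositions` and `highlight` are hypothetical helpers, and the text is not HTML-escaped here, which real code would need to handle.

```javascript
// Collect the [start, length] pairs for one field from a lunr-style result.
function collectPositions(result, field) {
  const metadata = (result.matchData && result.matchData.metadata) || {};
  const positions = [];
  for (const term of Object.keys(metadata)) {
    const fieldData = metadata[term][field];
    if (fieldData && Array.isArray(fieldData.position)) {
      positions.push(...fieldData.position);
    }
  }
  // Sort by start offset so the text can be processed left to right.
  return positions.sort((a, b) => a[0] - b[0]);
}

// Wrap every matched range in markers; ranges overlapping an earlier one are skipped.
function highlight(text, positions, open = '<mark>', close = '</mark>') {
  let out = '';
  let cursor = 0;
  for (const [start, length] of positions) {
    if (start < cursor) continue;
    out += text.slice(cursor, start) + open + text.slice(start, start + length) + close;
    cursor = start + length;
  }
  return out + text.slice(cursor);
}
```

Inside the `reduce` callback of the search method, something like `highlight(indexItem.content, collectPositions(item, 'content'))` could then be attached to each returned entry.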