Threshold 0 not working as documented

valstu commented 7 months ago

Describe the bug

Threshold section on documentation mentions following:

In this case, the threshold property is set to 0, which means that only the document containing the "slim fit" keywords will be returned. This applies to all the document properties; if a keyword is found in a property, and another keyword is found in a different property, the document will be returned.

Meaning that if threshold is set to 0 only the documents that have both keywords "slim" and "fit" will be returned. I noticed few cases where this was not true.

When using threshold 0, the search results sometimes included documents that didn't actually have both searched keywords but their score was higher than the ones that did have. This was probably due to the length difference of these documents, one of them was short and one of them was super long. Meaning the super long document got better score even if only one search term was found.

I tried to figure out why this happens and found out that in the comments algorithms.ts we are making an assumption that

// Find the index of the last result with all keywords. // Note that since score is multipled by 1.5 any time the token is encountered in results it means // that tokenScores and tokenKeywordsCount should always have the same order.

So we are trusting that tokenScores and tokenKeywordsCount are similarly sorted. Then we find the index of last item index in tokenKeywordsCount where tokenKeywordCount is equal to keywordsCount

Then this index is used to slice the results array, which goes wrong because sometimes the tokenScores and tokenKeywordsCount are not in same order.

While investigating this I also encountered this issue #641

To Reproduce

I made an simple example here https://replit.com/@valstu/FunctionalSpiffyPolygon#index.js This example indexes two documents, one long and one short one.

Run the index files to see the search results. Threshold is set to 0, one would expect the search to return the short file since it contains both of the keywords, as defined in documentation.

Expected behavior

When threshold is set to 0 you would expect the search to return only documents that have all the keywords, how ever this is not the case everytime.

Environment Info

OS: Replit
Node: 20.10.0
Orama: 2.0.7

Affected areas

Search

Additional context

No response

valstu commented 7 months ago

I wonder if it would make sense to create just single map with both, score and amount of tokens?

So this is current implentation:

const tokenScoresMap = new Map<InternalDocumentID, number>()
const tokenKeywordsCountMap = new Map<InternalDocumentID, number>()

const mapsLength = arrays.length
for (let i = 0; i < mapsLength; i++) {
  const arr = arrays[i]

  const entriesLength = arr.length
  for (let j = 0; j < entriesLength; j++) {
    const [token, score] = arr[j]
    const boostScore = score * boost
    const oldScore = tokenScoresMap.get(token)

    if (oldScore !== undefined) {
      tokenScoresMap.set(token, oldScore * 1.5 + boostScore)
      tokenKeywordsCountMap.set(token, tokenKeywordsCountMap.get(token)! + 1)
    } else {
      tokenScoresMap.set(token, boostScore)
      tokenKeywordsCountMap.set(token, 1)
    }
  }
}

what if it was something like this:

const tokenScoresMap = new Map();

const mapsLength = arrays.length;
for(let i = 0; i < mapsLength; i++){
    const arr = arrays[i];
    const entriesLength = arr.length;
    for(let j = 0; j < entriesLength; j++){
        const [token, score] = arr[j];
        const boostScore = score * boost;
        const oldScore = tokenScoresMap.get(token)?.[1];
        if (oldScore !== undefined) {
            tokenScoresMap.set(token, [oldScore * 1.5 + boostScore, tokenScoresMap.get(token)[1] + 1]);

        } else {
            tokenScoresMap.set(token, [boostScore, 1]);

        }
    }
}

(this would require minor changes to code below as well but maybe this helps)

Then we could sort the map based on score or tokenKeywordsCount and the order would be guaranteed to be the same. If threshold is 1 we could return sorted results based on score. If threshold is less than one we could sort it first by token count and secondarily based on score. This way the the calculation of lastTokenWithAllKeywords would work as intended. 🤔

valstu commented 7 months ago

I could do a PR about this as well, it might make it easier to grasp what I'm meaning :)

valstu commented 7 months ago

I updated the Replit example:

Published my PR version to NPM
Imported both, my PR version and Orama 2.0.7
Added comparison between with my version and Orama 2.0.7

askorama / orama