andrewnguonly / Lumos

A RAG LLM co-pilot for browsing the web, powered by local LLMs

Embeddings cache regression (^h^h^h confusion) #90

Closed by sublimator 6 months ago

sublimator commented 7 months ago

I just updated to commit 72439bfc8391814ad0b933534cc8c37dc9101de7 but it seems like there is a regression?

[screenshot]

The TTL is 60 minutes but it seems like it's requesting a series of embeddings for each query.

Ok, so I uninstalled it, then reinstalled it, in case my chrome storage options got in a wonky state somehow.

It's then not showing the connection indicator for the model (which I /was/ seeing at first!), and the request is 404ing:

[screenshots]

So, back to the embeddings: I've removed/reinstalled. Once I select a model in the options, hopefully we are good?

Response:

[screenshot]

Lots of embeddings (long page):

[screenshot]

Hrmmm, it definitely seems like it's calling the embedding endpoint many times for each query. I could have sworn you were caching; that's what the TTL means, right!?
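
For context, my mental model of what the TTL implies is roughly this (a sketch with made-up names, not the actual Lumos cache code):

// Sketch only: my mental model of a TTL'd embeddings cache, with
// hypothetical names. Not the actual Lumos implementation.
interface CacheEntry {
  embeddings: number[][]; // one vector per chunk
  createdAt: number;      // epoch millis
}

const EMBEDDINGS_TTL_MS = 60 * 60 * 1000; // the "60 minutes" from options
const cache = new Map<string, CacheEntry>();

const getOrEmbed = async (
  url: string,
  chunks: string[],
  embed: (chunks: string[]) => Promise<number[][]>,
): Promise<number[][]> => {
  const entry = cache.get(url);
  if (entry && Date.now() - entry.createdAt < EMBEDDINGS_TTL_MS) {
    // Fresh entry: no calls to the embedding endpoint at all.
    return entry.embeddings;
  }
  // Miss or expired: embed once, then reuse for subsequent queries.
  const embeddings = await embed(chunks);
  cache.set(url, { embeddings, createdAt: Date.now() });
  return embeddings;
};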

Oh, it's not cached when isHighlightedContent is true:

          chrome.runtime.sendMessage({
            prompt: prompt,
            skipRAG: false,
            chunkSize: config.chunkSize,
            chunkOverlap: config.chunkOverlap,
            url: activeTabUrl.toString(),
            // any non-empty selection sets this, bypassing the embeddings cache
            skipCache: isHighlightedContent,
            imageURLs: imageURLs,
          });

Based on:

const getHighlightedContent = (): string => {
  const selection = window.getSelection();
  return selection ? selection.toString().trim() : "";
};

Oh, I see! I guess it's a bit complicated to use the cache easily, eh?

Hrmmmmmm, there are other optimizations you could do, but compared to creating completely new embeddings, even a simple linear search over the highlighted string, to see if it contains any of the chunks that would otherwise be returned by the configured parser (i.e. the "canonical" chunks?), would be cheap.
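
Something like this is what I have in mind (hypothetical helper, assuming the canonical chunks and their cached embeddings are already available):

// Sketch of the idea: before embedding a selection from scratch, check
// whether it is literally made of chunks the parser already produced
// (and whose embeddings are presumably already cached). Hypothetical
// names; not code from the repo.
const reusableChunks = (
  highlighted: string,
  canonicalChunks: string[],
): string[] => {
  // Simple linear scan; fine for the chunk counts a single page produces.
  return canonicalChunks.filter((chunk) => highlighted.includes(chunk));
};

Anything in the selection that isn't covered by canonical chunks would still need fresh embeddings, of course, so it's only a partial win.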

andrewnguonly commented 6 months ago

A few comments...

  1. My thinking behind skipping the cache when highlighted content is selected is that a typical use case would be to highlight "smaller" chunks of text on a page that didn't already have a dedicated content parser (e.g. an infrequently visited site). In this case, embedding should be quick and a user would likely move on to highlight a different part of the page, which means the previous chunk doesn't need to be cached.
  2. It's still possible that a user "selects all" content (ctrl+a). In this case, vector search is still valuable.
  3. I've updated the search logic to use a combination of cosine similarity and keyword fuzziness search (a rough sketch of that kind of hybrid scoring follows this list).
  4. The UI/UX doesn't give any indication that the cache is skipped when highlighted content is parsed. I can make some quick improvements here (e.g. documentation + messaging in the app).
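
For concreteness, that kind of hybrid scoring looks roughly like the following (illustrative names and weights only, not the exact code that was merged):

// Sketch of hybrid retrieval scoring: blend vector similarity with a crude
// keyword-overlap score. Hypothetical names and weights; not Lumos's code.
const cosineSimilarity = (a: number[], b: number[]): number => {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
};

// Very rough "fuzziness": fraction of query terms that appear in the chunk.
const keywordOverlap = (query: string, chunk: string): number => {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const haystack = chunk.toLowerCase();
  const hits = terms.filter((t) => haystack.includes(t)).length;
  return hits / terms.length;
};

const hybridScore = (
  queryEmbedding: number[],
  chunkEmbedding: number[],
  query: string,
  chunk: string,
  alpha = 0.7, // weight on the vector score; 0.7 is an arbitrary choice
): number =>
  alpha * cosineSimilarity(queryEmbedding, chunkEmbedding) +
  (1 - alpha) * keywordOverlap(query, chunk);
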
sublimator commented 6 months ago

The UI/UX doesn't give any indication that the cache is skipped when highlighted content is parsed

I haven't thought about this thoroughly. That said, my instinct is that you could/should cache embeddings depending upon the size of the selection (see your point above that it's still possible a user "selects all", plus other large-selection cases).
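
Concretely, I'm imagining something along these lines (the threshold is a made-up placeholder, not a value from the repo):

// Sketch only: skip the cache for small, throwaway selections but keep
// caching large ones ("select all" and the like). The threshold is an
// arbitrary placeholder, not a value from the repo.
const SELECTION_CACHE_THRESHOLD = 4_000; // characters, made up

const shouldSkipCache = (highlightedContent: string): boolean => {
  const isHighlighted = highlightedContent.length > 0;
  const isSmallSelection =
    highlightedContent.length < SELECTION_CACHE_THRESHOLD;
  // Only bypass the cache when the selection is both present and small.
  return isHighlighted && isSmallSelection;
};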

The issue was a legit log of my experience. I really thought there was a regression, and it wasn't until I rummaged around in the code that I discovered what was happening.

Bypassing the cache is probably the sort of behaviour that many would find reasonable in its simplicity, provided they knew what was happening.

Which is to say, yeah, some kind of indication would help!