andrewnguonly / Lumos

A RAG LLM co-pilot for browsing the web, powered by local LLMs
MIT License

Separate (generally smaller) embeddings model? #77

Closed sublimator closed 8 months ago

sublimator commented 9 months ago

Using tinyllama

[GIN] 2024/02/10 - 12:33:07 | 200 |   51.673542ms |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/02/10 - 12:33:07 | 200 |    51.70225ms |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/02/10 - 12:33:07 | 200 |   51.951042ms |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/02/10 - 12:33:07 | 200 |   43.755125ms |       127.0.0.1 | POST     "/api/embeddings"

Maybe you could "get away with" using a smaller, separate model for the embeddings to make things a bit more responsive?
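
Roughly what I'm picturing (just a sketch; the model name is a placeholder), hitting the same /api/embeddings endpoint as in the logs above but with a smaller model:

```ts
// Sketch: request an embedding from Ollama's /api/embeddings endpoint with a
// (hypothetically smaller) model than the one used for generation.
const OLLAMA_BASE_URL = "http://localhost:11434"; // default Ollama address

async function embedChunk(chunk: string, model = "tinyllama"): Promise<number[]> {
  const response = await fetch(`${OLLAMA_BASE_URL}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt: chunk }),
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed: ${response.status}`);
  }
  const { embedding } = (await response.json()) as { embedding: number[] };
  return embedding;
}
```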

sublimator commented 9 months ago

Well it's not working very well with tinyllama, but regardless :)

sublimator commented 9 months ago

Upping to Phi-2 seemed a bit better, FWIW.

andrewnguonly commented 9 months ago

Well it's not working very well with tinyllama, but regardless :)

Interesting idea. By "not working very well", do you mean the retrieval/search results were bad and resulted in a bad response overall?

sublimator commented 9 months ago

Hi @andrewnguonly

I was just doing quick checks like "what did user x say?" on a page of comments and it wasn't getting things right. TBH, I'd have to compare results against a SoTA embedding/LLM pairing to calibrate my expectations for specific queries like that.

In any case, I think being able to set an embedding model to a smaller model for responsiveness could be a useful thing.

250ms vs 100ms per chunk is substantial.
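
Those numbers are just from eyeballing the server logs; a throwaway loop like this would make the per-chunk comparison reproducible (assumes the embedChunk sketch from my first comment):

```ts
// Throwaway benchmark: average embedding latency per chunk for a given model.
async function benchmarkEmbeddings(model: string, chunks: string[]): Promise<number> {
  const start = performance.now();
  for (const chunk of chunks) {
    await embedChunk(chunk, model);
  }
  return (performance.now() - start) / chunks.length; // ms per chunk
}

// e.g. compare a candidate embedding model against the main chat model:
// console.log(await benchmarkEmbeddings("tinyllama", chunks));
// console.log(await benchmarkEmbeddings("llama2", chunks));
```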

andrewnguonly commented 9 months ago

Got it, thanks for clarifying. I'm wondering if this approach in combination with some retrieval/search optimization could make a difference. I haven't looked into it too deeply yet though.

sublimator commented 9 months ago

retrieval/search optimization

I don't have any real experience with RAG yet, so I've "got nothing". I assume you meant something like keyword search to find the relevant chunks more quickly?

I wonder if you could develop some kind of special query syntax for that, shall we say, mode?

Which makes me further wonder if you'd ever use a combination of "classical" search techniques along with vector similarity?

One or the other, or both, and how that would inform said syntax.
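
To make that concrete (purely illustrative, not how Lumos does anything today): a hybrid score could just be a weighted blend of a naive keyword-overlap score and cosine similarity over the embeddings.

```ts
// Illustrative hybrid scoring: blend keyword overlap with cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function keywordScore(query: string, chunk: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const text = chunk.toLowerCase();
  const hits = terms.filter((term) => text.includes(term)).length;
  return terms.length > 0 ? hits / terms.length : 0;
}

function hybridScore(
  query: string,
  chunk: string,
  queryEmbedding: number[],
  chunkEmbedding: number[],
  alpha = 0.5, // weight between keyword and vector scores
): number {
  return (
    alpha * keywordScore(query, chunk) +
    (1 - alpha) * cosineSimilarity(queryEmbedding, chunkEmbedding)
  );
}
```

The syntax question then becomes whether the weighting (or a pure keyword mode) is something the user can hint at from the prompt itself.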

sublimator commented 9 months ago

The tricky thing about this, compared to "normal" RAG, is the desire (requirement?) for quick responses. Typically all the embedding is done well in advance, right? Other than shared embeddings (a non-trivial technical/political challenge) or keyword/stem-word search, I'm not sure what else you can do.

sublimator commented 9 months ago

https://ollama.com/library/nomic-embed-text
https://ollama.com/library/all-minilm
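
If either of these drops into the existing LangChain setup, it ought to be roughly a one-liner (import path below assumes the community OllamaEmbeddings wrapper):

```ts
// Sketch: dedicate a small model to embeddings via LangChain's Ollama wrapper,
// leaving the chat model as whatever is configured for generation.
import { OllamaEmbeddings } from "@langchain/community/embeddings/ollama";

const embeddings = new OllamaEmbeddings({
  model: "nomic-embed-text", // or "all-minilm"
  baseUrl: "http://localhost:11434",
});

// const vectors = await embeddings.embedDocuments(chunks);
// const queryVector = await embeddings.embedQuery("what did user x say?");
```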

sublimator commented 9 months ago


nomic flies!

andrewnguonly commented 9 months ago

I just gave nomic a quick test. Lightning fast! I'm tempted to just hardcode it (and fall back to the main model if it's not available). I'm hesitant to expose a separate configuration for the embedding model because of option fatigue. What do you think?

Which makes me further wonder if you'd ever use a combination of "classical" search techniques along with vector similarity?

Separately, I'm working on adding a "classical" keyword search (and hybrid search) to the RAG workflow. Check this out: https://github.com/andrewnguonly/Lumos/pull/101.

There will be a few other small improvements to the RAG implementation as well.
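
Going back to the hardcode-plus-fallback idea, I'm picturing something roughly like this (just a sketch; it checks Ollama's /api/tags listing of local models and falls back to the configured model if nomic isn't pulled):

```ts
// Sketch: prefer a hardcoded embedding model, but fall back to the main model
// if it isn't available locally. /api/tags lists the models Ollama has pulled.
const PREFERRED_EMBEDDING_MODEL = "nomic-embed-text";

async function resolveEmbeddingModel(mainModel: string): Promise<string> {
  try {
    const response = await fetch("http://localhost:11434/api/tags");
    const { models } = (await response.json()) as { models: { name: string }[] };
    const available = models.some((m) => m.name.startsWith(PREFERRED_EMBEDDING_MODEL));
    return available ? PREFERRED_EMBEDDING_MODEL : mainModel;
  } catch {
    return mainModel; // if Ollama isn't reachable, keep the configured model
  }
}
```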

sublimator commented 9 months ago

option fatigue.

You could just go with process.env to start with if that's a concern. That would allow folks to customize without needing to manage branches.

Ollama only has two embedding models at the moment, but presumably more will come later?
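
i.e. something roughly like this, inlined at build time by the bundler (variable name is just a suggestion):

```ts
// Sketch: pick the embedding model from a build-time env var, falling back to
// the main model. In an extension build, process.env.* is inlined by the
// bundler (e.g. webpack's DefinePlugin/EnvironmentPlugin).
function getEmbeddingModel(mainModel: string): string {
  return process.env.LUMOS_EMBEDDING_MODEL || mainModel;
}
```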

andrewnguonly commented 9 months ago

Here's an open PR with the functionality to switch the embedding model: https://github.com/andrewnguonly/Lumos/pull/105

After testing, I'm finding that it's actually quite slow to switch between models. Ollama only keeps one model in memory, so every prompt requires unloading the chat model, loading the embedding model, and then immediately swapping back. There's an open issue in the Ollama repo (https://github.com/ollama/ollama/issues/976) addressing this (and a few closed ones with workarounds). I'm not sure where this is on their priority list.

I'm not sure if I'll merge the PR. Net-net, it doesn't seem like a significant improvement to the user experience (yet).

sublimator commented 9 months ago

Disjoint musings:

  1. You could leave it in there, just disabled/hidden, if you want to save the work. More generally, I'm a fan of feature flags, since branches bit-rot.

  2. Given this is a developer tool, and people need to build it anyway, process.env.LUMOS_EMBEDDING_MODEL ought to suffice for people who just want to try out different embedding models.

  3. You could also call out to users to weigh in on the relevant Ollama issue.

sublimator commented 9 months ago

This is potentially relevant: https://github.com/ollama/ollama/pull/2848

andrewnguonly commented 8 months ago

Ollama v0.1.28 has a bug fix to stop Ollama from hanging when switching models. I'll test this out with my open PR.