Well it's not working very well with tinyllama, but regardless :)
Upping to Phi2 seemed a bit better fwiw
> Well it's not working very well with tinyllama, but regardless :)
Interesting idea. By "not working very well", do you mean the retrieval/search results were bad and resulted in a bad response overall?
Hi @andrewnguonly
I was just doing quick checks like "what did user x say?" on a page of comments and it wasn't getting things right. TBH, I'd have to compare results against a SoTA embedding/LLM pairing to get more calibrated expectations for specific queries like that.
In any case, I think being able to set the embedding model to a smaller model for responsiveness could be a useful thing.
250ms vs 100ms per chunk is substantial.
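As a rough illustration of what a separate, smaller embedding model could look like with LangChain.js's Ollama integrations (import paths differ between LangChain versions; the model names, baseUrl, and sample chunks below are placeholders, not Lumos's actual code):

```ts
// Sketch only: a dedicated, lighter embedding model next to the main chat model.
import { Ollama } from "@langchain/community/llms/ollama";
import { OllamaEmbeddings } from "@langchain/community/embeddings/ollama";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const baseUrl = "http://localhost:11434";

async function demo() {
  // The main model still handles generation...
  const llm = new Ollama({ baseUrl, model: "llama2" });
  // ...while a smaller model handles embeddings, so chunking a page stays responsive.
  const embeddings = new OllamaEmbeddings({ baseUrl, model: "tinyllama" });

  const store = await MemoryVectorStore.fromTexts(
    ["comment chunk one", "comment chunk two"],
    [{ id: 1 }, { id: 2 }],
    embeddings,
  );

  const docs = await store.similaritySearch("what did user x say?", 4);
  console.log(docs.map((d) => d.pageContent));
  console.log(await llm.invoke("Summarize the retrieved comments."));
}

demo().catch(console.error);
```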
Got it, thanks for clarifying. I'm wondering if this approach in combination with some retrieval/search optimization could make a difference. I haven't looked into it too deeply yet though.
> retrieval/search optimization
I don't have any real experience with RAG yet, so I've "got nothing". I assume you meant something more like keyword search to more quickly find the relevant chunks?
I wonder if you could develop some kind of special query syntax for that, shall we say, mode?
Which makes me further wonder if you'd ever use a combination of "classical" search techniques along with vector similarity?
One or the other, or both, and how that would inform said syntax.
The tricky thing about this, compared to "normal" RAG, is the desire (requirement?) for quick responses. Typically all the embedding is done well before, right? Other than shared embeddings (non-trivial technical/political challenge) or keyword/stem-word search, I'm not sure what you can do.
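To make the "classical + vector" idea concrete, here is a rough sketch of one simple combination: a cheap keyword pre-filter over the chunks, followed by vector similarity over the survivors. All names here are hypothetical; this is not Lumos's retrieval code:

```ts
// Hypothetical hybrid lookup: keyword filter first, then rank by cosine similarity.
interface Chunk {
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function hybridSearch(
  chunks: Chunk[],
  queryEmbedding: number[],
  keywords: string[],
  k = 4,
): Chunk[] {
  // Cheap keyword pre-filter (e.g. the "user x" in "what did user x say?").
  const filtered = chunks.filter((c) =>
    keywords.some((kw) => c.text.toLowerCase().includes(kw.toLowerCase())),
  );
  // Fall back to the full set if the filter is too aggressive.
  const candidates = filtered.length > 0 ? filtered : chunks;
  // Rank the remaining chunks by vector similarity and keep the top k.
  return candidates
    .map((c) => ({ c, score: cosine(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ c }) => c);
}
```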
nomic flies!
I just gave nomic a quick test. Lightning fast! I'm tempted to just hardcode it (and fall back to the main model if it's not available). I'm hesitant to expose a separate configuration for the embedding model because of option fatigue. What do you think?
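If hardcoding with a fallback is the route, one low-config way to do it would be to check which models are pulled via Ollama's GET /api/tags endpoint and only use nomic-embed-text when it's present. A rough sketch (the helper and constants are made up):

```ts
// Sketch of "hardcode it, fall back if missing": ask Ollama which models are pulled
// and only use the preferred embedding model if it's there.
const OLLAMA_BASE_URL = "http://localhost:11434";
const PREFERRED_EMBEDDING_MODEL = "nomic-embed-text";

async function pickEmbeddingModel(mainModel: string): Promise<string> {
  try {
    const res = await fetch(`${OLLAMA_BASE_URL}/api/tags`);
    const data = (await res.json()) as { models: { name: string }[] };
    const available = data.models.some((m) =>
      m.name.startsWith(PREFERRED_EMBEDDING_MODEL),
    );
    return available ? PREFERRED_EMBEDDING_MODEL : mainModel;
  } catch {
    // If Ollama can't be reached, just fall back to the main model.
    return mainModel;
  }
}
```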
> Which makes me further wonder if you'd ever use a combination of "classical" search techniques along with vector similarity?
Separately, I'm working on adding a "classical" keyword search (and hybrid search) to the RAG workflow. Check this out: https://github.com/andrewnguonly/Lumos/pull/101.
There will be a few other small improvements to the RAG implementation as well.
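For anyone following along, the PR has the actual implementation; one common way to produce a single "hybrid" ranking from a keyword ranking and a vector ranking is reciprocal rank fusion, roughly like this (generic sketch only, not the code in PR #101):

```ts
// Reciprocal rank fusion: merge two ranked lists of doc ids into one hybrid ranking.
function reciprocalRankFusion(
  keywordRanked: string[], // doc ids ordered by keyword (e.g. BM25) score
  vectorRanked: string[],  // doc ids ordered by vector similarity
  k = 60,                  // damping constant commonly used with RRF
): string[] {
  const scores = new Map<string, number>();
  for (const ranked of [keywordRanked, vectorRanked]) {
    ranked.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```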
> option fatigue.
You could just go with process.env to start with if that's a concern. It would allow folks to customize without needing to manage branches.
Ollama only has 2 embedding models atm, but more later?
Here's an open PR with the functionality to switch the embedding model: https://github.com/andrewnguonly/Lumos/pull/105
After testing, I'm finding that it's actually quite slow to switch between models. Ollama only keeps 1 model in memory, so every prompt requires unloading the main model, loading the embedding model for retrieval, and then immediately swapping back for generation. There's an open issue in the Ollama repo (https://github.com/ollama/ollama/issues/976) tracking this (and a few closed ones with workarounds). I'm not sure where it sits on their priority list.
I'm not sure if I'll merge the PR. Net-net, it doesn't seem like a significant improvement to the user experience (yet).
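A rough way to see that swap cost in isolation (outside the extension) is to time an embedding call against one model followed immediately by a generation call against another. /api/embeddings and /api/generate are Ollama's REST endpoints; the timing harness and model names here are just placeholders:

```ts
// Quick-and-dirty timing of the model swap: embed with one model, then generate with
// another. With only one model resident in memory, the second call pays the reload cost.
const BASE_URL = "http://localhost:11434";

async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  const result = await fn();
  console.log(`${label}: ${Date.now() - start}ms`);
  return result;
}

async function measureSwap(embeddingModel: string, mainModel: string) {
  await timed("embed", () =>
    fetch(`${BASE_URL}/api/embeddings`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: embeddingModel, prompt: "hello world" }),
    }).then((r) => r.json()),
  );
  await timed("generate", () =>
    fetch(`${BASE_URL}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: mainModel, prompt: "hello", stream: false }),
    }).then((r) => r.json()),
  );
}

measureSwap("nomic-embed-text", "llama2").catch(console.error);
```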
Disjoint musings:
- You could leave it in there, just disabled/hidden, I suppose, if you want to save the work. More generally, I'm a fan of feature flags, as branches bit rot.
- Given this is a developer tool, and people need to build it anyway, process.env.LUMOS_EMBEDDING_MODEL ought to suffice for people who just want to try out different embedding models (rough sketch below).
- You could also somehow call out to users to weigh in on the relevant Ollama issue.
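For illustration, a minimal sketch of that env-var escape hatch, assuming the extension's bundler (e.g. webpack's DefinePlugin/EnvironmentPlugin) inlines process.env values at build time; the default model name is a placeholder:

```ts
// Hypothetical helper: read LUMOS_EMBEDDING_MODEL at build time and fall back to the
// main model when it isn't set. Assumes the bundler inlines process.env values.
const DEFAULT_MODEL = "llama2"; // placeholder for whatever the main chat model is

export function getEmbeddingModel(): string {
  return process.env.LUMOS_EMBEDDING_MODEL ?? DEFAULT_MODEL;
}
```

A build would then look something like `LUMOS_EMBEDDING_MODEL=nomic-embed-text npm run build`.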
This is potentially relevant: https://github.com/ollama/ollama/pull/2848
Ollama v0.1.28 has a bug fix to stop Ollama from hanging when switching models. I'll test this out with my open PR.
Using tinyllama
Maybe you can "get away with" using a smaller model for quick embeddings to make things a bit more responsive?