feat: semantic search for large repos vector store toolkit

michaelneale commented 1 month ago

this is using sentence transformers and embeddings to create a simple vector database to allow semantic search of large codebases to help goose navigate around.

model info:

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- trying to this model instead as may be faster/leaner: https://huggingface.co/sentence-transformers/paraphrase-albert-small-v2

To test:

uv run goose session start --profile vector

with a ~/.config/goose/profiles.yaml with:

vector:
  provider: openai
  processor: gpt-4o
  accelerator: gpt-4o-mini
  moderator: truncate
  toolkits:
  - name: developer
    requires: {}
  - name: vector
    requires: {}

Then try some query to ask where to add a feature, or anything which you think needs a semantic match

lifeizhou-ap commented 1 month ago

I've tried a scenario with the toolkits with vector and without vector.

It seems the configuration with vector is more consistent and quicker to find the relevant files (although the first time it has to build the vector, the time is ok, not long). 👍

I saw a warning message below but I guess it should be fine? (since the vector is created by the code that the user provides)

goose/src/goose/toolkit/vector.py:115: FutureWarning: You are using `torch.load` with 
`weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct 
malicious pickle data which will execute arbitrary code during unpickling (See 
https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default 
value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. 
Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via 
`torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have
full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
data = torch.load(db_path)

michaelneale commented 1 month ago

@lifeizhou-ap thanks - yes good catch, it should only load weights so that warning should go away.

michaelneale commented 1 month ago

@lifeizhou-ap do you mind giving this a try again and see if it is as good as before for you?

baxen commented 1 month ago

Very excited to try this out!

To match the rest of how goose works, I think it makes sense if we delegate the embedding off to the provider. That's a bigger refactor, but then it avoids installing heavy dependencies with goose out of the box(torch, locally downloading a model). It might drive higher performance too, but would need to test that. What do you think?

lifeizhou-ap commented 1 month ago

@lifeizhou-ap do you mind giving this a try again and see if it is as good as before for you?

LGTM!

michaelneale commented 1 month ago

@baxen do you mean each provider has its own embeddings impl local to it? Would that gain much over having just one (as it is all local, and not provider specific) or do you mean lives in exchange alongside providers? (and they can offer their own if they want?). Just not sure what benefit would be? (I might be missing something) but I am sure is doable. Wouldn't this also still bring over the dependencies as the providers are bundled together (if in exchange?) - ie there is no "lazy loading" of dependencies (I think?)

michaelneale commented 1 month ago

@baxen according to goose:

So that is not small - unfortunately a optional dependency isn't really viable for a CLI?

michaelneale commented 1 month ago

going to have a lot at some lightweight options here, and failing that, I will make this an optional and validate that (and likely merge it after that point).

michaelneale commented 1 month ago

hey @baxen how does this look with optional deps now?

ahau-square commented 1 month ago

A few thoughts

Code embedding search seems like a promising direction to pursue
We should consider and test different chunking strategies - embedding code snippets e.g., classes/functions rather or in addition to whole code files to get more pinpointed search
Probably worth benchmarking and evaluating the embedding models against alternatives e.g., ones specifically for code (https://huggingface.co/Salesforce/codet5p-110m-embedding, https://huggingface.co/bigcode/starencoder)
Why limit to models that can be run locally vs. use hosted models like the OpenAI embeddings API or potentially others that Block hosts e.g., through the Databricks model gateway?
Is the future idea to eventually have a vector store of code embeddings for each repo and have them be updated on merge? Doing so might lend itself to a better experience of not having to wait for your embeddings to compute.
From a UX perspective - I don't know how useful identifying similar files on their own is - but similar files fed in as context to a ChatGPT/Claude for someone to then ask questions over or generate code based on could be very useful

michaelneale commented 1 month ago

@ahau-square

From a UX perspective - I don't know how useful identifying similar files on their own is - but similar files fed in as context to a ChatGPT/Claude for someone to then ask questions over or generate code based on could be very useful

That is exactly what this aims to do in a simple way - that is all that is needed (the toolkit isn't for end users to see - but to help goose find where to look which is then used as context).

I think future idea would be for embeddings to change (but they aren't meant to be search - so for relatively stable codebase isn't a huge deal). Could certainly run it with other models and approaches - but the idea of a toolkit is you can use it or not (but also would like to have something that is "batteries included" for goose - if it is this approach or another, as I think as it is it needs help to find code to work on).

michaelneale commented 1 month ago

this approach with local model(s) works quite well, but it is a hefty dependency addition to goose. Remote/server based embeddings and search is one option (but very specific to provider and probably more work to maintain across - not sure of exact benefit yet). Another approach is to use tools like rq but with fuzzy searching plus some pre-expansion of a question into related terms: like you search for "intellisense" - then the accelerator model could expand that to "content assist... completion" etc (as per users intent) and then do a more keyword like search for that (porter stemming would be the old way, but with accelerator models I think we can do better). Won't be as good for code specific comprehension though so I still like the idea of a local ephemeral embeddings/vector and indexing system or service.

michaelneale commented 2 weeks ago

@baxen I can't work out how optional deps work with UV (they used to work - but not there now).

michaelneale commented 1 week ago

I am going to close this for now - but keep the branch around

block / goose

feat: semantic search for large repos vector store toolkit #23