continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

Few questions about /codebase command #690

Closed abhinavkulkarni closed 9 months ago

abhinavkulkarni commented 9 months ago

Hey @sestinj,

Thanks for the great work!

I recently loaded the continuedev/continue project itself in VS Code and tried the /codebase command on it with a few sample queries. I have a few questions:

I tried to find details in the documentation, but couldn't find much.

  1. The doc continue/docs/docs/walkthroughs/codebase-embeddings.md says that embeddings are stored in ~/.continue/embeddings, but I believe they are actually stored in ~/.continue/index/chroma/; the ~/.continue/embeddings folder on my machine is empty.

  2. I opened the above db using the chromadb Python package and collection.count() returns 5. How can the count be only 5 for the whole codebase?

    
    from pathlib import Path
    import chromadb

    chroma_client = chromadb.PersistentClient(
        Path("~/.continue/index/chroma/default/chroma").expanduser().as_posix()
    )
    collection_name = "chroma-default"
    collection = chroma_client.get_collection(name=collection_name)
    collection.count()


3. I had a temporary file called `debug.ipynb` which I then deleted, yet it still shows up in the results of the `/codebase` command. How can that be the case?

https://github.com/continuedev/continue/assets/1565547/2903e25e-aee1-4097-ad98-7d8812723b4e

4. When are these embeddings calculated? Upon invocation of the first `/codebase` command?

5. If I change a file and save it, are these embeddings recalculated?

6. Is there a way for me to specify a different embedding model other than `all-MiniLM-L6-v2`?

Thanks!

sestinj commented 9 months ago
  1. We now store the embeddings in the index folder; the embeddings folder was used previously.
  2. It should be indexing all of the files in your workspace, so a count of only 5 documents is likely a bug. If you saw more than 5 documents in the response, contradicting that count, they may have come from Meilisearch, which we also use to retrieve documents via exact keyword search. You'd probably be seeing poor results if that's what's happening.
  3. I haven't yet added the code to remove/add files from the index when a file is deleted or added. Currently we re-index upon window reload.
  4. Upon window reload. Indexing is done incrementally, so the first time you open a workspace it will take longer, but thereafter it should take less than a second.
  5. Right now only upon window reload; we'll change this the next time we work on embeddings.
  6. The option we currently have is to use OpenAI's ada embeddings: https://continue.dev/docs/walkthroughs/codebase-embeddings. Support for other local embedding models is planned.

There's a short explanation of how codebase indexing happens incrementally here, and the rest of the indexing code lives here, with build_index.py being the entrypoint.
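The incremental approach described above can be sketched with content hashes. This is an illustrative example, not Continue's actual implementation; the `index_cache.json` cache file, the `files_to_reindex` helper, and the `*.py` filter are all assumptions made for the sketch:

```python
# Sketch of incremental indexing via content hashes (illustrative only).
# Only files whose hash changed since the last run are returned for
# re-embedding, so the first pass is slow and later passes are fast.
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("index_cache.json")  # hypothetical cache location


def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_reindex(root: Path) -> list[Path]:
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    changed = []
    for path in root.rglob("*.py"):
        h = file_hash(path)
        if cache.get(str(path)) != h:
            changed.append(path)
            cache[str(path)] = h
    CACHE_FILE.write_text(json.dumps(cache))
    return changed
```

On the first run every file is returned for embedding; on subsequent runs only files whose contents changed since the cached hash are returned.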

sestinj commented 9 months ago

We're working on transitioning away from the Python server, so for a short time we won't be focusing on indexing, but we plan to really get these things right soon after.

abhinavkulkarni commented 9 months ago

Thanks @sestinj,

In your answer to #2, you mentioned that despite there being only 5 embeddings in ChromaDB, relevant results were returned from Meilisearch. How does that work? Do you run embedding search and keyword search simultaneously to retrieve relevant chunks for RAG?
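For context, hybrid retrieval of this kind is often implemented by running both searches and merging the two ranked lists, for example with reciprocal rank fusion (RRF). A minimal sketch, assuming each search returns document IDs in ranked order; this is illustrative and not Continue's actual code:

```python
# Illustrative reciprocal rank fusion (RRF): merge an embedding-search
# ranking with a keyword-search ranking into a single ranked list.
def rrf_merge(embedding_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for hits in (embedding_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            # Documents ranked highly by either retriever get a larger share.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists outranks documents found by only one retriever, which is why a small vector store can still contribute useful results alongside keyword search.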