fynnfluegge / codeqai

Local first semantic code search and chat powered by vector embeddings and LLMs
Apache License 2.0
385 stars, 46 forks

feat: sync latest git changes with vector db #2

Closed: fynnfluegge closed this issue 8 months ago

fynnfluegge commented 11 months ago

Currently, the vector database is built once from all files in the project/git repository. It should be possible to update the vector database based on the latest git changes. Only vectors that belong to a file that has changed since the last vector creation should be updated.

Possible solution: For every file, save the commit hash of its last change in the cache. On sync, compare the saved commit hash with the file's current commit hash. If the hashes differ, delete every vector related to that file from the database and insert the new vectors. Save the filename in the vector metadata and query by filename if possible. If that is not possible, maintain for every file a list of FAISS IDs; when the file has a new git hash, delete those IDs from FAISS and insert the new vectors.
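The comparison step could look roughly like the sketch below. It is only an illustration under assumptions: `last_commit_hash` and `files_to_resync` are hypothetical helper names, not codeqai functions, and the per-file hash is read with `git log -n 1 --pretty=format:%H -- <path>`. Uncommitted edits are not detected this way, because only commit hashes are compared.

```python
import subprocess
from typing import Dict, List, Optional


def last_commit_hash(path: str, repo_dir: str = ".") -> Optional[str]:
    """Return the hash of the last commit that touched `path`.

    Returns None when git has no commit for the file (e.g. untracked).
    Hypothetical helper, not part of codeqai.
    """
    result = subprocess.run(
        ["git", "log", "-n", "1", "--pretty=format:%H", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() or None


def files_to_resync(
    cache: Dict[str, dict], current_hashes: Dict[str, Optional[str]]
) -> List[str]:
    """List files whose current commit hash differs from the cached one.

    Files that are committed but missing from the cache are included,
    since their cached hash is treated as None.
    """
    return sorted(
        path
        for path, commit in current_hashes.items()
        if cache.get(path, {}).get("commit_hash") != commit
    )
```

In practice `current_hashes` would be built by calling `last_commit_hash` for every indexed file; keeping the comparison in a separate pure function makes it testable without a git repository.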

Cache JSON:

{
  "filename":
  {
    "vector_ids": [1,2,3,4,5],
    "commit_hash": "artkekrx023"
  }
}

https://github.com/langchain-ai/langchain/issues/2699#issuecomment-1618163649
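A minimal sketch of the fallback variant, where the cache tracks vector IDs per file, might look like this. Everything here is an assumption for illustration: `VectorStore`, `InMemoryStore`, and `resync_file` are hypothetical names, and the store interface is a stand-in, not the FAISS or LangChain API. The cache entries follow the JSON layout shown above.

```python
import json
from typing import Dict, List, Protocol


class VectorStore(Protocol):
    """Hypothetical minimal interface; not the FAISS or LangChain API."""

    def delete(self, ids: List[int]) -> None: ...
    def add(self, chunks: List[str]) -> List[int]: ...


class InMemoryStore:
    """Stand-in store so the sketch runs without FAISS installed."""

    def __init__(self) -> None:
        self.vectors: Dict[int, str] = {}
        self._next_id = 0

    def delete(self, ids: List[int]) -> None:
        for vector_id in ids:
            self.vectors.pop(vector_id, None)

    def add(self, chunks: List[str]) -> List[int]:
        ids = []
        for chunk in chunks:
            self.vectors[self._next_id] = chunk
            ids.append(self._next_id)
            self._next_id += 1
        return ids


def resync_file(
    cache: Dict[str, dict],
    store: VectorStore,
    filename: str,
    commit_hash: str,
    chunks: List[str],
) -> bool:
    """Replace a file's vectors if its commit hash changed; return True if updated."""
    entry = cache.get(filename)
    if entry and entry["commit_hash"] == commit_hash:
        return False  # File unchanged since the last sync.
    if entry:
        store.delete(entry["vector_ids"])  # Drop the stale vectors first.
    cache[filename] = {"vector_ids": store.add(chunks), "commit_hash": commit_hash}
    return True


# Usage: index a file, then resync it after a new commit.
store = InMemoryStore()
cache: Dict[str, dict] = {}
resync_file(cache, store, "app.py", "a1b2c3", ["def main(): ..."])
resync_file(cache, store, "app.py", "d4e5f6", ["def main(): ...", "def helper(): ..."])
print(json.dumps(cache, indent=2))
```

After the second call the cache entry for `app.py` holds only the IDs of the newly inserted vectors, and the old vector is gone from the store. The cache dictionary can be written to disk with `json.dump` between runs.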