Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0
15.54k stars 1.19k forks

Add Kùzu graph database as a graph store #134

Open prrao87 opened 2 months ago

prrao87 commented 2 months ago

Hi,

This is a great project, looking forward to trying it out more and experimenting with some workflows!

I'm looking at the parts of the code that use NetworkX for graph retrieval and/or visualization, and I think the existing use of Pandas DataFrames makes this workflow very amenable to Kùzu, an embedded graph database that's very similar to DuckDB and LanceDB in philosophy (easy to deploy and fully open source). A graph database with persistence and durability guarantees is preferable to an in-memory graph library like NetworkX. And the fact that Kùzu is embedded and open source makes it that much simpler and more user-friendly, in the same way that LanceDB and DuckDB are.

It's trivial to read data into a Kùzu graph via Pandas, as described here. The main benefit of Kùzu over NetworkX, imo, is that it scales very well to data that doesn't fit in memory, and it's the perfect complement to LanceDB for users who already know that database from a vector search perspective.
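As a rough sketch of the Pandas route, assuming the `kuzu` and `pandas` packages are installed: the table names `Entity` and `RELATED_TO`, the column names, and the sample rows below are all made up for illustration and are not taken from kotaemon's code. The DataFrame columns need to line up with the order of the table's properties.

```python
# Hypothetical sample data; kotaemon's real entity/relationship frames will differ.
ENTITIES = [
    {"name": "Kùzu", "category": "graph database"},
    {"name": "LanceDB", "category": "vector database"},
    {"name": "NetworkX", "category": "graph library"},
]
RELATIONSHIPS = [
    {"source": "Kùzu", "target": "NetworkX", "weight": 1.0},
    {"source": "Kùzu", "target": "LanceDB", "weight": 0.5},
]


def load_into_kuzu(db_path, entities, relationships):
    """Persist entity and relationship rows in an on-disk Kùzu database."""
    # Imported lazily so this module still loads if the optional deps are absent.
    import kuzu
    import pandas as pd

    # Column order matters: it must match the property order of each table.
    entity_df = pd.DataFrame(entities, columns=["name", "category"])
    rel_df = pd.DataFrame(relationships, columns=["source", "target", "weight"])

    db = kuzu.Database(db_path)  # created on disk if it does not exist yet
    conn = kuzu.Connection(db)

    # Illustrative schema (an assumption, not kotaemon's actual schema).
    conn.execute(
        "CREATE NODE TABLE Entity(name STRING, category STRING, PRIMARY KEY (name))"
    )
    conn.execute("CREATE REL TABLE RELATED_TO(FROM Entity TO Entity, weight DOUBLE)")

    # Kùzu resolves the DataFrame variables by name from the calling scope.
    conn.execute("COPY Entity FROM entity_df")
    conn.execute("COPY RELATED_TO FROM rel_df")
    return conn
```

Calling `load_into_kuzu("./graph_db", ENTITIES, RELATIONSHIPS)` would leave a database at that path that survives process restarts, which is the main difference from keeping the graph only in a NetworkX object.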

Additionally, it's also trivial to convert a Kùzu graph into a networkx Graph or DiGraph object, which can be used for all downstream workflows that require networkx objects.
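A minimal sketch of that conversion, assuming an open `kuzu.Connection` and a query that returns nodes and relationships. The helper names `fetch_as_networkx` and `top_entities_by_degree` are hypothetical, and depending on the Kùzu version the returned object may be a multigraph variant of the NetworkX classes.

```python
def fetch_as_networkx(conn, query, directed=True):
    """Run a Cypher query on a Kùzu connection and return a NetworkX graph.

    `conn` is assumed to be a kuzu.Connection; the query should return
    nodes and relationships, e.g. "MATCH (a)-[r]->(b) RETURN a, r, b".
    """
    result = conn.execute(query)
    # get_as_networkx needs the networkx package to be installed.
    return result.get_as_networkx(directed=directed)


def top_entities_by_degree(graph, k=3):
    """Return the k (node, degree) pairs with the highest degree.

    Works on any object exposing a NetworkX-style `.degree` iterable,
    so existing downstream code built on NetworkX objects keeps working.
    """
    return sorted(graph.degree, key=lambda pair: pair[1], reverse=True)[:k]
```

The idea is that Kùzu handles storage and querying, while only the subgraph a query returns gets materialized in memory for NetworkX-based retrieval or visualization.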

The Microsoft GraphRAG repo also uses LanceDB as its default vector store, and the reason Kùzu isn't used there (they also rely on NetworkX for their graph computations) is that, at the time they wrote their code, Kùzu wasn't well known enough. I think that's changing, as Kùzu is becoming more and more popular (disclaimer: I work at Kùzu).

I just wanted to create this issue so that this could get on the roadmap, and I'd be happy to try out the framework more and offer my input as this project grows. Cheers!

taprosoft commented 2 months ago

Looks interesting enough. LanceDB was MS's choice for simplicity, I suppose. We will consider adding this to the roadmap.