aplbrain / grand

Your favorite Python graph libraries, scalable and interoperable. Graph databases in memory, and familiar graph APIs for cloud databases.
Apache License 2.0

Using SQLBackend: how to connect to an existing SQL DB file and read the graph into memory #19

Closed yangyxt closed 2 years ago

yangyxt commented 3 years ago

As shown in the title.

j6k4m8 commented 3 years ago

If you have already created a graph on disk with sqlite, you can read it like this (assuming your database file is called example-graph.sqlite):

import grand
from grand.backends import SQLBackend

# Point the backend at the existing sqlite file; directed=True should match
# how the graph was originally created.
G = grand.Graph(
    backend=SQLBackend(db_url="sqlite:///example-graph.sqlite", directed=True)
)

You can now interact with the graph through whatever dialect you like. If you want to use NetworkX-style APIs, you can list edges like this:

G.nx.edges()

Unfortunately, there isn't currently a great way to export and import graphs between Backends, as this can be a super expensive operation. If your graph is relatively small, you can export nodes and edges individually into, say, a NetworkX object, like this:

import networkx as nx

H = nx.DiGraph()

# Copy every node, with its attributes, into the in-memory graph.
for node, attrs in G.nx.nodes(data=True):
    H.add_node(node, **attrs)

# Do the same for every edge.
for u, v, attrs in G.nx.edges(data=True):
    H.add_edge(u, v, **attrs)

Then you'll have access to all the same nodes in an in-memory NetworkX object:

H.nodes(data=True)

Does this answer your question?

To get you started if you don't have a sqlite graph file yet, you can create the example-graph.sqlite like this:

G = grand.Graph(backend=SQLBackend(db_url="sqlite:///example-graph.sqlite", directed=True))
G.nx.add_edge("X", "Y")

j6k4m8 commented 3 years ago

Hey @yangyxt — does this answer your question?

AbdealiLoKo commented 2 years ago

I have a similar question I'm thinking through. I have a custom schema (with ~10 tables and multiple relationships between the tables) already present in my RDBMS. I want to create a graph from that schema and run Cypher queries against it efficiently.

Would I be able to use grand with grand-cypher for that? https://github.com/aplbrain/grand-cypher

Or will it be inefficient due to the way RDBMS queries work?

j6k4m8 commented 2 years ago

Hi @AbdealiJK! That's an interesting use-case, and I think that Grand should work for you, though I'm not sure we can make any efficiency guarantees, since every "edge" in your query will be converted to its own query. One issue you'll likely encounter is that Grand currently assumes ONE table for nodes and ONE table for edges, but we might be able to get around that if you have a more-or-less consistent schema between the "nodelike" tables, so that we can form a virtual nodes table out of a join (see the sketch below).
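
To make the "virtual nodes table" idea concrete, here is a rough sketch of unioning several nodelike tables into a single SQL view. None of this is Grand API, and the table and column names (people, products, all_nodes) are made-up placeholders:

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///my-existing.db")  # or your RDBMS URL

with engine.begin() as conn:
    # "people" and "products" stand in for your real nodelike tables.
    conn.execute(text("""
        CREATE VIEW IF NOT EXISTS all_nodes AS
        SELECT id AS node_id, 'person' AS node_type FROM people
        UNION ALL
        SELECT id AS node_id, 'product' AS node_type FROM products
    """))

Whether the SQLBackend can then be pointed at such a view depends on whether its table names can be overridden, so treat this as a starting point for experimentation rather than a supported path.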

By way of excessive explanation: The reason we don't currently support this is that not all Backends support the idea of multiple node "types," and practically no Dialects do either. So there's nothing blocking this technically, it's just not the original use-case! :)

If you're interested in playing around with it, I'd be happy to help debug with you!

AbdealiLoKo commented 2 years ago

Yeah, I'm planning on trying it out over the weekend. But the main thing I'm concerned about is the efficiency :( Because if we're fetching each edge with a SQL query, that will not go well ...

I guess the two things I'm thinking about in this regard are:

My use cases are query-only, i.e. I will never modify my graph through networkx/other dialects.

But I really like this idea. I was looking for a SQLAlchemy for graphs, and grand (while still young) seems to be moving in that direction. I will play around with it more and contribute if possible.

yangyxt commented 2 years ago

Hey @yangyxt — does this answer your question?

Yes, thank you! And sorry for the late response! Regarding exporting graphs, I chose to convert the database to text-based file objects to store the graph.

j6k4m8 commented 2 years ago

Awesome @yangyxt — hope you're able to use Grand in the future :)

@AbdealiJK: Great, let me know what you discover! I haven't really kicked the tires extensively on the SQL Backend in conjunction with Grand-Cypher, but in a few benchmarks it seemed pretty performant even on relatively large networks. (Not comparable with a "true" neo4j cypher query, but... not embarrassing, at least :) )

When I say "for each edge," that's each edge in the query, so if your Cypher query looks like

MATCH (A)-[]->(B)
RETURN A, B

This runs only a single table query. Still not ideal, but with some indexing it is probably pretty snappy.
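
For reference, here is a minimal sketch of running that query through grand-cypher. It assumes the GrandCypher(graph).run(...) interface from the grand-cypher README and the SQL-backed G from earlier in this thread; double-check the current API before relying on it:

from grandcypher import GrandCypher

# Run the Cypher query against the graph's NetworkX-style dialect.
results = GrandCypher(G.nx).run("""
MATCH (A)-[]->(B)
RETURN A, B
""")
print(results)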

I've also just published one of my backlog to-dos to add a caching layer to the Backends (#20); I think that would probably dramatically increase the speed of query responses if you can make assumptions about a static dataset.
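
Purely to illustrate the idea (this is not the design in #20, and get_node_neighbors is a hypothetical method name, not Grand's actual backend API): a static dataset lets you memoize a backend's read methods so that repeated traversals hit memory instead of SQL.

import functools

def cache_reads(backend, method_names=("get_node_neighbors",)):
    # Wrap each named read method in an unbounded memo cache. This is only
    # safe if the underlying data never changes while the cache is alive.
    for name in method_names:
        original = getattr(backend, name)
        setattr(backend, name, functools.lru_cache(maxsize=None)(original))
    return backend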

I'd be super grateful and excited to see your contributions and evaluation! There's a lot of nuance to a generalized graph toolkit like Grand but I'm hopeful we're converging on something Good :)

j6k4m8 commented 2 years ago

Hey @AbdealiJK — curious to hear how your experiments went! Would be interested in hearing if you had to make any major changes to get things working.

AbdealiLoKo commented 2 years ago

Hi @j6k4m8, apologies, I missed replying to your previous question. I essentially had to create a new backend which is, in effect, a read-only backend. In the read-only backend, I am making around 10-20 different SQL queries to pull all the relationships I need from sqlalchemy into Python, and I only pull the IDs. Once I have this info, I'm creating a huge graph and then using it for query operations.
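
A minimal sketch of that pattern, with made-up table, column, and connection names standing in for the real schema:

import networkx as nx
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host/mydb")  # your RDBMS URL
G = nx.DiGraph()

with engine.connect() as conn:
    # One query per relationship table, fetching only the ID pairs...
    rows = conn.execute(text("SELECT order_id, product_id FROM order_items"))
    G.add_edges_from(rows)
    # ...repeated for each of the other relationship tables.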

Differences here are:

j6k4m8 commented 2 years ago

Aha, that's super interesting @AbdealiJK — thank you for the update! If you think there are any good lessons-learned that we could integrate into the package, I'd love to help!

AbdealiLoKo commented 2 years ago

I think the main one for me was that a single query which creates my entire graph was much faster than I expected. Even if I am querying just a small portion of my graph, I currently create the entire graph, because I don't want to worry about the number of queries I have to run in case I have a deep subgraph. If there were some way to get the best of both, i.e. query a part and cache it, that would be cool too. But I feel like that will always be limited to iterative approaches because of the way the SQL has to be structured, so it seems like a memory-compute tradeoff.

I would also suggest having some hooks in grand so I can customize the SQL structure I have in an existing DB. Currently, Grand wants to manage my SQL DB for me in its own efficient way, but my use case is more like: I want to create a graph VIEW on top of my SQL DB.