dgraph-io / dgraph

The high-performance database for modern applications
https://dgraph.io

[Feature request] Add Graph Deep Learning Capabilities #4608

Closed marvin-hansen closed 4 years ago

marvin-hansen commented 4 years ago

Experience Report

Currently, I am building a unified graph that converges data, compute, and machine learning. On the data side, I use Postgres and a graph database. I am still working on making Dgraph work, but that's a very different story. On the compute and machine-learning side, everything integrates through web services. On the integration layer, all data and web services are queried, accessed, and mutated through a master GraphQL layer.

It works, but when I recently integrated another machine-learning use case, I ended up sending a bunch of data back and forth, and that made me think about a better way.

What you wanted to do

I have been in a similar situation a few times before, meaning I move data around, which isn't exactly great, because either the processing cannot be done where the data is, or the data isn't where the processing happens. Either way, this is stupid, and Hadoop really isn't the answer either.

What you actually did

Eventually, I did what everyone else would have done: load the data, send it to the ML service, process it, and store the results back in the datastore. Simple.

Why that wasn't great, with examples

There are so many problems:

1) No data/compute locality
2) Unnecessary traffic
3) Lousy latency
4) Real-time gets harder due to another network hop
5) Processing larger datasets, well, isn't fun, precisely because of the implied data loading.

What is the main idea?

One of the most intriguing properties of Dgraph is the context locality that naturally follows from predicate sharding. When predicates P1 ... Pn reside on node X, then all queries against predicates on that node are, by definition, independent of all predicates on all other nodes and can therefore be executed in parallel. OK, we all know that.

However, it also follows that any algorithm operating on predicates is parallelizable by default and can therefore be moved to the node where the data is located, traverse the corresponding sub-graph there, and compute the result in place. That is data parallelism in its purest form.
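To make this concrete, here is a minimal sketch of per-shard data parallelism, with a plain dict standing in for predicate shards and a made-up `process_shard` algorithm; it is illustrative only and says nothing about Dgraph's actual storage layout:

```python
from multiprocessing import Pool

# Hypothetical stand-in: each entry holds the edges of one predicate,
# the way a Dgraph group holds the data for the predicates assigned to it.
SHARDS = {
    "follows": [("alice", "bob"), ("bob", "carol")],
    "likes": [("alice", "post1"), ("carol", "post2")],
}

def process_shard(item):
    """Run a graph algorithm on one predicate shard, entirely locally.

    Here: a trivial per-shard aggregate (out-degree per subject).
    """
    predicate, edges = item
    degree = {}
    for subject, _obj in edges:
        degree[subject] = degree.get(subject, 0) + 1
    return predicate, degree

if __name__ == "__main__":
    # The shards are independent by construction, so they can be processed
    # in parallel with zero cross-shard coordination.
    with Pool() as pool:
        for predicate, degree in pool.map(process_shard, SHARDS.items()):
            print(predicate, degree)
```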

What would be a truly great way of doing this?

The greatest way I can think of would be to process data where it is stored, by adding a third kind of node to Dgraph that hosts both data and ML algorithms. Due to the distributed nature, I think aggregated real-time result streams might even be possible.

It requires a third kind of node to avoid adding more load to the rest of the system, so that normal queries remain unaffected performance-wise. The alpha nodes would keep serving queries and mutations as usual and dispatch ML workloads to the ML nodes. Plus, an ML instance could be pinned to a high-spec or GPU machine. With the current work on the GraphQL endpoint, the ML algorithm could then be exposed and parametrized through a custom resolver.

More details & considerations

This idea is based on my current practice of integrating ML services into the unified graph by taking only a start and a stop ID as parameters and letting the custom resolver query all the required data before passing it to the ML server. It does work in practice, but with the caveat that there is a limit to how much data you can query and send over the network before something slows down. Obviously, this works well while the sub-graph remains reasonably small, but it breaks down on a large graph. At some point, the network isn't your friend anymore.
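For illustration, that current pattern looks roughly like this; the `follows` predicate, the query shape, and the ML endpoint `http://ml-service/score` are hypothetical, while the client calls are from the official pydgraph library:

```python
import json

import pydgraph
import requests

def score_subgraph(start_uid, stop_uid):
    """Fetch a bounded sub-graph from Dgraph and hand it to an external ML service."""
    stub = pydgraph.DgraphClientStub("localhost:9080")
    client = pydgraph.DgraphClient(stub)

    # Pull the sub-graph reachable from the start UID (illustrative query).
    query = """
    query subgraph($start: string) {
      subgraph(func: uid($start)) {
        uid
        follows { uid }
      }
    }
    """
    txn = client.txn(read_only=True)
    try:
        res = txn.query(query, variables={"$start": start_uid})
        data = json.loads(res.json)
    finally:
        txn.discard()
    stub.close()

    # The whole sub-graph crosses the network twice: once out of Dgraph,
    # once into the ML service. That is exactly the cost described above.
    return requests.post("http://ml-service/score",
                         json={"graph": data, "stop": stop_uid}).json()
```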

The number of usable algorithms suitable for a large graph is relatively low, but those that exist solve really important problems, among them community prediction, graph structure classification, and predicting shifts in structural (im)balance.

Specifically, the algorithms in DGL are a terrific contender for these tasks because they can handle arbitrary graph sizes through either batch loading or message passing, and DGL already supports multi-processing and distributed training out of the box. Obviously, loading a large graph out of a distributed Dgraph cluster and shuffling the data to a DGL cluster, just to distribute it yet again for processing, sounds as stupid as it actually is. And in many ways, that is how it's done in practice.
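For reference, a minimal DGL message-passing step in the style of the linked first-steps tutorial, on toy data and with the DGL API as of the 0.4 era; under the proposal above, this kind of computation would run next to the shard rather than on exported data:

```python
import dgl
import dgl.function as fn
import torch

# Toy graph: 4 nodes, a few directed edges.
g = dgl.DGLGraph()
g.add_nodes(4)
g.add_edges([0, 1, 2, 2], [1, 2, 3, 0])

# Attach a feature vector to every node.
g.ndata["h"] = torch.randn(4, 8)

# One round of message passing: every node sums the features of its
# in-neighbors. This primitive only ever touches local edges, which is
# what makes it a candidate for running where the data lives.
g.update_all(message_func=fn.copy_src(src="h", out="m"),
             reduce_func=fn.sum(msg="m", out="h"))

print(g.ndata["h"].shape)  # torch.Size([4, 8])
```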

I do not believe that this kind of data shuffling between distributed data storage and distributed data processing is sustainable. And the leading DB vendors have known that for some time.

Oracle absolutely nailed its in-database machine learning precisely because, at some point, it is actually easier to move the headquarters than to move the data out of the data warehouse. Even the folks at Neo4j got this simple message and started adding ML capabilities.

You simply cannot load and transfer humongous amounts of data anymore, so you have to process the data where it lives, and that is exactly why in-database machine learning will only grow in importance.

Doing so, however, remains a prevalent pain point with no truly great solution for graph data.

Any external references to support your case

https://docs.dgl.ai/api/python/graph_store.html

https://docs.dgl.ai/tutorials/basics/1_first.html

https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf

shekarm commented 4 years ago

Thank you for the question and the detailed explanation. This seems like a feature request that would permit running ML algorithms in a distributed manner instead of running a big query and supplying the results to a node running ML code. This is a good idea, and I would like to solicit votes from the community and prioritize it appropriately in our backlog.

hackintoshrao commented 4 years ago

Hey @marvin-hansen ,

Thanks for the detailed explanation. For sure, I agree that moving all the data through the network from storage nodes to compute nodes is not scalable. Moving the compute to the storage nodes is the way to go.

Yes, adding ML compute capabilities on top of Dgraph's storage is definitely a useful feature. But before we delve into the design details of supporting ML algorithms on top of Dgraph, the question we ask ourselves is: are we ready to integrate machine-learning capabilities into the product?

I'm afraid the answer is not yet. Our priority this year is to add the must-have features for a graph database and to improve the performance, stability, and overall experience of the product. Once we get to a sweet spot, we'll focus on unveiling specific advanced use cases on Dgraph, and machine learning is definitely a compelling one for a graph database.

On a different note, if you look at the landscape of deep-learning pipelines, there are popular patterns in which the storage and compute layers are separated, with Amazon S3 or S3-like storage systems (minio.io) increasingly becoming the popular choice for the storage layer.

For instance: Apache Spark + S3, TensorFlow distributed + S3, Presto + S3, and so on.
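A minimal sketch of that separated pattern; the bucket, key, and MinIO endpoint are hypothetical, with boto3 standing in for whatever S3 client the compute framework uses:

```python
import io

import boto3
import numpy as np

# The compute layer pulls training data from the storage layer (S3/MinIO)
# over the network: storage and compute scale independently, at the cost
# of shipping the dataset on every run.
s3 = boto3.client("s3", endpoint_url="http://minio:9000")  # endpoint_url only for MinIO
obj = s3.get_object(Bucket="training-data", Key="features.npy")
features = np.load(io.BytesIO(obj["Body"].read()))
print(features.shape)
```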

marvin-hansen commented 4 years ago

Thank you @hackintoshrao

That's okay and accepted. I still have a Dgraph instance running for operations, and I hate it from A to Z because every time I debug the data flow, I eventually end up in yet another Dgraph issue.

I sincerely wish somebody at Dgraph would stop publishing all the pointless fluff about fuzzy search on tweets and start listening to how people actually use Dgraph and why they use it that way.

manishrjain commented 4 years ago

Hey @marvin-hansen ,

Can you expand on what's making you "hate" Dgraph so much? What Dgraph issues did you run into that soured your experience? I'm the founder and you have my attention; I'm listening.

marvin-hansen commented 4 years ago

Hey @manishrjain

For the issues, search the issue tracker to figure out what's broken and fix it.

In my evaluation for selecting a database, Dgraph scored the worst in terms of total time to pre-production, and I simply hate it from A to Z when something wastes my time this badly. Yes, I made it work, but for sure, it was the worst experience possible.

We only keep Dgraph in operations because of its GraphQL endpoint, which seamlessly integrates into our unified graph; that's the only reason I can think of.

When your thing deploys and scales reliably to a trillion predicates, gets to 100 bn ops/sec, and comes with proper LDAP integration, we can have a serious conversation.

=== Deployment evaluation.

1) Neo4J - Deployed within 1 hour, secured, and ready for development & testing.

2) RestDB - Deployed, test data imported and cleared for development within 1 hour.

3) TigerGraph ~ 20 - 30 minutes to testing, skipped data migration but technically easy.

4) FaunaDB ~ 25 minutes to setup & import GraphQL schema & data.

5) ArangoDB ~1h deployed & ready to develop. Skipped data migration

6) PostgreSQL - 25 minutes, all data migrated, and in production(!)

7) DGraph 1.x - Ten days(!)

Comment on Dgraph 1.x: reported multiple issues, K8s configuration broken, no proper Helm charts, no LDAP/AD integration, required an additional API gateway to secure remote access, data migration requires file conversion to JSON(!). Who do these guys think they are??? Impossible to recommend.

manishrjain commented 4 years ago

When your thing deploys and scales reliably to a trillion predicates, gets to 100 bn ops/sec, and comes with proper LDAP integration, we can have a serious conversation.

100 bn ops/second?! If any of the other DBs mentioned here (two of them, Neo4j and Postgres, don't even scale horizontally) can give you such a number, then you should absolutely go ahead with them.

Lack of LDAP integration isn't something I'd consider "broken". It's a feature request, at best. Data migration requiring JSON -- is that a problem? I don't see it. What format would you use? CSV/XML?

Our K8s configuration is already being used by Fortune 500 companies in production -- I'm sure we could improve it, but I find it hard to believe your claims about them being "broken" -- Broken would mean K8s configs don't work, period.

Sorry Dgraph doesn't work for you -- Good luck with your DB search.

marvin-hansen commented 4 years ago

https://tech.marksblogg.com/benchmarks.html