cayleygraph / cayley

An open-source graph database
https://cayley.io
Apache License 2.0

Performance, scaling and databases #716

Open steffansluis opened 6 years ago

steffansluis commented 6 years ago

I'm aiming to process a lot of data as quickly as possible, and I'm finding that the connection limit of the Postgres instance backing my Cayley is the bottleneck. I started thinking: I can scale up, but maybe there is a bigger issue. It got me wondering about the general performance of Cayley, in particular how it scales with different databases backing it. I looked around a bit and read (skimmed) this.

I feel like it should be doable to set up performance testing across different use cases and backends. It can be done in phases: general performance testing first, backend-specific testing later. My first thought would be to use CircleCI 2.0 here because of its affinity with Docker. I'd be happy to help out, although I cannot pledge any time for now. In any case, I'm curious about your thoughts :smiley:.

dennwc commented 6 years ago

First of all, yes, per-database benchmarks are something we definitely want to have at some point, although using a CI service might distort results significantly since it runs on shared hardware.

In any case, your point about scaling is correct, and it breaks down into three main factors:

1. Does the backend DB itself scale well? This is not the bottleneck in most cases, although we might hit limits on Bolt and Postgres sometimes.
2. Do we use the transactions/queries of that DB optimally in the meta-backend driver?
3. How fast is our graph layer on top of that specific meta-backend?

Currently, we have only 3 generic meta-backends: KV, SQL, NoSQL. And we have a separate implementation of the graph layer for each of them. Thus, we can benchmark this layer for each meta-backend separately by implementing each abstraction at least once, without testing every supported DB.

And, by writing benchmarks for the meta-backend itself, we can measure the relative read/write performance of each DB driver compared to the others. Again, this is totally isolated from the actual graph layer implementation - it benchmarks our abstraction over that kind of DB.
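To make the idea concrete, a driver-level write benchmark could look roughly like the sketch below. The `QuadStore` interface and `memStore` baseline here are hypothetical stand-ins, not Cayley's actual types; the point is that `testing.Benchmark` can measure any implementation of the same abstraction, which is exactly how relative per-driver numbers could be produced:

```go
package main

import (
	"fmt"
	"testing"
)

// Quad is a minimal stand-in for a quad type (hypothetical, for illustration).
type Quad struct{ S, P, O, Label string }

// QuadStore is a hypothetical abstraction mirroring a meta-backend driver.
type QuadStore interface {
	AddQuad(q Quad) error
	Count() int
}

// memStore is a trivial in-memory implementation used as a baseline.
type memStore struct{ quads []Quad }

func (m *memStore) AddQuad(q Quad) error { m.quads = append(m.quads, q); return nil }
func (m *memStore) Count() int           { return len(m.quads) }

// benchWrites measures raw AddQuad throughput for any QuadStore constructor,
// so the same harness can be pointed at each driver in turn.
func benchWrites(newStore func() QuadStore) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		s := newStore()
		for i := 0; i < b.N; i++ {
			s.AddQuad(Quad{S: "alice", P: "follows", O: "bob"})
		}
	})
}

func main() {
	res := benchWrites(func() QuadStore { return &memStore{} })
	fmt.Printf("memstore: %d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

Running the same `benchWrites` against each driver's `QuadStore` implementation would yield the comparable per-backend numbers described above.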

Some time ago, Cayley as a project included all DB drivers and implemented a graph layer for each of them. Recent work on v0.7 was dedicated to unifying backends into meta-backends to remove duplicated code for databases of a similar kind. And now, since we have it implemented, we can move all actual database drivers into a separate package. I started to work on this as part of v0.8 and will eventually move all DB-specific code to a new project called hidalgo.

I'm saying this to emphasize that we should split these into two sets of benchmarks: test all DB drivers as part of the Hidalgo project, and test the graph layers on top of these meta-backends in Cayley.

steffansluis commented 6 years ago

Yes, I vaguely remember browsing the code and seeing the meta-backend abstractions! I agree running the tests on CI would be far from ideal, I mostly suggested it as an easy starting point. That being said, I still think using Docker as an abstraction layer would greatly ease testing per DB/meta-backend. I like the idea of splitting up the DB-specific code into a separate project, it would allow Cayley to really focus on providing abstract graph logic. In any case, feel free to shout if/when this becomes an active concern, I'd be happy to aim to help out!

steffansluis commented 6 years ago

I've been thinking about this quite a bit, specifically what I would like to gain from these benchmarks. I want to dump my thoughts as a starting point for what the benchmarks should accomplish. As I see it, fundamentally every backend is designed to serve one or more particular use cases optimally. Depending on the needs of the user, Cayley may be configured with a different backend. The benchmarks should guide the users of Cayley to the backend best suited to their needs. Examples of such needs would be:

Basically, I imagine the process of someone starting to use Cayley to be as follows:

Effectively, I think it would make sense to expect a KV store to be quick at logic involving individual quads, while a relational DB would probably be well suited to heavy traversals. NoSQL DBs are great for documents but not so great for traversals and individual quads? Maybe? These questions should be answered by the benchmarks. Additionally, since Cayley adds an abstraction layer and exposes several different APIs to work with, the benchmarks should also cover which API best suits which use case.

Some example questions:

Thus concludes my ramblings.

Dragomir2020 commented 6 years ago

Hey, I've been using Cayley with the Elasticsearch backend, and when I load data through the CLI it is painfully slow.

I am currently creating a .nq file with my data using Python's rdflib.

I would like to use this to build knowledge graphs on the order of millions of nodes, but at the moment it took me about an hour to import a graph with 5,000 N-Quads.
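For reference, the .nq files being imported here are line-based: each line is a subject, predicate, and object (plus an optional graph label), each in angle brackets, terminated by ` .`. A minimal sketch of emitting that format (the helper name and example IRIs are made up for illustration; this handles only IRI terms, not literals or blank nodes):

```go
package main

import (
	"fmt"
	"strings"
)

// nquad formats one N-Quads line from IRI terms.
// The graph label is optional and omitted when empty.
func nquad(s, p, o, g string) string {
	parts := []string{"<" + s + ">", "<" + p + ">", "<" + o + ">"}
	if g != "" {
		parts = append(parts, "<"+g+">")
	}
	return strings.Join(parts, " ") + " ."
}

func main() {
	fmt.Println(nquad(
		"http://example.org/alice",
		"http://example.org/follows",
		"http://example.org/bob",
		"")) // prints: <http://example.org/alice> <http://example.org/follows> <http://example.org/bob> .
}
```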

I was just wondering if you have had experience with any of the backends being faster?

dennwc commented 6 years ago

@Dragomir2020 Inserts to NoSQL backends will be slower compared to SQL right now. You can try changing the batch size for the import, but it won't change the situation much. I'm currently working on a two-stage import that will make loading large quad files significantly faster (testing on a Freebase data dump).

Dragomir2020 commented 6 years ago

@dennwc Thanks for the info and the current progress!