cayleygraph / cayley

An open-source graph database
https://cayley.io
Apache License 2.0

Performance, scaling and databases #716

Open steffansluis opened 6 years ago

steffansluis commented 6 years ago

I'm aiming to process a lot of data as quickly as possible, and I'm finding that the connection limit of the Postgres instance backing my Cayley is the bottleneck. I started thinking: I can scale up, but maybe there is a bigger issue. It got me wondering about the general performance of Cayley, in particular how it scales with different databases backing it. I looked around a bit and read (skimmed) this.

I feel like it should be doable to set up performance testing across different use cases and backends. It can be done in phases: general performance testing first, backend-specific testing later. My first thought would be to use CircleCI 2.0 here because of its affinity with Docker. I'd be happy to help out, although I cannot pledge any time for now. In any case, I'm curious about your thoughts :smiley:.

dennwc commented 6 years ago

First of all, yes, per-database benchmarks are something we definitely want to have at some point, although using a CI service might distort results significantly since it runs on shared hardware.

In any case, your point about scaling is correct, and it breaks down into three main factors:

1. Does the backend DB itself scale well? This is not the bottleneck in most cases, although we might hit limits on Bolt and Postgres sometimes.
2. Do we use the transactions/queries of that DB optimally in the meta-backend driver?
3. How fast is our graph layer on top of that specific meta-backend?

Currently, we have only 3 generic meta-backends: KV, SQL, NoSQL. And we have a separate implementation of the graph layer for each of them. Thus, we can benchmark this layer for each meta-backend separately by implementing each abstraction at least once, without testing every supported DB.

And, by writing benchmarks for the meta-backend itself, we can measure the relative read/write performance of each DB driver compared to the others. Again, this is totally isolated from the actual graph layer implementation - it benchmarks our abstraction over that kind of DB.
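To make the idea concrete, a driver-level write benchmark could look roughly like the sketch below. The `QuadStore` interface and `memStore` baseline here are hypothetical stand-ins, not Cayley's actual types; the point is that `testing.Benchmark` can measure any implementation of the same abstraction, which is exactly how relative per-driver numbers could be produced:

```go
package main

import (
	"fmt"
	"testing"
)

// Quad is a minimal stand-in for a quad type (hypothetical, for illustration).
type Quad struct{ S, P, O, Label string }

// QuadStore is a hypothetical abstraction mirroring a meta-backend driver.
type QuadStore interface {
	AddQuad(q Quad) error
	Count() int
}

// memStore is a trivial in-memory implementation used as a baseline.
type memStore struct{ quads []Quad }

func (m *memStore) AddQuad(q Quad) error { m.quads = append(m.quads, q); return nil }
func (m *memStore) Count() int           { return len(m.quads) }

// benchWrites measures raw AddQuad throughput for any QuadStore constructor,
// so the same harness can be pointed at each driver in turn.
func benchWrites(newStore func() QuadStore) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		s := newStore()
		for i := 0; i < b.N; i++ {
			s.AddQuad(Quad{S: "alice", P: "follows", O: "bob"})
		}
	})
}

func main() {
	res := benchWrites(func() QuadStore { return &memStore{} })
	fmt.Printf("memstore: %d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

Running the same `benchWrites` against each driver's `QuadStore` implementation would yield the comparable per-backend numbers described above.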

Some time ago, Cayley as a project included all DB drivers and implemented a graph layer for each of them. Recent work on v0.7 was dedicated to unifying backends into meta-backends to remove duplicated code for databases of a similar kind. And now, since we have it implemented, we can move all actual database drivers into a separate package. I started to work on this as part of v0.8 and will eventually move all DB-specific code to a new project called hidalgo.

I'm saying this to emphasize that we should split these into two sets of benchmarks: test all DB drivers as part of the Hidalgo project, and test the graph layers on top of these meta-backends in Cayley.

steffansluis commented 6 years ago

Yes, I vaguely remember browsing the code and seeing the meta-backend abstractions! I agree running the tests on CI would be far from ideal, I mostly suggested it as an easy starting point. That being said, I still think using Docker as an abstraction layer would greatly ease testing per DB/meta-backend. I like the idea of splitting up the DB-specific code into a separate project, it would allow Cayley to really focus on providing abstract graph logic. In any case, feel free to shout if/when this becomes an active concern, I'd be happy to aim to help out!

steffansluis commented 6 years ago

I've been thinking about this quite a bit, specifically what I would like to gain from these benchmarks. I want to dump my thoughts as a starting point for what the benchmarks should accomplish. As I see it, fundamentally every backend is designed to serve one or more particular use cases optimally. Depending on the needs of the user, Cayley may be configured with a different backend. The benchmarks should guide the users of Cayley to the backend best suited to their needs. Examples of such needs would be:

Basically, I imagine the process of someone starting to use Cayley to be as follows:

Effectively, I think it would make sense to expect a KV store to be quick at logic involving individual quads, while a relational DB would probably be well suited to heavy traversals. NoSQL DBs are great for documents but not so great for traversals and individual quads? Maybe? These questions should be answered by the benchmarks. Additionally, since Cayley adds an abstraction layer and exposes several different APIs to work with, the benchmarks should also cover which API best suits which use case.

Some example questions:

Thus concludes my ramblings.

Dragomir2020 commented 6 years ago

Hey, I've been using Cayley with the Elasticsearch backend, and when I load data through the CLI it is painfully slow.

I am currently creating a .nq file with my data using Python's rdflib.

I would like to use this to build knowledge graphs on the order of millions of nodes, but at the moment it took me about an hour to import a graph with 5,000 N-Quads.
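For reference, the .nq files being imported here are line-based: each line is a subject, predicate, and object (plus an optional graph label), each in angle brackets, terminated by ` .`. A minimal sketch of emitting that format (the helper name and example IRIs are made up for illustration; this handles only IRI terms, not literals or blank nodes):

```go
package main

import (
	"fmt"
	"strings"
)

// nquad formats one N-Quads line from IRI terms.
// The graph label is optional and omitted when empty.
func nquad(s, p, o, g string) string {
	parts := []string{"<" + s + ">", "<" + p + ">", "<" + o + ">"}
	if g != "" {
		parts = append(parts, "<"+g+">")
	}
	return strings.Join(parts, " ") + " ."
}

func main() {
	fmt.Println(nquad(
		"http://example.org/alice",
		"http://example.org/follows",
		"http://example.org/bob",
		"")) // prints: <http://example.org/alice> <http://example.org/follows> <http://example.org/bob> .
}
```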

I was just wondering if you have had experience with any of the backends being faster?

dennwc commented 6 years ago

@Dragomir2020 Inserts to NoSQL backends will be slower compared to SQL right now. You can try changing the batch size for the import, but it won't change the situation much. I'm currently working on a two-stage import that will make loading large quad files significantly faster (testing on a Freebase data dump).

Dragomir2020 commented 6 years ago

@dennwc Thanks for the info and the current progress!