cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

perf: scaling cockroach clusters while running under cayley, slows performance #17108

Closed aselus-hub closed 6 years ago

aselus-hub commented 7 years ago

Reference cayley code that does writes to cdb: https://github.com/cayleygraph/cayley/blob/d9a72b0288ed17c0601adbc92eb7cb79e5687729/graph/sql/cockroach.go

Essentially the use case is as follows: write 4 quads into cayley, which creates 1 transaction into cockroachdb. The items in the quads are linked to one another. Example quads (the application is generating thousands of these per second):

<NodeA> <related_to> <NodeB> .
<NodeB> <related_to> <NodeC> .
<NodeC> <related_to> <NodeD> .
<NodeD> <related_to> <NodeE> .

NOTE: Application example to come later as per request in gitter. NOTE: All numbers are averages over 3 runs.

When ingesting these quad sets we then measure how many sets per second are ingested into cockroach through the cayley library. When done on a local machine (OS X laptop, latest) we got numbers between 300~600 sets/second ingested (each with 1~5 quads in its set).

When we took the same thing up to AWS and put up 1 cockroach node and 1 'ingestion/generation' node, we got 260 sets/second (the difference is accountable to latency; no errors; cockroach was taking 350% of CPU via top and ingestion 75% [on separate nodes]).

We then expanded cockroach to 3 nodes, placed it behind an ELB, and ran the test again; performance dropped to 230 sets/second (~300% CPU on the first node, and 95% on the other two).

We then thought it might be the generator, so we scaled that to two generators (which gave us the same performance, 226 sets/second), with the generators each taking <50% CPU.

We then cleaned cockroach and added two more nodes (same size), and got 198 sets/second.

AWS cockroach node info: i2.xlarge, with cockroach running on the 800GB SSD as its storage device; tried with the default (15GiB) and with 50GiB of RAM configured for each node; NTP in sync at stratum 2, sub-50ms on all nodes.

NOTE: the runs which had lower performance numbers had a significantly higher number of errors: INFO - {go-log} (*GoLog).Write: 2017/07/18 23:04:12 ERROR: couldn't exec INSERT statement: pq: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1500419052.641330403,0 encountered previous write with future timestamp 1500419052.641409756,0 within uncertainty interval

petermattis commented 7 years ago

@aselus-hub The slight dip in performance when expanding the cluster from 1 to 3 nodes is expected. You've gone from 1 replica to 3 replicas and there is a small amount of overhead to doing so. That performance dropped further when you expanded to 5 nodes is very surprising. I don't have any explanation for that other than the statement that "it shouldn't happen". Can you provide reproduction instructions?

@tschottdorf Any thoughts about the ReadWithinUncertaintyIntervalErrors?

aselus-hub commented 7 years ago

@petermattis the only current reproduction instructions I can give are the ones above: generating a graph, splitting it out into nquads, and writing those quads in batches through cayley into cdb (example of quads as above). I'm going to write an example app that feeds it, but that might take a little bit (as I can't reuse the work code that caused the problem originally). I will link the github repo once the example app is up.

tbg commented 7 years ago

@petermattis the first data point to look at for ReadWithinUncertaintyIntervalErrors is the clock offset. I believe there are clock_offset measurements in /_status/vars. If a node is trailing and contending with other nodes, it will often catch that error. Even with well-synchronized clocks, it could happen sometimes: txn1 starts at ts1, waits, txn2 starts at ts2 > ts1, writes, txn1 tries to write and sees the value at ts2. But it doesn't seem here that overlapping transactions have a very heterogeneously sized read phase.

aselus-hub commented 7 years ago

@petermattis example branch used to generate the numbers below (apologies for the mess, I wrote it in haste): https://github.com/aselus-hub/cayley-cdb-appscaling

Here are the numbers I got using this to test (it follows the pattern above; both local and remote numbers), based on cdb nodes on AWS i2.xlarges (15GiB SQL memory).

From remote:

1 cdb node: 80 sets/second
3 cdb nodes (with lb): 74~83 sets/second if talking to one node; 55 if talking through the lb
5 cdb nodes: 67~69 if talking to one node; 49 if talking through the lb

From AWS localized (5 runs each):

1 cdb node with lb: 239~254 (median example, over 10 seconds):

time elapsed: 10.266989253s
total processed: 2477
sets/second: 241.2586532391883

3 cdb nodes with lb: 193~201 (median example, over 10 seconds):

time elapsed: 10.169260542s
total processed: 2025
sets/second: 199.12952290253162

5 cdb nodes with lb: 170~185 (median example, over 10 seconds):

time elapsed: 10.398752349s
total processed: 1907
sets/second: 183.38738494752087

<10% CPU used on the node.

petermattis commented 7 years ago

@aselus-hub Thanks for putting that example together. Will take a look tomorrow.

petermattis commented 7 years ago

@aselus-hub Wanted to let you know this is still on my radar. Apologies for not being able to take a look at it sooner. Other work has interfered.

petermattis commented 7 years ago

@aselus-hub Finally taking a look at this. I'm not sure what the cause of the slowdown moving to 5 nodes is yet. But examining the transactions I noticed that they are executing a lot of serial operations. For example:

BEGIN
SAVEPOINT cockroach_restart
INSERT INTO nodes(hash, value_string) VALUES ($1, $2) ON CONFLICT (hash) DO NOTHING;
INSERT INTO nodes(hash, value_string) VALUES ($1, $2) ON CONFLICT (hash) DO NOTHING;
INSERT INTO nodes(hash, value_string) VALUES ($1, $2) ON CONFLICT (hash) DO NOTHING;
INSERT INTO quads(subject_hash, predicate_hash, object_hash, label_hash, id, ts) VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT (subject_hash, predicate_hash, object_hash) DO NOTHING;
INSERT INTO quads(subject_hash, predicate_hash, object_hash, label_hash, id, ts) VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT (subject_hash, predicate_hash, object_hash) DO NOTHING;
RELEASE SAVEPOINT cockroach_restart
COMMIT

That's 5 separate INSERT statements. Due to its distributed nature, Cockroach operations have higher latencies. You'd be better off structuring this as 2 INSERTS:

BEGIN
SAVEPOINT cockroach_restart
INSERT INTO nodes(hash, value_string) VALUES ($1, $2), ($3, $4), ($5, $6) ON CONFLICT (hash) DO NOTHING;
INSERT INTO quads(subject_hash, predicate_hash, object_hash, label_hash, id, ts) VALUES ($1, $2, $3, $4, $5, $6), ($7, $8, $9, $10, $11, $12) ON CONFLICT (subject_hash, predicate_hash, object_hash) DO NOTHING;
RELEASE SAVEPOINT cockroach_restart
COMMIT

I'm not seeing anything in https://github.com/cayleygraph/cayley/blob/d9a72b0288ed17c0601adbc92eb7cb79e5687729/graph/sql/cockroach.go that would prohibit this.

For an additional optimization, you can use the RETURNING NOTHING syntax (https://www.cockroachlabs.com/docs/stable/insert.html) and send the INSERTs for the nodes and quads tables in the same query. This would look something like:

INSERT INTO quads ... RETURNING NOTHING; INSERT INTO nodes ... RETURNING NOTHING;

If cockroach sees a query like that it will run the INSERTs in parallel.
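
For illustration only, here is a rough sketch of such a combined query from Go (table and column names are taken from the statements above; the values are made up, and they are inlined because the postgres wire protocol doesn't let placeholders span multiple statements):

    // Hypothetical sketch: both INSERTs in one query string so cockroach can
    // run them in parallel. Values are inlined purely for illustration.
    combined := `
    INSERT INTO nodes (hash, value_string)
        VALUES ('<hash-of-NodeA>', 'NodeA')
        ON CONFLICT (hash) DO NOTHING
        RETURNING NOTHING;
    INSERT INTO quads (subject_hash, predicate_hash, object_hash, label_hash, id, ts)
        VALUES ('<hash-of-NodeA>', '<hash-of-related_to>', '<hash-of-NodeB>', NULL, 1, now())
        ON CONFLICT (subject_hash, predicate_hash, object_hash) DO NOTHING
        RETURNING NOTHING;`
    _, err := tx.Exec(combined)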

Now, time for me to look at the 3 node vs 5 node performance.

petermattis commented 7 years ago

On a GCE test cluster I see:

1-node:  346 ops/sec
3-nodes: 135 ops/sec
5-nodes: 130 ops/sec

I think the fall off in performance between 3 and 5 nodes is because of the relatively small amount of data in the test. That small amount of data resides in a single range and with 3 nodes there is a 1/3 chance of the leaseholder for that range being local on the node that receives the queries. With 5 nodes there is a 1/5 chance of the leaseholder being local. If the data set were larger so that it spanned multiple ranges I would expect the performance to increase with additional nodes.

@tschottdorf I can easily reproduce the ReadWithinUncertaintyIntervalError error on denim which has fairly tight clock synchronization according to the node exporter metrics. Especially if you crank up the concurrency in the test app that @aselus-hub provided. I think there is something to investigate here.

petermattis commented 7 years ago

@tschottdorf I can reproduce the ReadWithinUncertaintyIntervalError problem on a local 5-node cluster. That is, all nodes running on the same machine.

~ rm -fr cockroach-data && roachdemo -n 5 -- --insecure --logtostderr
./cayley-cdb-appscaling
2017/07/28 16:06:23 ERROR: couldn't exec INSERT statement: pq: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1501272383.609559945,0 encountered previous write with future timestamp 1501272383.617180736,0 within uncertainty interval
2017/07/28 16:06:23 ERROR: couldn't exec INSERT statement: pq: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1501272383.607930617,0 encountered previous write with future timestamp 1501272383.617180736,0 within uncertainty interval
tbg commented 7 years ago

What's the statement that does the read? I'm seeing mostly inserts up here.

One way in which this error can still pop up with perfectly synchronized clocks is simply when the "beams cross" (the txn1/txn2 scenario described above).

petermattis commented 7 years ago

Inserts do a conditional put. Isn't that a read?

The "beams cross" scenario requires contention, right? I'm not quite understanding what this test app and cayley are doing (yet).

aselus-hub commented 7 years ago

@petermattis I will try that optimization and add no return, though sadly Go's cdb interface doesn't support multi-value inserts to my knowledge, so I'll build a string for it. Unless there's some other way to do that in cockroach?

In the real dataset example we were inserting about 500k nquads with the same sort of scaling results. If the target is to get that up to at least 3000 inserts/second, for example, are there any steps we can take? Or is that kind of scaling not possible with cockroach and this data model?

petermattis commented 7 years ago

Huh, the errors seem to be occurring when the same nodes.hash is inserted concurrently. I need to step out for a bit, but that seems odd.

tbg commented 7 years ago

Yes, it would require the first (but slower) txn to read something the second one wrote. I would have to check what a CPut would do, but it's possible that it would treat this as a ReadWithinUncertaintyIntervalError. Even though that would be somewhat silly: it could still look at the future value and if it doesn't match the assumption of the CPut, it may as well return a ConditionFailedError right there (I think).

petermattis commented 7 years ago

@petermattis I will try that optimization and add no return, though sadly Go's cdb interface doesn't support multi-value inserts to my knowledge, so I'll build a string for it. Unless there's some other way to do that in cockroach?

Yes, you'll need to build a string, though you can still use placeholders.
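
Something along these lines should work (a hypothetical helper, not the actual cayley code; assumes imports of "database/sql", "fmt", and "strings"):

    // node is a stand-in for however the caller represents a vertex.
    type node struct{ Hash, Value string }

    // insertNodes builds one multi-value INSERT with generated placeholders
    // and executes it inside the caller's transaction.
    func insertNodes(tx *sql.Tx, nodes []node) error {
        if len(nodes) == 0 {
            return nil
        }
        var (
            placeholders []string
            args         []interface{}
        )
        for i, n := range nodes {
            placeholders = append(placeholders, fmt.Sprintf("($%d, $%d)", 2*i+1, 2*i+2))
            args = append(args, n.Hash, n.Value)
        }
        query := "INSERT INTO nodes (hash, value_string) VALUES " +
            strings.Join(placeholders, ", ") +
            " ON CONFLICT (hash) DO NOTHING"
        _, err := tx.Exec(query, args...)
        return err
    }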

In the real dataset example we were inserting about 500k nquads with the same sort of scaling results. If the target is to get that up to at least 3000 inserts/second, for example, are there any steps we can take? Or is that kind of scaling not possible with cockroach and this data model?

Yes, that scaling is possible, especially if you batch sufficiently.

petermattis commented 7 years ago

Yes, it would require the first (but slower) txn to read something the second one wrote. I would have to check what a CPut would do, but it's possible that it would treat this as a ReadWithinUncertaintyIntervalError. Even though that would be somewhat silly: it could still look at the future value and if it doesn't match the assumption of the CPut, it may as well return a ConditionFailedError right there (I think).

I think this may be what is happening. The errors are all occurring on INSERTS for which there appears to be an existing row. These are INSERT ... ON CONFLICT DO NOTHING. But the lack of ConditionFailedError is causing the insert to fail on conflict.

tbg commented 7 years ago

Ok, that seems worth fixing then. Just to be sure, you're also seeing old-fashioned WriteIntentErrors, right? Those would be expected if the "faster" value is in fact an intent. Or perhaps you're not seeing those because of the 1PC optimization?

petermattis commented 7 years ago

@tschottdorf These are definitely not 1PC transactions. I haven't actually looked down at what is happening at the storage layer yet.

I'm seeing INSERTS into the nodes table fail where the hash is ded76379665e2a11c953fbfeca46541832498e61. That is the node for the related_through predicate (literally SHA1("related_through")). @aselus-hub I'm not terribly familiar with cayley, but over a 10 second run where I inserted 2622 nodes, we performed 2218 inserts of the "related_through" node. That seems sub-optimal.

aselus-hub commented 7 years ago

@petermattis Cayley is a graph db that uses cockroach as a storage layer (not of my writing ;-), though I am going to try to write/test/submit your advice into its cockroach connector).

Essentially each nquad is a relationship representation of: parent/subject ---(edge/predicate)---> child/object [tag/label]. Each of the items in the nquad relationship is represented as a node in the nodes table, and the relationship itself is represented in the quads table.

The reason it tries to re-insert the vertex is that it does not itself know that the predicate already exists, so it just does a forcible/ignorable write... I could potentially create an LRU or ARC cache that keeps track of which nodes have already been inserted and reduce the number of collisions that happen in the nodes list; would that help [on top of the RETURNING NOTHING]?

petermattis commented 7 years ago

The reason it tries to re-insert the vertex is that it does not itself know that the predicate already exists, so it just does a forcible/ignorable write... I could potentially create an LRU or ARC cache that keeps track of which nodes have already been inserted and reduce the number of collisions that happen in the nodes list; would that help [on top of the RETURNING NOTHING]?

Yes, avoiding reinserting the node/vertex would likely help, though the batching of INSERTs will probably help more. Note that my suggestion regarding the batching of INSERTs will help any SQL backend. It just has a more noticeable effect on Cockroach, where the latencies are higher.
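
If you do go the cache route, even something much simpler than a true LRU/ARC would cut most of the redundant node INSERTs. A minimal sketch (hypothetical, not cayley code):

    // seenNodes is a bounded "already inserted" set keyed by node hash.
    // Eviction is crude (reset when full) just to keep the sketch short.
    type seenNodes struct {
        max  int
        seen map[string]struct{}
    }

    func newSeenNodes(max int) *seenNodes {
        return &seenNodes{max: max, seen: make(map[string]struct{}, max)}
    }

    // markSeen reports whether hash was already recorded, recording it if not.
    func (s *seenNodes) markSeen(hash string) bool {
        if _, ok := s.seen[hash]; ok {
            return true
        }
        if len(s.seen) >= s.max {
            s.seen = make(map[string]struct{}, s.max)
        }
        s.seen[hash] = struct{}{}
        return false
    }

The writer would then skip the nodes INSERT for any hash where markSeen returns true.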

aselus-hub commented 7 years ago

@petermattis made the modifications as prescribed, with the following results on my local machine: Creating bulk inserts: doing 15+ inserts at a time upped performance to ~1700/second (more than that degraded performance, which I found odd since the documentation said ~500 was ideal; but if I cranked it up to anywhere near there, performance dropped significantly [to ~100 sets]). Is this expected?

"RETURNING NOTHING" : did not effect performance [my guess is because this is on local machine where return latency is low?].

Will do the scaling test next, running it with more data for longer. Any recommendation on how much data I should use for the test?

petermattis commented 7 years ago

@aselus-hub Glad to hear about the progress. Not sure why more than 15 inserts is showing degraded performance. Definitely not expected. It might be related to the indexes cayley uses. We'd be happy to investigate.

In order to see a benefit from more machines, you'll need enough data to occupy multiple ranges. A range is 64MB in size and it splits when it becomes larger. You'll want to test with a data set significantly larger than 64MB (e.g. 1GB).

PS Can you point me towards your edits to cayley? We'll need them to investigate the performance oddity with batches larger than 15.

aselus-hub commented 7 years ago

@petermattis I uploaded the updates to both; the cockroach driver updates are in: https://github.com/aselus-hub/cayley The new changes in cdb_graph_gen extend the batch processing to be more than just one set of nquads (using the 'batchSize' const as the apply-tx threshold).

petermattis commented 7 years ago

@aselus-hub Perusing https://github.com/aselus-hub/cayley/blob/master/graph/sql/cockroach.go#L295-L301 I noticed that you're preparing a statement and then immediately executing it. The prepare requires a roundtrip to the server. You'd be better off skipping the prepare in which case the driver can pipeline the prepare and exec. Something like:

        _, err = tx.Exec(insertQuadStr, quadValues...)

In order to get the 1 round-trip behavior, I think you might also need to set binary_parameters=true in your postgres URL.
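
Roughly, the connection setup would look something like this (hypothetical host and database names; depending on the lib/pq version the option value may need to be spelled yes rather than true):

    // Hypothetical sketch; assumes import ("database/sql"; _ "github.com/lib/pq").
    // binary_parameters lets lib/pq pipeline the PREPARE/BIND/EXEC round trips.
    db, err := sql.Open("postgres",
        "postgresql://root@cockroach-lb:26257/cayley?sslmode=disable&binary_parameters=yes")
    if err != nil {
        return err
    }
    // ... then inside the transaction, skip Prepare and call Exec directly:
    // _, err = tx.Exec(insertQuadStr, quadValues...)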

aselus-hub commented 7 years ago

@petermattis made the modification; I'm going to try to run a prolonged test overnight to see what kind of performance numbers come from it. As such I also added a regular every-'n' status report to the test app.

I tried combining the two inserts into one exec as you had recommended before, but it told me this was not possible, as it detected two inserts in one prepared statement (even after I removed the prepare).

petermattis commented 7 years ago

I tried combining the two inserts into one exec as you had recommended before, but it told me this was not possible, as it detected two inserts in one prepared statement (even after I removed the prepare).

Yes, you can't use multiple statements with prepared statements. I think that's a limitation of the postgres wire protocol.

aselus-hub commented 7 years ago

@petermattis finished running a 14hr test set with 3 of the example applications and 5 CockroachDB instances + lb:

Starting performance: ~200 nodes/sec each (total of ~600)
Inserts/second within the first hour (according to the UI): 100/sec

Ending performance: ~65 nodes/sec each (total of ~195)
Ending total data ingested: 60.8 GB (out of a total of 3.6 TB)
Inserts/second towards the end of the 14 hours (according to the UI): 40/sec

Average over all time: ~80/sec per node

I have the output files for this if a slope is needed.

So it seems that as the data grew, the insert rate sadly dropped more and more, so the increase in dataset size/ranges did not speed up the ingestion rate. This was with the same code that does 2k/second locally [during quick tests], and 1000 msgs/sec / 800 msgs/sec in the 1 writer - 1 cdb setups. So the scalability experiment is still a failure for this kind of table count.

Any other thoughts or should I close the experiments?

Thanks again for all the help.

petermattis commented 7 years ago

@aselus-hub On the admin UI, can you take a look at the "Ranges", "Replicas per Store" and "Leaseholders per Store" graphs? I'm curious to know what the count of ranges is and how balanced the replicas and leaseholders are.

aselus-hub commented 7 years ago

@petermattis Snapshot info taken at the end of the run; all in all the numbers aren't that far off, though a bit smaller than I would have expected.

Ranges:

Ranges: 350
Leaders: 350
Lease Holders: 349

Replicas per node:

Node 1 - 214
Node 2 - 208
Node 3 - 203
Node 4 - 210
Node 5 - 215

Leaseholders per Store

Node 1 - 69
Node 2 - 72
Node 3 - 66
Node 4 - 75
Node 5 - 67

Bonus Node summary information:

ID | ADDRESS | UPTIME | BYTES | REPLICAS | MEM USAGE | LOGS
1 | ip- xx | 18 hours | 7.7 GiB | 217 | 11.7 GiB | Logs
2 | ip- xx | 18 hours | 7.9 GiB | 211 | 11.8 GiB | Logs
3 | ip- xx | 18 hours | 7.5 GiB | 203 | 12.9 GiB | Logs
4 | ip- xx | 18 hours | 7.4 GiB | 213 | 12.6 GiB | Logs
5 | ip- xx | 18 hours | 7.6 GiB | 215 | 11.8 GiB | Logs
TOTAL LIVE NODES: 5
TOTAL BYTES: 38.1 GiB
TOTAL REPLICAS: 1059
TOTAL MEMORY USAGE: 60.8 GiB

aselus-hub commented 7 years ago

@petermattis any other data I should pull from there before I shutdown the cluster?

petermattis commented 7 years ago

@aselus-hub Nope, that is sufficient. The replica and leaseholder balance look good. Go ahead and shut it down. I'll take a crack at running your test app on a real cluster and see if I notice anything.

bdarnell commented 7 years ago

I would have to check what a CPut would do, but it's possible that it would treat this as a ReadWithinUncertaintyIntervalError.

Yes, this appears to be the case.

Even though that would be somewhat silly: it could still look at the future value and if it doesn't match the assumption of the CPut, it may as well return a ConditionFailedError right there (I think).

ConditionFailedError returns the "current" value for the key. It seems risky to look into the future and return the value that's there. And what if you look into the future and see a record that does match the CPut's requirement? We can't return nil in that case (since the condition is not true at the transaction's proposed timestamp), so we still have to return ReadWithinUncertaintyIntervalError.

dsymonds commented 6 years ago

FYI, I'm interested in seeing something improved here. I hacked up the equivalent of what @petermattis suggested in https://github.com/cockroachdb/cockroach/issues/17108#issuecomment-318739472 into the latest version of Cayley's code (which has unfortunately moved on a fair bit from the base of @aselus-hub's fork), and I saw a 4-5x speedup on doing a large number of insertions.

petermattis commented 6 years ago

@dsymonds I can provide additional guidance on performance changes to Cayley, but you will either have to shepherd those upstream or convince the Cayley folks it is worth fixing themselves. As I mentioned in that comment, round-trips affect Cockroach performance more than traditional databases. There are usually ways to structure an application's logic to reduce the round-trips. Is there something else you're looking for here?

dsymonds commented 6 years ago

No, I'm looking into pushing the improvements upstream to Cayley. I just wanted to note the magnitude of the speedup that I observed. It'd be nice if CockroachDB did it automatically (that is, coalescing value insertions inside a transaction that have conflict resolution and don't return values), but it's not as necessary for my specific use case.

petermattis commented 6 years ago

It'd be nice if CockroachDB did it automatically (that is, coalescing value insertions inside a transaction that have conflict resolution and don't return values), but it's not as necessary for my specific use case.

Agreed that fixing this in Cockroach would be ideal. There are a few ideas and experiments in this area that we'll be investigating for the 2.1 release (scheduled for October).

dsymonds commented 6 years ago

I've filed cayleygraph/cayley#691 to chase this upstream.

dsymonds commented 6 years ago

FYI, I fixed cayleygraph/cayley#691 based on the ideas here (multi-value insert statements, and avoiding returning data when not needed). @petermattis If you have any other suggestions for the code affected there (see b49c06e), I'd be happy to try it out and see if I can get further performance improvements.

petermattis commented 6 years ago

@dsymonds Very nice. I took a quick look at your change. A definite improvement. Some comments below:

  • fmt.Fprint(&query, ";")

This shouldn't be necessary. A semicolon is only necessary to terminate a statement if the query contains multiple statements.

  • _, err := tx.Exec(query.String(), allValues...)

The use of placeholders triggers different code paths within the lib/pq driver. I can't recall if lib/pq pipelines the low-level PREPARE/BIND/EXEC operations or if it sends them one at a time. There is another Go postgres driver which the knowledgeable Cockroach engineers say is much better: https://github.com/jackc/pgx. I'm not sure if it would have an effect here, but it is worth a small bit of time exploring. The other option is to specify binary_parameters=true in the postgresql:// URL. A brief reading of the lib/pq source shows that this sends the PREPARE/BIND/EXEC operations before waiting for any results.

Yet another area to experiment with is to avoid using placeholders at all. This is somewhat more dangerous as you'd have to guarantee you're properly quoting the values in the query. The advantage is that you can send multiple semicolon separated statements in a single call to tx.Exec(). Ideally, this shouldn't have any benefit given you're specifying RETURNING NOTHING.
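
A rough sketch of that no-placeholder variant (hypothetical helpers; quoteLiteral below is a naive escape shown only for illustration, and real code should use a vetted quoting routine):

    // buildNodeInserts builds a multi-statement string with inlined, quoted
    // values; the result is sent in a single tx.Exec call with no arguments.
    // Assumes import ("bytes"; "fmt"; "strings").
    func buildNodeInserts(nodes []node) string {
        var b bytes.Buffer
        for _, n := range nodes {
            fmt.Fprintf(&b,
                "INSERT INTO nodes (hash, value_string) VALUES (%s, %s) "+
                    "ON CONFLICT (hash) DO NOTHING RETURNING NOTHING; ",
                quoteLiteral(n.Hash), quoteLiteral(n.Value))
        }
        return b.String()
    }

    // quoteLiteral doubles single quotes; illustration only.
    func quoteLiteral(s string) string {
        return "'" + strings.Replace(s, "'", "''", -1) + "'"
    }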

tbg commented 6 years ago

@petermattis anything concrete left to do here? Based on my quick skim of the thread, this might be worth considering:

https://github.com/cockroachdb/cockroach/issues/17108#issuecomment-318761706

petermattis commented 6 years ago

I've created #28461. I read through the rest of this issue and don't see anything else that is actionable.