cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv: write a test for transaction starvation #628

Closed: petermattis closed this issue 7 years ago

spencerkimball commented 9 years ago

We restart serializable transactions whenever they're forced to adjust their timestamps over the course of execution. This occurs because they try to write to keys that were read more recently than the timestamp at which the Txn started. We use a priority system that exponentially bumps a restarted txn's priority, so starvation is avoided in log time even under heavy contention. It would be great to actually test this, though! That test would involve a relatively long-running txn, say one writing to 100 keys while many other concurrent readers are accessing the same keys. Such a Txn should only restart once according to our current implementation.

The more interesting restart case is a relatively long-running Txn trying to write 100 keys with many concurrent writers also contending for the same keys. In this case the long-running Txn is expected to abort and restart, and we expect this to happen log(N) times, where N is the number of concurrent writers.
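
A rough sketch of the write/write scenario described above, assuming the client API of the era used elsewhere in this thread (db.Txn with a retryable closure, txn.Put) plus the standard fmt, sync, and sync/atomic packages; the helper name, key names, and parameters are illustrative, not part of any existing test:

    // countLongTxnRestarts (hypothetical helper): run one long txn that
    // writes numKeys keys while numWriters goroutines contend on the same
    // keys with short txns. Counting closure invocations approximates the
    // number of restarts; per the comment above, it should grow roughly
    // like log(numWriters) rather than linearly.
    func countLongTxnRestarts(db *client.DB, numKeys, numWriters int) (int32, error) {
        var attempts int32
        stop := make(chan struct{})
        var wg sync.WaitGroup
        for w := 0; w < numWriters; w++ {
            wg.Add(1)
            go func(w int) {
                defer wg.Done()
                for {
                    select {
                    case <-stop:
                        return
                    default:
                    }
                    // Short contending txn: overwrite one of the shared keys.
                    _ = db.Txn(func(txn *client.Txn) error {
                        return txn.Put(fmt.Sprintf("key-%03d", w%numKeys), []byte("short"))
                    })
                }
            }(w)
        }
        // Long txn: write every contended key in a single transaction.
        err := db.Txn(func(txn *client.Txn) error {
            atomic.AddInt32(&attempts, 1)
            for i := 0; i < numKeys; i++ {
                if err := txn.Put(fmt.Sprintf("key-%03d", i), []byte("long")); err != nil {
                    return err
                }
            }
            return nil
        })
        close(stop)
        wg.Wait()
        return attempts, err
    }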

cosmicturtlesoup commented 9 years ago

I'd be interested in tackling this issue...

tbg commented 9 years ago

@cosmicturtlesoup cool. It's all yours.

tbg commented 9 years ago

via @cosmicturtlesoup:

Large numbers of concurrent txns against a single key on a single node will fail to commit. Small numbers work fine.

For example, 55 concurrent txns finish in just a few seconds, but if the same test is run with 105 concurrent txns, only a small number complete (like 5 or so) and even those take a long time to commit... and then the system fails to commit any further txns and eventually times out.

Tested in both CentOS7 and Ubuntu 15.04. I'm submitting a pull request shortly which demonstrates this issue.

Sample args (test added in the pull request):

    make acceptance TESTS=TestSingleKeySingleNodeConcurrentTxn TESTFLAGS="-num 3 -numTrans 55 -d 4s -v"

See example for output of a small test which passes and larger test which fails.

It's normal that things will eventually fall apart (with any database). The test in #1959 is the definitive worst case: all Txns fighting for a single key. The goal here is more fine-grained: compare the expected performance against the actual performance. It's still interesting to see where between 55 and 105 txns the destructive effects creep in, though. That would involve measuring restarts (as suggested in #628) and the run times of individual transactions, to see whether any sudden spikes or races emerge. For starters, could you sample the total run time of the test with k=5,10,...,105 concurrent txns? It would be interesting to see that curve.
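
A sketch of the sampling suggested here, assuming the standard log and time packages; runConcurrentTxnTest is a hypothetical stand-in for the acceptance test body run with k concurrent txns:

    // Record total wall-clock time for increasing txn counts to see where
    // the curve breaks down.
    for k := 5; k <= 105; k += 5 {
        start := time.Now()
        runConcurrentTxnTest(k) // hypothetical: run the contention test with k concurrent txns
        log.Printf("k=%d concurrent txns took %s", k, time.Since(start))
    }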

cosmicturtlesoup commented 9 years ago

Actually, I have code to capture the restart counts, but I had distilled the test down to the simplest change that shows the timeout behavior. I'll add that code back in and report back.

In the meantime, attached are the results for 4 runs of between 5 and 80 or so concurrent txns. As you can see from the numbers, when things run sanely the process finishes in under 30s, but if/once things go bad, they stay bad until the test times out.

Run4 is of particular note because there was a 5min timeout at 55 and 80 txns but 60/65/70/75 were ok.

The other runs were terminated manually after several minutes of no progress. Run1 chewed up several GB of RAM before I stopped the process. The other runs were better behaved (RAM usage didn't grow much even during the "no progress" phase -- where "no progress" means no commits over a minute or two).

[attachment: concurrent_txn_timeout_analysis]

cosmicturtlesoup commented 9 years ago

@tschottdorf: Question -- I'm exposing the client.Txn.txn to get at the proto.Transaction.Epoch which I understood to be the count of the txn restarts. However, Epoch+1 doesn't always match the number of times the txn closure is called -- there's no obvious pattern as to how much they differ. Should the Epoch count in client.Txn.txn be the number of retries, or am I looking at the wrong Epoch field?

tbg commented 9 years ago

The closure is called again after each restart, but also after each abort. It's going to be most convenient if you start counting the calls; there's probably no need to distinguish between restarts and aborts (after all, your test only has write/write conflicts and they will always abort someone). That saves you the need to expose any of the internals.

If things get slow (and nothing is seriously wrong), then likely the transactions just wind up backing off for way too long, to the point where they choke each other to death. Could you investigate a few individual transactions' traces and present them here? You can get traces via --vmodule=monitor=2. You can pick out the transactional traces by grepping for '^t' (or some variation thereof), and the output can then be piped through sort. Then just look at some of the traces that took a long time and post some of them.

One thing that you might want to try is hard-coding a lower initial and maximum backoff in store.go:

var (
    // defaultRangeRetryOptions are default retry options for retrying commands
    // sent to the store's ranges, for WriteTooOld and WriteIntent errors.
    defaultRangeRetryOptions = retry.Options{
        InitialBackoff: 50 * time.Millisecond,
        MaxBackoff:     5 * time.Second,
        Multiplier:     2,
    }
)

Try 50ms max and initial 5ms and see whether it dramatically improves results.
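
For reference, a sketch of the lowered settings being suggested as an experiment (not the shipped defaults):

    defaultRangeRetryOptions = retry.Options{
        InitialBackoff: 5 * time.Millisecond,  // down from 50ms
        MaxBackoff:     50 * time.Millisecond, // down from 5s
        Multiplier:     2,
    }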

cosmicturtlesoup commented 9 years ago

I tried the 50ms max and 5ms initial backoff and got essentially the same results -- 55 concurrent txns to a node complete in a smallish number of seconds, whereas with 85 concurrent txns to a node, only a few commit and then no further progress is made.

Just to clarify, though: it's not just slow for 85 txns; something appears to be seriously wrong, in that even though there are a handful of retries per second, none succeed in committing (after the first few). Here is a summary per second (where "expected" is the count of commits):

[attachment: retry counts]

Note that after the first 14 commits which occur in the first 10 or so seconds, no further commits occur... eventually I stopped the job (ctrl-c).

The args were: 85 concurrent txns to node 0 (3 nodes spun up, only one used for txns), first txn with a 3 sec sleep.

I tried adding the debug flag you mentioned but I'm not seeing what I should sort on, so I uploaded the log. I'd be more than happy to process it in some way. I assumed you wanted the txns sorted by duration, but since none complete (after the first 14), there's nothing I see to sort. Also, you may have meant I should capture the server log from the Docker node, not the client log -- if so, can you let me know where in the Docker container the server log resides?

[attachment: complete log for the run -- shows txn retries about 3 per second, with no commits occurring]

cosmicturtlesoup commented 9 years ago

For anyone trying to follow along, I'll just take a moment to recap how we got to this point and what I'm attempting, since we're on a fairly long sidebar from the original issue #628.

Originally I started with acceptance/single_key_test.go since that was already set up to test N concurrent txns against the same key, where N is the number of nodes spun up. On each node there is only one txn started/executed at a time (txns are done in a loop, not via goroutines).

To create a long txn, I added a sleep in the middle of the closure for the first txn.
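
Roughly, the long-txn variant looks like the read-increment-write closure quoted later in this thread, with an artificial pause in the middle (longTxnSleep is illustrative; the runs described earlier used about 3 seconds):

    err := db.Txn(func(txn *client.Txn) error {
        r, err := txn.Get(key)
        if err != nil {
            return err
        }
        // Keep the transaction open so concurrent writers pile up on the key.
        time.Sleep(longTxnSleep)
        var v testVal
        if err := v.UnmarshalBinary(r.ValueBytes()); err != nil {
            return err
        }
        b := &client.Batch{}
        b.Put(key, v+1)
        return txn.CommitInBatch(b)
    })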

I could only get about 25-30 nodes spun up (more ran out of memory). Note that for single_key_test.go there's just one active txn per node, so the number of nodes equals the number of concurrent txns.

With that small number of 25-30 concurrent txns, we can show that with read/write concurrent txns the writing txn restarts once. We can also show that with write/write concurrent txns the long writer restarts a few times. This is consistent with log N behavior, but with such a small N it isn't conclusive. If we just want to confirm the gist of the behavior, I can report those numbers and submit the code, and we can close out the original #628 read/write and write/write investigation. Let me know if we should just report the numbers we have so far.

I wanted to test much larger N than 30, so since I couldn't increase the node count, I decided to increase the number of concurrent txns per node to M via goroutines (for N*M concurrent txns). The test timed out for multiple nodes, so I just ran the concurrent txns against one node -- same result, the test timed out. Further debugging showed that I could get about 50 concurrent txns to commit on a node, but larger numbers like 85-100 would stop working after a few commits -- even though the CPUs were cranking. Thus, I filed #1958 with example code for #1959. Basically, rather than slowly degrading performance, the system grinds to a halt after a small number of txns commit. @tschottdorf moved #1958 here and is guiding me through analysis of the issue... which brings us up to date.

cosmicturtlesoup commented 9 years ago

@spencerkimball, @tschottdorf: can you clarify what is meant by "concurrent writers" in the original #628? When you say "we expect this to happen log(N) times, where N is the number of concurrent writers", do you mean the number of writers that start during the long txn (whether they're finished or not), or the number of writers that are active (started but not committed) at any given time while the long transaction is active? So, case 1 or case 2 below?

Case 1: For example, you could have a long transaction in thread1 while in thread2 you execute say 100 transactions serially (and thread2 stops when the thread1 long txn commits). In this case, each time the long txn is retried, there is at most 1 active transaction it has to compete with.

Case 2: Or, you could have a long transaction in thread1 with 100 threads in parallel where each thread is executing txns one after another. In this case, each time the long txn is retried, it has to compete with 100 active txns.
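
A sketch of the two interpretations, where writeClosure (a func(*client.Txn) error touching the contended key) and longTxnDone are hypothetical placeholders:

    // Case 1: at most one competing writer at a time; 100 short txns run
    // back to back in a single goroutine.
    go func() {
        for i := 0; i < 100 && !longTxnDone(); i++ {
            _ = db.Txn(writeClosure)
        }
    }()

    // Case 2: 100 competing writers active concurrently, each issuing
    // short txns until the long txn commits.
    for i := 0; i < 100; i++ {
        go func() {
            for !longTxnDone() {
                _ = db.Txn(writeClosure)
            }
        }()
    }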

tbg commented 9 years ago

I tried adding the debug flag you mentioned but I'm not seeing what I should sort on, so I uploaded the log

Sorry, I was talking about the server-side log. If you're running docker, I've found it easiest to get that by some invocation of docker logs $(docker ps --no-trunc | grep /data0 | awk '{print $1}') &> log0 & sleep 1 && less log0. The traces (see #1446) need higher verbosity logging, so you'll want to add --vmodule=monitor=2 to the []string in acceptance/localcluster/localcluster.go#startNode().
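
In other words (a hypothetical sketch only, assuming startNode() assembles the node's command line as a []string of flags):

    // Bump monitor verbosity so transaction traces show up in the server log.
    args = append(args, "--vmodule=monitor=2")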

I should put some more work into making the traces easily accessible using the UI. It's a bit tedious to fish them from the logs.

tbg commented 9 years ago

I should also add that while the acceptance tests theoretically log to files as well, they log in proto format, and the cockroach log command is really annoying to work with; at least the last time I tried, the logging was actually broken. I'll re-check that and post an issue.

cosmicturtlesoup commented 9 years ago

I was going to capture the server logs, but after merging, I'm seeing new behavior: instead of the test timing out after 5 minutes, an error is reported and the run stops after a few seconds. This new error occurs when I try a moderate number of concurrent txns (35+). With about 35 concurrent txns, sometimes all txns commit and the test completes; other times the error occurs (so on success the txns commit, but on failure I get this new error message instead of the previous no-further-commits timeout behavior -- let me know if you want the logs). For larger numbers of txns (like 85), I always get the error message:

E0817 21:10:36.822428 34116 client/txn.go:350 failure aborting transaction: kv/txn_coord_sender.go:408: cannot commit a read-only transaction; abort caused by: kv/txn_coord_sender.go:370: transaction must not write on multiple coordinators
E0817 21:10:36.822469 34116 acceptance/single_key_transaction_onenode_test.go:152 problem with transaction
--- FAIL: TestSingleKeyTransactionRetryOneNode (21.24s)
    single_key_transaction_onenode_test.go:174: kv/txn_coord_sender.go:370: transaction must not write on multiple coordinators

It would seem this error is invalid -- the txn just does a read and a batch write (just one write).

Here's the closure for the txn -- hasn't changed from what I was using before:

    // Read the current value of key, decode and increment it, then write
    // it back, committing the Put in the same batch as the transaction
    // commit (CommitInBatch).
    err := db.Txn(func(txn *client.Txn) error {
        r, err := txn.Get(key)
        if err != nil {
            return err
        }
        var v testVal
        if err := v.UnmarshalBinary(r.ValueBytes()); err != nil {
            return err
        }
        b := &client.Batch{}
        b.Put(key, v+1)
        return txn.CommitInBatch(b)
    })

Just guessing at the cause of the error: if a txn aborted and is being retried, does the flag that marks that the transaction has attempted writes get cleared for each retry?

bdarnell commented 9 years ago

I think the erroneous "multiple coordinator"/"can't commit read-only txn" messages are caused by the problem being fixed in #2128.

tbg commented 9 years ago

Yeah, #2128 is in, so hopefully you shouldn't see that anymore.

cosmicturtlesoup commented 9 years ago

I see the #2128 issue much less often, but it just occurred (once out of about 10 runs so far). I'll keep an eye out and see if I can find args that make it repeatable.

For the other runs, I'm back to seeing the timeout behavior, so I'm working on getting the server logs.

tbg commented 8 years ago

The ledger test is a good basic test of how well we do under contention (spoiler alert: not that well). What's very obvious when running that test is that there's a definite degradation over time (we might be clogging some pipes with async intent resolutions).

I'm going to look into this on the side (currently experimenting with contention handling that uses the txn timestamp as a priority: older is better). Obviously that works way better for this type of workload, but it still doesn't give satisfactory results (presumably due to the clogging).
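
A minimal sketch of the idea, with illustrative types rather than the real ones: on a write/write conflict, the transaction with the older timestamp wins the push, so a long-running txn eventually makes progress instead of being repeatedly aborted by newcomers.

    func olderTxnWins(pusherStart, pusheeStart time.Time) bool {
        // Older timestamp acts as higher priority.
        return pusherStart.Before(pusheeStart)
    }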

spencerkimball commented 7 years ago

This has been addressed via #13501