cockroachdb / cockroach


storage: unexpected IO overload in admission/follower-overload/presplit-control #82109

Closed: tbg closed this issue 2 years ago

tbg commented 2 years ago

The admission/follower-overload/presplit-control test was introduced in https://github.com/cockroachdb/cockroach/pull/81516.

Despite the name, no overload should occur in it. We run two workloads: a kv0 workload, and a low-rate kv-50 workload whose writes (roughly 10 KB/s) reach n1 and n2 only via raft appends.

The expectation is that this "just runs fine until out of disk". The reality is that a couple of hours in, n1 and n2 end up in admission control due to their LSM inverting.

I've dug into this quite a bit but think that I have ruled out (or at least not seen evidence of) the obvious culprits.

I recorded my detailed analyses as Loom videos. They're a couple of minutes each. There's also a Slack thread that might pick up some more minute details as the day goes by, but anything major will be recorded here.

Part 1: finding out that the cluster is in bad health and looking at graphs: https://www.loom.com/share/5d594ac594f64dd3bd6b50b8b653ca33

Part 2: a look at the LSM visualizer for https://gist.github.com/tbg/7d47cc3b6bfc5d9721579822a372447e https://gist.github.com/tbg/c1552f8d92583c91f9996323608c647e https://gist.github.com/tbg/83d49ce2c205121b17b32948de1720b8: https://www.loom.com/share/2e236668c1bb4a67b52df5b64e0c231f

Part 3: with the IOPS upped and n1 and n2 restarted with a compaction concurrency of 8, both of them still seem overloaded: https://www.loom.com/share/8ac8b33d082645ce9aff780eeedd00cb

It's too early to really tell, though, since the leases still have to balance out. I will comment in a few hours when a new steady state has been reached.
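For reference, a rough sketch of how a restart with higher compaction concurrency can be applied via roachprod. The --env flag on roachprod start exists, but COCKROACH_ROCKSDB_CONCURRENCY being the knob used here is my assumption, not something stated above:

# stop n1 and n2, then restart them with a higher compaction concurrency
# (assumes COCKROACH_ROCKSDB_CONCURRENCY is the relevant env var)
$ roachprod stop tobias-1653898349-02-n4cpu4:1-2
$ roachprod start --env "COCKROACH_ROCKSDB_CONCURRENCY=8" tobias-1653898349-02-n4cpu4:1-2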

Related to #79215

Jira issue: CRDB-16213

Epic CRDB-15069

tbg commented 2 years ago

I turned the kv0 workload off to let n1 and n2 get healthy again (they were so unhealthy that they'd periodically miss heartbeats and lose all leases). Doing so was instructive because, uh, what's pebble doing with its newfound freedom? It's certainly not cleaning up L0. The workload stopped at 8:37.

[graphs omitted]

My perhaps naive reading is that pebble is working as fast as it can, but it's not clearing out L0, presumably because it first needs to clear out the lower levels of the LSM to make room (while still receiving a small amount of writes to the LSM via the kv-50 workload, which writes at 10 KB/s, i.e. a negligible amount, via raft appends to n1 and n2).

A few minutes later, L0 did clear out, and interestingly, for the minutes prior to this, we see compaction throughput rise significantly:

[graph omitted]

My (again, naive) interpretation of this is that possibly some larger compactions were blocking L0->Lbase compactions that could have used more of the concurrency slots. Somehow the LSMs on n1 and n2 spent too much time in a regime where they weren't able to fully use the disks. I assume someone more knowledgeable can see things like that from the lsm log inspections linked in the initial post.

tbg commented 2 years ago

It took quite some time (50 minutes perhaps) for n1 and n2 to get done with their outstanding compactions:

[graph omitted]

I'm going to restart the workload and will expect the nodes to re-enter the bad regime in an hour or two.

Btw, the cluster is tobias-1653898349-02-n4cpu4.

tbg commented 2 years ago

Update: a few hours in, we're back to IO overload. Notably, this is also starting to affect n3, which had previously been unaffected. This makes some sense: we doubled the IOPS for n1 and n2, so maybe they're slowing down "less" due to their own overload and manage to push n3 over the edge as well.

nicktrav commented 2 years ago

Dropping in some comments based on observations from yesterday.

Having all nodes on 250 MB/s seemed to help for a while, though it sounds like we ended up in a throttling regime again.

Understanding the throughput characteristics of the gp3 EBS volumes seems prudent. Current observations seem to indicate that we can drive significantly more throughput on the device than provisioned.
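For anyone following along, a quick way to confirm what a gp3 volume is actually provisioned at (the volume ID below is a placeholder):

# gp3 volumes report their provisioned size, IOPS, and throughput (MiB/s)
$ aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
    --query 'Volumes[].[VolumeId,VolumeType,Size,Iops,Throughput]' --output table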

tbg commented 2 years ago

I've since run multiple iterations of this experiment. For the minute details, I will refer to these Slack threads:

Number one: this one has more stumbling, so it's probably better skipped by most.

Number two, which just "sort of" wrapped up as I'm typing this.

There were more precursor Slack threads over the last couple of weeks, but I think they are less important, so I'm not going to dig them up.

The high-level narrative that seems to emerge is that we were really dealing with gp3 behavior that we didn't properly understand. The experiment initially starts out with a 500 GB gp3 disk provisioned at 125 MB/s and 3K IOPS. It turns out, though, that you really get a RAID0 of two volumes:

nvme2n1     259:0    0  93.1G  0 disk
└─md0         9:0    0 592.9G  0 raid0 /mnt/data1
nvme1n1     259:1    0   500G  0 disk
└─md0         9:0    0 592.9G  0 raid0 /mnt/data1
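For anyone poking at a similar cluster, a sketch of how to see what went into the array (names as shown above; they may differ elsewhere):

# block devices, sizes, and mountpoints
$ lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# members and layout of the RAID0 array backing /mnt/data1
$ sudo mdadm --detail /dev/md0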

Now, why the first one is 93.1G and the second one is 500G, don't ask me. I would love to learn whether this setup is somehow accidental or whether that is how it always works. This is how the VMs are created; it looks pretty vanilla:

13:58:03 roachprod.go:1240: Creating cluster tobias-1654264678-01-n4cpu4 with 4 nodes: {ClusterName:tobias-1654264678-01-n4cpu4 Lifetime:12h0m0s CustomLabels:map[] GeoDistributed:false VMProviders:[aws] SSDOpts:{UseLocalSSD:false NoExt4Barrier:true FileSystem:ext4} OsVolumeSize:10}

But there is also this --aws-enable-multiple-stores flag, and things really start to look suspicious here, now that I'm paying attention. This second 94 GB disk is not something I have bumped the throughput for. So where is that disk? It's not an attached disk:

[screenshot omitted]

So it's going to be the local NVMe SSD attached to the c5d.xlarge: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#instance-store-volumes

That probably already explains everything - we shouldn't be lumping the local SSD into the RAID0 because in doing so, we won't reliably get a speed-up unless both volumes have been provisioned up, and we can't provision up the local SSD (I think).
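One way to confirm this from the machine itself, as a sketch: the NVMe model string distinguishes the two kinds of devices.

# EBS volumes report the model "Amazon Elastic Block Store";
# the c5d's local disk reports "Amazon EC2 NVMe Instance Storage"
$ sudo nvme list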

nicktrav commented 2 years ago

I modified your roachtest from #81516 to do something like:

c := r.MakeClusterSpec(4, spec.ReuseNone(), spec.AvoidSSD())
c.InstanceType = "c5.xlarge"

Using the c5 without the d on the end ensures we don't get a local SSD instance store in there, and we instead use the EBS vol created due to spec.AvoidSSD() (or PreferLocalSSD = false under the covers).

Confirmed that we only get a single EBS vol on the node, so no RAID in the mix to complicate things.

I've had the roachtest running for an hour or so, and I'm starting to see the effects of maxing out the EBS volume. Here's a look at read, write and combined throughput. The ceiling is 125 MB/s, as expected.

[screenshot: read/write/combined throughput, 2022-06-03 5:26 PM]

I'm not going to dig in too deep at this point. Just wanted to confirm that we can at least work around the mixed SSD / EBS volume issue by updating the instance type.

tbg commented 2 years ago

Closing this issue since we understand what's going on, and since https://github.com/cockroachdb/cockroach/issues/82423 tracks fixing this footgun.

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/replication

tbg commented 2 years ago

Unfortunately this still occurs on an EBS-only volume. I have been running a cluster which started out on the 125 MB/s / 3K IOPS volumes and experienced IO overload, as expected. I reprovisioned the volumes to 500 MB/s and 6K IOPS at 7:38am UTC, which is 8 hours ago. The volumes moved to the "optimizing" state (where their performance is somewhere between the old and new provisioned values) and have been in the regular "in use" state for a few hours. However, there is IO overload again at a throughput ceiling of 144 MB/s (matching the old ceiling precisely), and additionally I am unable to squeeze out extra throughput via fio on the affected machines:

fio --name=write_throughput --directory=/mnt/data1/foo --numjobs=8 --size=10G --time_based --runtime=600s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=write --group_reporting=1 --refill_buffers

All this does is starve pebble of throughput. Tools like ioping -y also confirm that the volume is overloaded, with latencies ranging into the double-digit ms range. So these nodes are pushing what the hardware will let them.
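For reference, the reprovisioning and its progress can be driven and tracked roughly like this (the volume ID is a placeholder, not the actual volume from this cluster):

# bump a gp3 volume to 500 MiB/s and 6000 IOPS
$ aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --throughput 500 --iops 6000
# the modification moves through "modifying", "optimizing", and "completed"
$ aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0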

tbg commented 2 years ago

Stopping CRDB and running fio showed something interesting: we were bursting at 500 MB/s (the provisioned throughput), but only for a few seconds:

$ fio --name=write_throughput --directory=/mnt/data1/foo --numjobs=8 --size=10G --time_based --runtime=600s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=write --group_reporting=1 --refill_buffers
write_throughput: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
...
fio-3.16
Starting 8 processes
Jobs: 8 (f=8): [W(8)][3.8%][w=498MiB/s][w=498 IOPS][eta 09m:40s]

then we immediately dropped back to the baseline we also saw CRDB operate at:

Jobs: 8 (f=8): [W(8)][12.9%][w=138MiB/s][w=138 IOPS][eta 08m:45s]
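A sketch of how to watch that drop at the device level, independent of fio's own reporting:

# extended per-device stats in MB/s, refreshed every second; look at the nvme1n1 row
$ iostat -xm 1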
tbg commented 2 years ago

@nicktrav going to hand it back to you. The cluster in question is here. I stopped CRDB on n1; that is where the above commands were run (I have stopped fio now).

The cluster has 17h to live; I'll give it another 24h, but please extend it as necessary after that.

tbg commented 2 years ago

Just tried reads: I seem to be getting 500 MB/s of read throughput on n1, and have been for 7 minutes:

mkdir -p /mnt/data1/foo && fio --name=read_throughput --directory=/mnt/data1/foo --numjobs=8 --size=1G --time_based --runtime=6000s --ioengine=libaio --bs=1M --iodepth=64 --rw=read --group_reporting=1
read_throughput: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
...
fio-3.16
Starting 8 processes
Jobs: 8 (f=8): [R(8)][7.5%][r=502MiB/s][r=501 IOPS][eta 01h:32m:29s]

so I tried the writes again and it seems that we're getting it there now, too:

$ fio --name=write_throughput --directory=/mnt/data1/foo --numjobs=8 --size=10G --time_based --runtime=600s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=write --group_reporting=1 --refill_buffers
write_throughput: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
...
fio-3.16
Starting 8 processes
Jobs: 8 (f=8): [W(8)][100.0%][w=501MiB/s][w=501 IOPS][eta 00m:00s]
write_throughput: (groupid=0, jobs=8): err= 0: pid=221609: Wed Jun  8 19:18:30 2022
  write: IOPS=500, BW=502MiB/s (526MB/s)(294GiB/600051msec); 0 zone resets
    slat (usec): min=34, max=233260, avg=15809.07, stdev=16041.16
    clat (msec): min=12, max=2309, avg=1005.56, stdev=166.45
     lat (msec): min=17, max=2342, avg=1021.37, stdev=168.33
    clat percentiles (msec):
     |  1.00th=[  422],  5.00th=[  793], 10.00th=[  869], 20.00th=[  919],
     | 30.00th=[  953], 40.00th=[  978], 50.00th=[ 1003], 60.00th=[ 1028],
     | 70.00th=[ 1062], 80.00th=[ 1099], 90.00th=[ 1167], 95.00th=[ 1234],
     | 99.00th=[ 1552], 99.50th=[ 1653], 99.90th=[ 1838], 99.95th=[ 1905],
     | 99.99th=[ 2106]
   bw (  KiB/s): min=159744, max=1607680, per=99.83%, avg=512809.62, stdev=14055.63, samples=9600
   iops        : min=  156, max= 1570, avg=500.58, stdev=13.73, samples=9600
  lat (msec)   : 20=0.01%, 50=0.01%, 100=0.01%, 250=0.08%, 500=1.59%
  lat (msec)   : 750=2.55%, 1000=44.49%, 2000=51.42%, >=2000=0.02%
  cpu          : usr=1.06%, sys=0.31%, ctx=416141, majf=0, minf=120
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,300492,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=502MiB/s (526MB/s), 502MiB/s-502MiB/s (526MB/s-526MB/s), io=294GiB (316GB), run=600051-600051msec

Disk stats (read/write):
  nvme1n1: ios=67/1208138, merge=0/55, ticks=730/36004820, in_queue=33359060, util=100.00%

In the meantime, n2 (which wasn't down) is still constrained. Maybe these volumes need some rest before they can crank out the 500 MB/s? Going to stop n2 and restart n1.

tbg commented 2 years ago

No, n2 is only getting 138MiB/s on either reads or writes.

DuskEagle commented 2 years ago

Following up on the throughput limit we were hitting above: the reason is that the test was using a c5.xlarge instance type. This instance type can "burst" to 593.75 MB/s of EBS throughput for 30 minutes at a time, but its baseline throughput is 143.75 MB/s.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html

To achieve a consistent throughput of 500 MB/s, we should rerun the test with a c5.4xlarge or larger machine.
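A quick way to check these per-instance EBS limits without digging through the docs page, as a sketch:

# baseline vs. maximum EBS throughput (MB/s) for candidate instance types
$ aws ec2 describe-instance-types --instance-types c5.xlarge c5.4xlarge \
    --query 'InstanceTypes[].[InstanceType,EbsInfo.EbsOptimizedInfo.BaselineThroughputInMBps,EbsInfo.EbsOptimizedInfo.MaximumThroughputInMBps]' \
    --output table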

nicktrav commented 2 years ago

Closing this one out, now that we understand how EBS works :)