etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Slow ETCD on Local NVME GCP SSD #13053

Closed: zorro786 closed this issue 3 years ago

zorro786 commented 3 years ago

Hi!

I have a 3-node etcd cluster set up on n1-standard-16 machines, each with a dedicated 320 GB local SSD.

I have benchmarked etcd using the provided benchmark tool, with the following results (while there is almost no other load on the cluster):

./benchmark --endpoints=${HOST_1},${HOST_2},${HOST_3} --conns=100 --clients=1000     put --key-size=8 --sequential-keys --total=100000 --val-size=256
...
Summary:
  Total:    4.1823 secs.
  Slowest:  0.2182 secs.
  Fastest:  0.0266 secs.
  Average:  0.0416 secs.
  Stddev:   0.0153 secs.
  Requests/sec: 23910.3658

Both write and read latencies are high compared to the published etcd benchmark numbers.

Average RTT between peers is 0.308 ms. What could be causing these latencies?

I have used fio to check my SSD, and the results there look satisfactory:

sudo fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest

mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.27
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=1591KiB/s][w=708 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=14492: Thu May 27 15:05:43 2021
  write: IOPS=782, BW=1758KiB/s (1800kB/s)(22.0MiB/12816msec); 0 zone resets
    clat (usec): min=3, max=3487, avg=12.47, stdev=49.44
     lat (usec): min=3, max=3488, avg=12.88, stdev=49.44
    clat percentiles (usec):
     |  1.00th=[    5],  5.00th=[    7], 10.00th=[    8], 20.00th=[    9],
     | 30.00th=[   10], 40.00th=[   11], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   14], 90.00th=[   16], 95.00th=[   19],
     | 99.00th=[   29], 99.50th=[   33], 99.90th=[   49], 99.95th=[ 1270],
     | 99.99th=[ 1663]
   bw (  KiB/s): min= 1513, max= 2246, per=100.00%, avg=1762.44, stdev=229.75, samples=25
   iops        : min=  674, max= 1000, avg=784.88, stdev=102.25, samples=25
  lat (usec)   : 4=0.37%, 10=39.58%, 20=56.02%, 50=3.94%, 100=0.02%
  lat (usec)   : 750=0.01%
  lat (msec)   : 2=0.06%, 4=0.01%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=342, max=5939, avg=1262.10, stdev=641.15
    sync percentiles (usec):
     |  1.00th=[  420],  5.00th=[  474], 10.00th=[  506], 20.00th=[  562],
     | 30.00th=[  635], 40.00th=[  766], 50.00th=[ 1401], 60.00th=[ 1565],
     | 70.00th=[ 1729], 80.00th=[ 1893], 90.00th=[ 2057], 95.00th=[ 2212],
     | 99.00th=[ 2474], 99.50th=[ 2606], 99.90th=[ 3359], 99.95th=[ 4686],
     | 99.99th=[ 5342]
  cpu          : usr=0.39%, sys=4.88%, ctx=24495, majf=0, minf=17
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1758KiB/s (1800kB/s), 1758KiB/s-1758KiB/s (1800kB/s-1800kB/s), io=22.0MiB (23.1MB), run=12816-12816msec

Disk stats (read/write):
  sda: ios=0/25593, merge=0/11237, ticks=0/11808, in_queue=156, util=0.53%

I have also disabled automatic write-cache flushing on the ext4 filesystem that the NVMe disk uses and followed all of etcd's tuning recommendations.
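
(For reference, a sketch of the kind of ext4 remount that achieves this; the mount point is an assumption, and it requires a kernel that still accepts the nobarrier option:)

# Disabling write barriers stops ext4 from issuing a cache flush on every
# journal commit; only safe if the device's write cache is power-loss protected.
sudo mount -o remount,nobarrier /var/lib/etcd
# Confirm the active mount options afterwards:
mount | grep /var/lib/etcd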

ptabor commented 3 years ago

Both write and read latencies are high compared to the published etcd benchmark numbers.

What do you consider as the baseline here?

Where (on which VM) do you run the load generator? Do you run all the nodes in a single zone or across the zones of a single region? Have you configured the compact placement policy: https://cloud.google.com/compute/docs/instances/define-instance-placement#compact

Local SSD is ephemeral, so it's a pretty risky setup to run a DB on it. Is this just experimentation or a production deployment? Is it still slow if you benchmark a single-node etcd on the NVMe disk?
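
(A minimal single-node test could look roughly like this; the data directory and addresses are placeholders, not taken from this setup:)

# Start a throwaway single-member etcd directly on the local NVMe disk
etcd --name bench-test --data-dir /mnt/disks/nvme/etcd-bench \
  --listen-client-urls http://127.0.0.1:2379 \
  --advertise-client-urls http://127.0.0.1:2379

# Point the benchmark tool at that single member only
benchmark --endpoints=127.0.0.1:2379 --conns=100 --clients=1000 \
  put --key-size=8 --sequential-keys --total=100000 --val-size=256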

zorro786 commented 3 years ago

Both write and read latencies are high compared to the published etcd benchmark numbers.

What do you consider as the baseline here?

As given here https://etcd.io/docs/v3.4/op-guide/performance/:

Number of keys | Key size in bytes | Value size in bytes | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | Average server RSS
100,000 | 8 | 256 | 100 | 1000 | all members | 50,104 | 20ms | 126MB

Where (on which VM) do you run the load generator?

On one of the etcd nodes.

Do you run all the nodes in a single zone or across the zones of a single region?

All of them are in a single zone. As mentioned, the average RTT between peers is ~0.3 ms.

Have you configured the compact placement policy: https://cloud.google.com/compute/docs/instances/define-instance-placement#compact

No, let me try with this.

Local SSD is ephemeral, so it's a pretty risky setup to run a DB on it. Is this just experimentation or a production deployment?

It is just experimentation; I only tried local SSD after persistent SSD didn't perform well either (its numbers were comparable to local SSD). If we plan for production, we will replicate/sync the local-SSD etcd DB periodically. Note that this will be an external etcd cluster.
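
(A sketch of the kind of periodic sync I have in mind, using etcdctl snapshots; the endpoint, paths, and bucket name are placeholders:)

# Take a point-in-time snapshot of the keyspace from one member
ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 snapshot save /backup/etcd-$(date +%s).db

# Copy it off the ephemeral local SSD, e.g. to a GCS bucket
gsutil cp /backup/etcd-*.db gs://example-etcd-backups/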

Is it still slow if you benchmark a single-node etcd on the NVMe disk?

Yes. I tried targeting just the local node/leader too; the performance is slow across all benchmark types given in the link earlier.

ptabor commented 3 years ago

Please try running the client on a separate node, as it might saturate the CPU and network IOPS of one of the etcd nodes and introduce delays. Are you testing 3.4 or 3.5?

zorro786 commented 3 years ago

Are you testing 3.4 or 3.5?

Testing 3.4. Also, now that I have looked into the compact placement policy: as mentioned in my post, the average RTT between peers is 0.308 ms, so I am not confident that recreating the nodes with the policy configured would improve anything significantly.

Please try running the client on a separate node, as it might saturate the CPU and network IOPS of one of the etcd nodes and introduce delays.

Running the benchmark (100k key writes to all members) from a separate node only improved things a bit:

Summary:
  Total:    3.8738 secs.
  Slowest:  0.2291 secs.
  Fastest:  0.0250 secs.
  Average:  0.0385 secs.
  Stddev:   0.0150 secs.
  Requests/sec: 25814.5489

zorro786 commented 3 years ago

FWIW, I have the following modified config for etcd:

- --heartbeat-interval=300
- --election-timeout=3000
- --quota-backend-bytes=8589934592
- --snapshot-count=100000

ptabor commented 3 years ago

Keep brainstorming:

zorro786 commented 3 years ago
  • Do you start the DB from scratch on each test, or do you continue on a prewarmed DB? (When bbolt grows the data file from 16MB -> 32MB -> 64MB, it takes a global lock for the memory remapping.)

Not from scratch: etcd_mvcc_db_total_size_in_bytes is 1.6 GB and etcd_mvcc_db_total_size_in_use_in_bytes is ~100 MB. A fair amount of scale testing was done on the DB earlier, but when running these benchmarks there is almost no other activity/requests on etcd. Is there a way to confirm whether the increased bbolt file size is the cause?
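
(One way to sanity-check this, as a sketch; the endpoint is a placeholder and assumes the metrics port is reachable without client certs:)

# Disk/backend latency histograms; remapping pain would show up in
# etcd_disk_backend_commit_duration_seconds
curl -s http://127.0.0.1:2379/metrics | \
  grep -E 'etcd_disk_backend_commit_duration_seconds|etcd_disk_wal_fsync_duration_seconds'

# Defragment to shrink the 1.6 GB file back toward the ~100 MB actually in use,
# then rerun the benchmark and compare
ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 defrag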

  • Do you know which resource is being utilized on the tested node?

I'm not sure what you mean here, but I am guessing you are asking which resource is significantly utilized on the client node: there is no such resource. It is a large node (n1-standard-64) with < 2% CPU usage, < 20% memory usage, and minimal I/O/network activity.

ptabor commented 3 years ago

I'm not sure what you mean here, but I am guessing you are asking which resource is significantly utilized on the client node: there is no such resource. It is a large node (n1-standard-64) with < 2% CPU usage, < 20% memory usage, and minimal I/O/network activity.

That data, but for the server node.

zorro786 commented 3 years ago

I'm not sure what you mean here, but I am guessing you are asking which resource is significantly utilized on the client node: there is no such resource. It is a large node (n1-standard-64) with < 2% CPU usage, < 20% memory usage, and minimal I/O/network activity.

That data, but for the server node.

Across all etcd nodes, including the leader:

- CPU usage is < 1%
- Memory usage is < 5%
- I/O activity (from iostat) is minimal when idle and increases momentarily during the benchmark
- Idle network usage is < 1 MB/s and increases up to 5 MB/s during the benchmark

zorro786 commented 3 years ago

@ptabor I did the same benchmark on a fresh etcd cluster (with the client on a different node):

benchmark --endpoints=${HOST_1},${HOST_2},${HOST_3} --conns=100 --clients=1000  \
  put --key-size=8 --sequential-keys --total=100000 --val-size=256

Summary:
  Total:    3.7441 secs.
  Slowest:  0.1961 secs.
  Fastest:  0.0240 secs.
  Average:  0.0372 secs.
  Stddev:   0.0130 secs.
  Requests/sec: 26708.6865

This is the serializable read performance from many clients:

benchmark --endpoints=${HOST_1},${HOST_2},${HOST_3} --conns=100 --clients=1000 \
    range YOUR_KEY --consistency=s --total=100000

Summary:
  Total:    3.2191 secs.
  Slowest:  0.2089 secs.
  Fastest:  0.0228 secs.
  Average:  0.0320 secs.
  Stddev:   0.0133 secs.
  Requests/sec: 31064.1984

Does the TLS connection with certs add any additional latency for the benchmark tool?

ptabor commented 3 years ago

TLS/encryption is usually pretty expensive in terms of computation, especially establishing the connection. The CPU utilization you are posting is unexpectedly low. I am starting to wonder whether a lack of entropy might be a limiting factor here. https://cloud.google.com/compute/docs/instances/enabling-virtio-rng might help.
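
(A quick way to check whether entropy is the bottleneck on the VMs, as a sketch:)

# Available entropy in the kernel pool; values stuck in the low hundreds
# while the benchmark runs would point at entropy starvation
cat /proc/sys/kernel/random/entropy_avail

# Which hardware RNG the guest is using (expect virtio_rng when the feature is on)
cat /sys/class/misc/hw_random/rng_current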

zorro786 commented 3 years ago

TLS/encryption is usually pretty expensive in terms of computation, especially establishing the connection.

Let me try disabling TLS on etcd and rerun the benchmarks. I was under the impression that the created clients reuse the connection that is established initially for sending requests.

The CPU utilization you are posting is unexpectedly low. I am starting to wonder whether a lack of entropy might be a limiting factor here. https://cloud.google.com/compute/docs/instances/enabling-virtio-rng might help.

Just so you know, the < 1% CPU figure is when there is almost no load on etcd and only the benchmark tool is running. When I have 2k+ nodes joined with a workload running, CPU on the etcd nodes hovers around 10-15% (16-core nodes).

Virtio-rng is already enabled on all three nodes.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.