TritonDataCenter / smartos-live


ZFS O_SYNC write performance seems slow on HHHL NVMe #847

Open · noahmehl opened this issue 4 years ago

noahmehl commented 4 years ago

Test System:
  Motherboard: SuperMicro X10DRi-T
  Processors: 2 x Xeon E5-2683 v4 sixteen-core Broadwell, 2.1GHz
  RAM: 16 x Hynix 32GB (1x32GB) 2400MHz PC4-19200 HMA84GR7MFR4N-UH
  NVMe: Intel SSDPEDKE020T701 2TB P4600 series PCIe HHHL
  SmartOS: build 20190815T002608Z
  fio: fio-3.15-48-g27f4 (built from source on both systems, commit 27f436d9f72a9d2d3da3adfdf712757152eab29e)

Here's the FIO write test:

; fio-write.job for fiotest

[global]
; block size in KiB, taken from the BS environment variable at run time
bs=${BS}K
size=1G
numjobs=32
; sync=1 opens the files O_SYNC, so each write must be stable on media before it returns
sync=1
ioengine=sync
thread=1
group_reporting=1

[fio-seq-write]
filename_format=fio.temp.$jobnum
rw=write
stonewall

[fio-rand-write]
filename_format=fio.temp.$jobnum
rw=randwrite
stonewall
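
For reference, this is roughly how the job file is invoked (assuming it is saved as fio-write.job, per the header comment); fio expands ${BS} from the environment, so the block size is passed that way:

# run the 4K pass of the job file (the file name is an assumption from the header comment)
BS=4 fio fio-write.job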

In Linux with XFS, I'm getting 251,000 sequential write IOPS and 252,000 random write IOPS at a 4K block size.

On SmartOS with ZFS, I'm getting 76,000 sequential write IOPS and 56,600 random write IOPS at a 4K block size.

I have tried setting zfs recordsize=4K and retesting, but it doesn't seem to make a difference.
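
For completeness, this is roughly what that recordsize change looks like (assuming the fio files live on the pool's root dataset, nvme); recordsize only applies to newly written files, so the fio temp files have to be recreated after changing it:

# set and confirm a 4K recordsize on the test dataset (dataset name assumed to be the pool root)
zfs set recordsize=4k nvme
zfs get recordsize nvme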

Linux test results:

fio-seq-write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-rand-write: (g=1): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-3.15-48-g27f4
Starting 64 threads

fio-seq-write: (groupid=0, jobs=32): err= 0: pid=51351: Mon Sep  9 21:23:47 2019
  write: IOPS=251k, BW=979MiB/s (1026MB/s)(32.0GiB/33482msec); 0 zone resets
    clat (usec): min=22, max=5009, avg=121.09, stdev=26.15
     lat (usec): min=22, max=5010, avg=121.36, stdev=26.16
    clat percentiles (usec):
     |  1.00th=[   73],  5.00th=[   89], 10.00th=[   98], 20.00th=[  108],
     | 30.00th=[  112], 40.00th=[  116], 50.00th=[  119], 60.00th=[  123],
     | 70.00th=[  128], 80.00th=[  133], 90.00th=[  143], 95.00th=[  155],
     | 99.00th=[  219], 99.50th=[  269], 99.90th=[  347], 99.95th=[  375],
     | 99.99th=[  449]
   bw (  KiB/s): min=472163, max=843784, per=80.00%, avg=801737.00, stdev=1779.80, samples=2063
   iops        : min=118033, max=210934, avg=200422.00, stdev=444.93, samples=2063
  lat (usec)   : 50=0.05%, 100=11.56%, 250=87.73%, 500=0.66%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=2.28%, sys=35.73%, ctx=20521976, majf=0, minf=1417
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
fio-rand-write: (groupid=1, jobs=32): err= 0: pid=51393: Mon Sep  9 21:23:47 2019
  write: IOPS=252k, BW=983MiB/s (1031MB/s)(32.0GiB/33341msec); 0 zone resets
    clat (usec): min=22, max=5130, avg=120.08, stdev=23.78
     lat (usec): min=22, max=5131, avg=120.35, stdev=23.79
    clat percentiles (usec):
     |  1.00th=[   74],  5.00th=[   90], 10.00th=[  100], 20.00th=[  108],
     | 30.00th=[  112], 40.00th=[  115], 50.00th=[  119], 60.00th=[  123],
     | 70.00th=[  127], 80.00th=[  133], 90.00th=[  141], 95.00th=[  151],
     | 99.00th=[  194], 99.50th=[  239], 99.90th=[  326], 99.95th=[  359],
     | 99.99th=[  445]
   bw (  KiB/s): min=578992, max=1072792, per=100.00%, avg=1029135.01, stdev=2395.77, samples=2068
   iops        : min=144748, max=268198, avg=257283.32, stdev=598.94, samples=2068
  lat (usec)   : 50=0.08%, 100=10.11%, 250=89.39%, 500=0.42%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=2.74%, sys=35.60%, ctx=20208669, majf=0, minf=4294
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=979MiB/s (1026MB/s), 979MiB/s-979MiB/s (1026MB/s-1026MB/s), io=32.0GiB (34.4GB), run=33482-33482msec

Run status group 1 (all jobs):
  WRITE: bw=983MiB/s (1031MB/s), 983MiB/s-983MiB/s (1031MB/s-1031MB/s), io=32.0GiB (34.4GB), run=33341-33341msec

Disk stats (read/write):
  nvme0n1: ios=0/17628239, merge=0/1, ticks=0/329694, in_queue=214737, util=41.26%

SmartOS test results:

fio-seq-write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-rand-write: (g=1): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-3.15-48-g27f4-dirty
Starting 64 threads

fio-seq-write: (groupid=0, jobs=32): err= 0: pid=67: Thu Sep 12 21:39:55 2019
  write: IOPS=76.0k, BW=297MiB/s (311MB/s)(32.0GiB/110360msec)
    clat (usec): min=75, max=251424, avg=415.26, stdev=1277.01
     lat (usec): min=75, max=251424, avg=415.48, stdev=1277.02
    clat percentiles (usec):
     |  1.00th=[  116],  5.00th=[  130], 10.00th=[  139], 20.00th=[  157],
     | 30.00th=[  176], 40.00th=[  200], 50.00th=[  237], 60.00th=[  297],
     | 70.00th=[  400], 80.00th=[  570], 90.00th=[  889], 95.00th=[ 1188],
     | 99.00th=[ 1926], 99.50th=[ 2278], 99.90th=[ 3359], 99.95th=[ 4621],
     | 99.99th=[74974]
   bw (  KiB/s): min=50139, max=493100, per=77.93%, avg=236932.47, stdev=3234.11, samples=6966
   iops        : min=12524, max=123262, avg=59221.24, stdev=808.53, samples=6966
  lat (usec)   : 100=0.08%, 250=52.52%, 500=24.04%, 750=9.90%, 1000=5.73%
  lat (msec)   : 2=6.88%, 4=0.79%, 10=0.02%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.02%, 250=0.01%, 500=0.01%
  cpu          : usr=26.64%, sys=1237.33%, ctx=489269663, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
fio-rand-write: (groupid=1, jobs=32): err= 0: pid=99: Thu Sep 12 21:39:55 2019
  write: IOPS=56.6k, BW=221MiB/s (232MB/s)(32.0GiB/148172msec)
    clat (usec): min=74, max=58434, avg=559.27, stdev=778.35
     lat (usec): min=74, max=58434, avg=559.48, stdev=778.37
    clat percentiles (usec):
     |  1.00th=[  112],  5.00th=[  127], 10.00th=[  139], 20.00th=[  165],
     | 30.00th=[  204], 40.00th=[  262], 50.00th=[  355], 60.00th=[  482],
     | 70.00th=[  635], 80.00th=[  840], 90.00th=[ 1221], 95.00th=[ 1614],
     | 99.00th=[ 2540], 99.50th=[ 2966], 99.90th=[ 4228], 99.95th=[ 6128],
     | 99.99th=[31589]
   bw (  KiB/s): min=125680, max=483164, per=100.00%, avg=226934.04, stdev=2249.08, samples=9421
   iops        : min=31420, max=120780, avg=56727.54, stdev=562.28, samples=9421
  lat (usec)   : 100=0.14%, 250=38.24%, 500=22.92%, 750=14.81%, 1000=9.05%
  lat (msec)   : 2=12.36%, 4=2.36%, 10=0.08%, 20=0.01%, 50=0.03%
  lat (msec)   : 100=0.01%
  cpu          : usr=26.59%, sys=837.84%, ctx=489524750, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=297MiB/s (311MB/s), 297MiB/s-297MiB/s (311MB/s-311MB/s), io=32.0GiB (34.4GB), run=110360-110360msec

Run status group 1 (all jobs):
  WRITE: bw=221MiB/s (232MB/s), 221MiB/s-221MiB/s (232MB/s-232MB/s), io=32.0GiB (34.4GB), run=148172-148172msec

NVMe and zpool:

[root@smartos /nvme]# nvmeadm list
nvme0: model: INTEL SSDPEDKE020T7, serial: AAAAAAAAAAAAAAA, FW rev: QDV101D1, NVMe v1.2
  nvme1/5cd2e49575500100 (c2t5CD2E49575500100d0): Size = 1907729 MB, Capacity = 1907729 MB, Used = 1907729 MB

[root@smartos /nvme]# zpool status -v nvme
  pool: nvme
 state: ONLINE
  scan: none requested
config:

    NAME                     STATE     READ WRITE CKSUM
    nvme                     ONLINE       0     0     0
      c2t5CD2E49575500100d0  ONLINE       0     0     0
noahmehl commented 4 years ago

@rmustacc very astutely suggested I run the test against the raw device. The IOPS are much better: I'm getting 205,000 sequential write IOPS and 199,000 random write IOPS at a 4K block size.
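
For anyone reproducing this, a quick sketch of double-checking the raw (character) device path before pointing fio at it; the cXtYdZ name comes from nvmeadm list / zpool status above, and diskinfo is another way to list it. Note that writing to the raw device clobbers any pool on it, so the nvme pool has to be recreated afterwards.

# list disks and confirm the character-device node exists (device name taken from the output above)
diskinfo
ls -l /dev/rdsk/c2t5CD2E49575500100d0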

Here is the test definition:

[global]
bs=${BS}K
size=1G
numjobs=32
sync=1
ioengine=sync
thread=1
group_reporting=1
filename=/dev/rdsk/c2t5CD2E49575500100d0

[fio-seq-write]
rw=write
stonewall

[fio-rand-write]
rw=randwrite
stonewall

Here are the results:

fio-seq-write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-rand-write: (g=1): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-3.15-48-g27f4-dirty
Starting 64 threads
fio-seq-write: (groupid=0, jobs=32): err= 0: pid=67: Fri Sep 13 20:35:21 2019
  write: IOPS=205k, BW=799MiB/s (838MB/s)(32.0GiB/41017msec)
    clat (usec): min=14, max=19764, avg=148.46, stdev=374.73
     lat (usec): min=14, max=19765, avg=148.58, stdev=374.78
    clat percentiles (usec):
     |  1.00th=[   20],  5.00th=[   22], 10.00th=[   24], 20.00th=[   25],
     | 30.00th=[   27], 40.00th=[   30], 50.00th=[   32], 60.00th=[   38],
     | 70.00th=[   49], 80.00th=[   93], 90.00th=[  388], 95.00th=[  783],
     | 99.00th=[ 1860], 99.50th=[ 2409], 99.90th=[ 3884], 99.95th=[ 4555],
     | 99.99th=[ 6259]
   bw (  KiB/s): min=519339, max=1228894, per=82.78%, avg=677161.04, stdev=4521.37, samples=2485
   iops        : min=129824, max=307211, avg=169278.21, stdev=1130.35, samples=2485
  lat (usec)   : 20=1.10%, 50=69.75%, 100=9.83%, 250=6.44%, 500=4.72%
  lat (usec)   : 750=2.89%, 1000=1.79%
  lat (msec)   : 2=2.63%, 4=0.75%, 10=0.09%, 20=0.01%
  cpu          : usr=55.01%, sys=194.98%, ctx=263237911, majf=128, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
fio-rand-write: (groupid=1, jobs=32): err= 0: pid=99: Fri Sep 13 20:35:21 2019
  write: IOPS=199k, BW=776MiB/s (814MB/s)(32.0GiB/42223msec)
    clat (usec): min=14, max=19681, avg=151.36, stdev=390.11
     lat (usec): min=14, max=19681, avg=151.50, stdev=390.16
    clat percentiles (usec):
     |  1.00th=[   20],  5.00th=[   22], 10.00th=[   23], 20.00th=[   25],
     | 30.00th=[   27], 40.00th=[   29], 50.00th=[   32], 60.00th=[   37],
     | 70.00th=[   47], 80.00th=[   88], 90.00th=[  388], 95.00th=[  807],
     | 99.00th=[ 1975], 99.50th=[ 2540], 99.90th=[ 4015], 99.95th=[ 4686],
     | 99.99th=[ 6194]
   bw (  KiB/s): min=626491, max=1510037, per=100.00%, avg=825742.46, stdev=5658.04, samples=2579
   iops        : min=156620, max=377499, avg=206431.59, stdev=1414.45, samples=2579
  lat (usec)   : 20=1.09%, 50=70.68%, 100=9.43%, 250=6.12%, 500=4.42%
  lat (usec)   : 750=2.79%, 1000=1.77%
  lat (msec)   : 2=2.75%, 4=0.87%, 10=0.10%, 20=0.01%
  cpu          : usr=76.36%, sys=198.03%, ctx=264163850, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=799MiB/s (838MB/s), 799MiB/s-799MiB/s (838MB/s-838MB/s), io=32.0GiB (34.4GB), run=41017-41017msec

Run status group 1 (all jobs):
  WRITE: bw=776MiB/s (814MB/s), 776MiB/s-776MiB/s (814MB/s-814MB/s), io=32.0GiB (34.4GB), run=42223-42223msec

To compare, I ran the same raw-device test in Linux. I'm getting 358,000 sequential write IOPS and 374,000 random write IOPS at a 4K block size.

Here's the test definition:

[global]
bs=${BS}K
size=1G
numjobs=32
sync=1
ioengine=sync
thread=1
group_reporting=1
filename=/dev/nvme0n1

[fio-seq-write]
rw=write
stonewall

[fio-rand-write]
rw=randwrite
stonewall

Here are the results from Linux against the raw NVMe device:

fio-seq-write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-rand-write: (g=1): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-3.15-48-g27f4
Starting 64 threads
fio-seq-write: (groupid=0, jobs=32): err= 0: pid=22639: Fri Sep 13 21:08:43 2019
  write: IOPS=358k, BW=1400MiB/s (1468MB/s)(32.0GiB/23409msec); 0 zone resets
    clat (usec): min=12, max=4630, avg=88.46, stdev=62.97
     lat (usec): min=12, max=4630, avg=88.57, stdev=62.98
    clat percentiles (usec):
     |  1.00th=[   19],  5.00th=[   29], 10.00th=[   39], 20.00th=[   54],
     | 30.00th=[   65], 40.00th=[   72], 50.00th=[   79], 60.00th=[   91],
     | 70.00th=[  106], 80.00th=[  117], 90.00th=[  133], 95.00th=[  155],
     | 99.00th=[  255], 99.50th=[  359], 99.90th=[  775], 99.95th=[ 1037],
     | 99.99th=[ 1975]
   bw (  MiB/s): min=  850, max= 1470, per=79.24%, avg=1109.21, stdev= 2.50, samples=1472
   iops        : min=217663, max=376546, avg=283944.43, stdev=639.26, samples=1472
  lat (usec)   : 20=1.60%, 50=15.84%, 100=47.67%, 250=33.83%, 500=0.80%
  lat (usec)   : 750=0.15%, 1000=0.05%
  lat (msec)   : 2=0.04%, 4=0.01%, 10=0.01%
  cpu          : usr=2.02%, sys=36.47%, ctx=11574602, majf=0, minf=1355
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
fio-rand-write: (groupid=1, jobs=32): err= 0: pid=22677: Fri Sep 13 21:08:43 2019
  write: IOPS=374k, BW=1460MiB/s (1531MB/s)(32.0GiB/22445msec); 0 zone resets
    clat (usec): min=14, max=2773, avg=83.07, stdev=106.66
     lat (usec): min=14, max=2773, avg=83.22, stdev=106.67
    clat percentiles (usec):
     |  1.00th=[   21],  5.00th=[   36], 10.00th=[   48], 20.00th=[   57],
     | 30.00th=[   60], 40.00th=[   63], 50.00th=[   65], 60.00th=[   68],
     | 70.00th=[   71], 80.00th=[   76], 90.00th=[   95], 95.00th=[  161],
     | 99.00th=[  644], 99.50th=[  914], 99.90th=[ 1270], 99.95th=[ 1385],
     | 99.99th=[ 1778]
   bw (  MiB/s): min= 1079, max= 1536, per=100.00%, avg=1462.05, stdev= 2.28, samples=1407
   iops        : min=276346, max=393363, avg=374283.51, stdev=582.44, samples=1407
  lat (usec)   : 20=0.77%, 50=10.89%, 100=79.20%, 250=6.23%, 500=1.49%
  lat (usec)   : 750=0.63%, 1000=0.40%
  lat (msec)   : 2=0.37%, 4=0.01%
  cpu          : usr=3.05%, sys=39.15%, ctx=8390566, majf=0, minf=2524
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1400MiB/s (1468MB/s), 1400MiB/s-1400MiB/s (1468MB/s-1468MB/s), io=32.0GiB (34.4GB), run=23409-23409msec

Run status group 1 (all jobs):
  WRITE: bw=1460MiB/s (1531MB/s), 1460MiB/s-1460MiB/s (1531MB/s-1531MB/s), io=32.0GiB (34.4GB), run=22445-22445msec

Disk stats (read/write):
  nvme0n1: ios=177/15784954, merge=0/0, ticks=19/800122, in_queue=0, util=99.39%
noahmehl commented 4 years ago

For comparison's sake, I ran the same filesystem test on Linux with ZFS on Linux (ZoL):

Distribution: Ubuntu 18.04
Kernel: 5.0.0-23-generic #24~18.04.1-Ubuntu
ZoL: 0.7.5-1ubuntu16.6

It turns out that SmartOS ZFS performance exceeds ZoL performance by quite a bit. On ZoL I'm getting 40,800 sequential write IOPS and 12,700 random write IOPS at a 4K block size.

Here's the test definition:

[global]
bs=${BS}K
size=1G
numjobs=32
sync=1
ioengine=sync
thread=1
group_reporting=1

[fio-seq-write]
filename_format=fio.temp.$jobnum
rw=write
stonewall

[fio-rand-write]
filename_format=fio.temp.$jobnum
rw=randwrite
stonewall

Here are the results:

fio-seq-write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-rand-write: (g=1): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
...
fio-3.15-48-g27f4
Starting 64 threads
[w=32.7k IOPS][eta 00m:00s]
fio-seq-write: (groupid=0, jobs=32): err= 0: pid=1403: Fri Sep 13 21:28:48 2019
  write: IOPS=40.8k, BW=159MiB/s (167MB/s)(32.0GiB/205709msec); 0 zone resets
    clat (usec): min=97, max=106798, avg=780.19, stdev=355.19
     lat (usec): min=97, max=106798, avg=780.72, stdev=355.19
    clat percentiles (usec):
     |  1.00th=[  482],  5.00th=[  553], 10.00th=[  594], 20.00th=[  644],
     | 30.00th=[  676], 40.00th=[  701], 50.00th=[  725], 60.00th=[  750],
     | 70.00th=[  783], 80.00th=[  824], 90.00th=[  906], 95.00th=[ 1074],
     | 99.00th=[ 2343], 99.50th=[ 3392], 99.90th=[ 4686], 99.95th=[ 5014],
     | 99.99th=[ 5866]
   bw (  KiB/s): min=123077, max=227353, per=86.74%, avg=141486.09, stdev=224.40, samples=13145
   iops        : min=30760, max=56829, avg=35359.38, stdev=56.10, samples=13145
  lat (usec)   : 100=0.01%, 250=0.01%, 500=1.51%, 750=57.20%, 1000=34.82%
  lat (msec)   : 2=5.10%, 4=1.08%, 10=0.28%, 250=0.01%
  cpu          : usr=1.01%, sys=22.20%, ctx=9487792, majf=0, minf=5118
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
fio-rand-write: (groupid=1, jobs=32): err= 0: pid=16729: Fri Sep 13 21:28:48 2019
  write: IOPS=12.7k, BW=49.5MiB/s (51.9MB/s)(32.0GiB/662003msec); 0 zone resets
    clat (usec): min=152, max=110718, avg=2515.68, stdev=1205.13
     lat (usec): min=152, max=110719, avg=2516.42, stdev=1205.11
    clat percentiles (usec):
     |  1.00th=[  611],  5.00th=[  996], 10.00th=[ 1287], 20.00th=[ 1598],
     | 30.00th=[ 1844], 40.00th=[ 2073], 50.00th=[ 2311], 60.00th=[ 2573],
     | 70.00th=[ 2900], 80.00th=[ 3294], 90.00th=[ 4047], 95.00th=[ 4817],
     | 99.00th=[ 6128], 99.50th=[ 6587], 99.90th=[ 8356], 99.95th=[ 9372],
     | 99.99th=[13566]
   bw (  KiB/s): min=34621, max=133080, per=99.90%, avg=50637.13, stdev=237.27, samples=42337
   iops        : min= 8655, max=33264, avg=12659.08, stdev=59.32, samples=42337
  lat (usec)   : 250=0.01%, 500=0.26%, 750=2.20%, 1000=2.56%
  lat (msec)   : 2=32.12%, 4=52.60%, 10=10.23%, 20=0.03%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=0.50%, sys=10.65%, ctx=13178677, majf=0, minf=9297
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=32.0GiB (34.4GB), run=205709-205709msec

Run status group 1 (all jobs):
  WRITE: bw=49.5MiB/s (51.9MB/s), 49.5MiB/s-49.5MiB/s (51.9MB/s-51.9MB/s), io=32.0GiB (34.4GB), run=662003-662003msec
noahmehl commented 4 years ago

The zdb output from the zpool in SmartOS was requested on IRC:
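
The exact zdb invocation wasn't recorded in the thread; output of this shape (cached/MOS config, uberblock, metaslabs, datasets, leak/checksum traversal, and pool history) is roughly what a verbose dump like the following produces:

# -C config, -u uberblock, -m metaslabs, -b block stats, -c verify checksums, -d datasets, -h history
zdb -C -u -m -b -c -d -h nvme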

Cached configuration:
        version: 5000
        name: 'nvme'
        state: 0
        txg: 14
        pool_guid: 15296427983412874009
        errata: 0
        hostname: 'libzpool'
        com.delphix:has_per_vdev_zaps
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 15296427983412874009
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 10251368872960486760
                path: '/dev/dsk/c2t5CD2E49575500100d0s0'
                devid: 'id1,kdev@w5cd2e49575500100/a'
                phys_path: '/pci@0,0/pci8086,6f08@3/pci8086,4714@0/blkdev@w5CD2E49575500100,0:a'
                whole_disk: 1
                metaslab_array: 256
                metaslab_shift: 34
                ashift: 12
                asize: 2000385474560
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_leaf: 129
                com.delphix:vdev_zap_top: 130
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data

MOS Configuration:
        version: 5000
        name: 'nvme'
        state: 0
        txg: 14
        pool_guid: 15296427983412874009
        errata: 0
        hostid: 1233894251
        hostname: 'smartos'
        com.delphix:has_per_vdev_zaps
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 15296427983412874009
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 10251368872960486760
                path: '/dev/dsk/c2t5CD2E49575500100d0s0'
                devid: 'id1,kdev@w5cd2e49575500100/a'
                phys_path: '/pci@0,0/pci8086,6f08@3/pci8086,4714@0/blkdev@w5CD2E49575500100,0:a'
                whole_disk: 1
                metaslab_array: 256
                metaslab_shift: 34
                ashift: 12
                asize: 2000385474560
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_leaf: 129
                com.delphix:vdev_zap_top: 130
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data

Uberblock:
    magic = 0000000000bab10c
    version = 5000
    txg = 17
    guid_sum = 7101052782663809153
    timestamp = 1568411417 UTC = Fri Sep 13 21:50:17 2019
    mmp_magic = 00000000a11cea11
    mmp_delay = 0
    checkpoint_txg = 0

All DDTs are empty

Metaslabs:
    vdev          0   
    metaslabs   116   offset                spacemap          free        
    ---------------   -------------------   ---------------   ------------
    metaslab      0   offset            0   spacemap    265   free    16.0G
    On-disk histogram:      fragmentation 0
             12:      1 *
             13:      1 *
             14:      0 
             15:      1 *
             16:      0 
             17:      0 
             18:      0 
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      1 *
space map object 265:
  smp_length = 0xf0
  smp_alloc = 0x9000
    metaslab      1   offset    400000000   spacemap    264   free    16.0G
    On-disk histogram:      fragmentation 0
             12:      1 *
             13:      1 *
             14:      0 
             15:      1 *
             16:      0 
             17:      0 
             18:      0 
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      1 *
space map object 264:
  smp_length = 0xf0
  smp_alloc = 0x9000
    metaslab      2   offset    800000000   spacemap    263   free    16.0G
    On-disk histogram:      fragmentation 0
             13:      1 *
             14:      1 *
             15:      0 
             16:      0 
             17:      1 *
             18:      0 
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      1 *
space map object 263:
  smp_length = 0x1b8
  smp_alloc = 0x14000
    metaslab      3   offset    c00000000   spacemap    262   free    16.0G
    On-disk histogram:      fragmentation 0
             13:      1 *
             14:      1 *
             15:      0 
             16:      0 
             17:      1 *
             18:      0 
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      1 *
space map object 262:
  smp_length = 0x1b8
  smp_alloc = 0x14000
    metaslab      4   offset   1000000000   spacemap    261   free    16.0G
    On-disk histogram:      fragmentation 0
             13:      1 *
             14:      0 
             15:      0 
             16:      1 *
             17:      0 
             18:      0 
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      1 *
space map object 261:
  smp_length = 0x158
  smp_alloc = 0xb000
    metaslab      5   offset   1400000000   spacemap    260   free    16.0G
    On-disk histogram:      fragmentation 0
             13:      1 *
             14:      0 
             15:      0 
             16:      1 *
             17:      0 
             18:      0 
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      1 *
space map object 260:
  smp_length = 0x158
  smp_alloc = 0xb000
    metaslab      6   offset   1800000000   spacemap    259   free    16.0G
    On-disk histogram:      fragmentation 0
             12:      1 *
             13:      1 *
             14:      0 
             15:      0 
             16:      1 *
             17:      0 
             18:      0 
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      1 *
space map object 259:
  smp_length = 0x198
  smp_alloc = 0xd000
    metaslab      7   offset   1c00000000   spacemap    258   free    16.0G
    On-disk histogram:      fragmentation 0
             12:      1 *
             13:      1 *
             14:      0 
             15:      0 
             16:      1 *
             17:      0 
             18:      0 
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      1 *
space map object 258:
  smp_length = 0x198
  smp_alloc = 0xd000
    metaslab      8   offset   2000000000   spacemap    257   free    16.0G
    On-disk histogram:      fragmentation 0
             12:      2 **
             13:      1 *
             14:      0 
             15:      1 *
             16:      0 
             17:      0 
             18:      1 *
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      1 *
space map object 257:
  smp_length = 0x228
  smp_alloc = 0x29000
    metaslab      9   offset   2400000000   spacemap      0   free      16G
    metaslab     10   offset   2800000000   spacemap      0   free      16G
    metaslab     11   offset   2c00000000   spacemap      0   free      16G
    metaslab     12   offset   3000000000   spacemap      0   free      16G
    metaslab     13   offset   3400000000   spacemap      0   free      16G
    metaslab     14   offset   3800000000   spacemap      0   free      16G
    metaslab     15   offset   3c00000000   spacemap      0   free      16G
    metaslab     16   offset   4000000000   spacemap      0   free      16G
    metaslab     17   offset   4400000000   spacemap      0   free      16G
    metaslab     18   offset   4800000000   spacemap      0   free      16G
    metaslab     19   offset   4c00000000   spacemap      0   free      16G
    metaslab     20   offset   5000000000   spacemap      0   free      16G
    metaslab     21   offset   5400000000   spacemap      0   free      16G
    metaslab     22   offset   5800000000   spacemap      0   free      16G
    metaslab     23   offset   5c00000000   spacemap      0   free      16G
    metaslab     24   offset   6000000000   spacemap      0   free      16G
    metaslab     25   offset   6400000000   spacemap      0   free      16G
    metaslab     26   offset   6800000000   spacemap      0   free      16G
    metaslab     27   offset   6c00000000   spacemap      0   free      16G
    metaslab     28   offset   7000000000   spacemap      0   free      16G
    metaslab     29   offset   7400000000   spacemap      0   free      16G
    metaslab     30   offset   7800000000   spacemap      0   free      16G
    metaslab     31   offset   7c00000000   spacemap      0   free      16G
    metaslab     32   offset   8000000000   spacemap      0   free      16G
    metaslab     33   offset   8400000000   spacemap      0   free      16G
    metaslab     34   offset   8800000000   spacemap      0   free      16G
    metaslab     35   offset   8c00000000   spacemap      0   free      16G
    metaslab     36   offset   9000000000   spacemap      0   free      16G
    metaslab     37   offset   9400000000   spacemap      0   free      16G
    metaslab     38   offset   9800000000   spacemap      0   free      16G
    metaslab     39   offset   9c00000000   spacemap      0   free      16G
    metaslab     40   offset   a000000000   spacemap      0   free      16G
    metaslab     41   offset   a400000000   spacemap      0   free      16G
    metaslab     42   offset   a800000000   spacemap      0   free      16G
    metaslab     43   offset   ac00000000   spacemap      0   free      16G
    metaslab     44   offset   b000000000   spacemap      0   free      16G
    metaslab     45   offset   b400000000   spacemap      0   free      16G
    metaslab     46   offset   b800000000   spacemap      0   free      16G
    metaslab     47   offset   bc00000000   spacemap      0   free      16G
    metaslab     48   offset   c000000000   spacemap      0   free      16G
    metaslab     49   offset   c400000000   spacemap      0   free      16G
    metaslab     50   offset   c800000000   spacemap      0   free      16G
    metaslab     51   offset   cc00000000   spacemap      0   free      16G
    metaslab     52   offset   d000000000   spacemap      0   free      16G
    metaslab     53   offset   d400000000   spacemap      0   free      16G
    metaslab     54   offset   d800000000   spacemap      0   free      16G
    metaslab     55   offset   dc00000000   spacemap      0   free      16G
    metaslab     56   offset   e000000000   spacemap      0   free      16G
    metaslab     57   offset   e400000000   spacemap      0   free      16G
    metaslab     58   offset   e800000000   spacemap      0   free      16G
    metaslab     59   offset   ec00000000   spacemap      0   free      16G
    metaslab     60   offset   f000000000   spacemap      0   free      16G
    metaslab     61   offset   f400000000   spacemap      0   free      16G
    metaslab     62   offset   f800000000   spacemap      0   free      16G
    metaslab     63   offset   fc00000000   spacemap      0   free      16G
    metaslab     64   offset  10000000000   spacemap      0   free      16G
    metaslab     65   offset  10400000000   spacemap      0   free      16G
    metaslab     66   offset  10800000000   spacemap      0   free      16G
    metaslab     67   offset  10c00000000   spacemap      0   free      16G
    metaslab     68   offset  11000000000   spacemap      0   free      16G
    metaslab     69   offset  11400000000   spacemap      0   free      16G
    metaslab     70   offset  11800000000   spacemap      0   free      16G
    metaslab     71   offset  11c00000000   spacemap      0   free      16G
    metaslab     72   offset  12000000000   spacemap      0   free      16G
    metaslab     73   offset  12400000000   spacemap      0   free      16G
    metaslab     74   offset  12800000000   spacemap      0   free      16G
    metaslab     75   offset  12c00000000   spacemap      0   free      16G
    metaslab     76   offset  13000000000   spacemap      0   free      16G
    metaslab     77   offset  13400000000   spacemap      0   free      16G
    metaslab     78   offset  13800000000   spacemap      0   free      16G
    metaslab     79   offset  13c00000000   spacemap      0   free      16G
    metaslab     80   offset  14000000000   spacemap      0   free      16G
    metaslab     81   offset  14400000000   spacemap      0   free      16G
    metaslab     82   offset  14800000000   spacemap      0   free      16G
    metaslab     83   offset  14c00000000   spacemap      0   free      16G
    metaslab     84   offset  15000000000   spacemap      0   free      16G
    metaslab     85   offset  15400000000   spacemap      0   free      16G
    metaslab     86   offset  15800000000   spacemap      0   free      16G
    metaslab     87   offset  15c00000000   spacemap      0   free      16G
    metaslab     88   offset  16000000000   spacemap      0   free      16G
    metaslab     89   offset  16400000000   spacemap      0   free      16G
    metaslab     90   offset  16800000000   spacemap      0   free      16G
    metaslab     91   offset  16c00000000   spacemap      0   free      16G
    metaslab     92   offset  17000000000   spacemap      0   free      16G
    metaslab     93   offset  17400000000   spacemap      0   free      16G
    metaslab     94   offset  17800000000   spacemap      0   free      16G
    metaslab     95   offset  17c00000000   spacemap      0   free      16G
    metaslab     96   offset  18000000000   spacemap      0   free      16G
    metaslab     97   offset  18400000000   spacemap      0   free      16G
    metaslab     98   offset  18800000000   spacemap      0   free      16G
    metaslab     99   offset  18c00000000   spacemap      0   free      16G
    metaslab    100   offset  19000000000   spacemap      0   free      16G
    metaslab    101   offset  19400000000   spacemap      0   free      16G
    metaslab    102   offset  19800000000   spacemap      0   free      16G
    metaslab    103   offset  19c00000000   spacemap      0   free      16G
    metaslab    104   offset  1a000000000   spacemap      0   free      16G
    metaslab    105   offset  1a400000000   spacemap      0   free      16G
    metaslab    106   offset  1a800000000   spacemap      0   free      16G
    metaslab    107   offset  1ac00000000   spacemap      0   free      16G
    metaslab    108   offset  1b000000000   spacemap      0   free      16G
    metaslab    109   offset  1b400000000   spacemap      0   free      16G
    metaslab    110   offset  1b800000000   spacemap      0   free      16G
    metaslab    111   offset  1bc00000000   spacemap      0   free      16G
    metaslab    112   offset  1c000000000   spacemap      0   free      16G
    metaslab    113   offset  1c400000000   spacemap      0   free      16G
    metaslab    114   offset  1c800000000   spacemap      0   free      16G
    metaslab    115   offset  1cc00000000   spacemap      0   free      16G

    vdev          0     metaslabs  116      fragmentation  0%
             12:      6 ******
             13:      9 *********
             14:      2 **
             15:      3 ***
             16:      4 ****
             17:      2 **
             18:      1 *
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      9 *********
    pool nvme   fragmentation     0%
             12:      6 ******
             13:      9 *********
             14:      2 **
             15:      3 ***
             16:      4 ****
             17:      2 **
             18:      1 *
             19:      0 
             20:      0 
             21:      0 
             22:      0 
             23:      0 
             24:      0 
             25:      0 
             26:      0 
             27:      0 
             28:      0 
             29:      0 
             30:      0 
             31:      0 
             32:      0 
             33:      9 *********
Dataset mos [META], ID 0, cr_txg 4, 312K, 46 objects (inconsistent)

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    2   128K    16K    60K    512   144K   15.97  DMU dnode
         1    1   128K    16K    24K    512    32K  100.00  object directory
        32    1   128K    512      0    512    512    0.00  DSL directory
        33    1   128K    512      0    512    512  100.00  DSL props
        34    1   128K    512      0    512    512  100.00  DSL directory child map
        35    1   128K    512      0    512    512    0.00  DSL directory
        36    1   128K    512      0    512    512  100.00  DSL props
        37    1   128K    512      0    512    512  100.00  DSL directory child map
        38    1   128K    512      0    512    512    0.00  DSL directory
        39    1   128K    512      0    512    512  100.00  DSL props
        40    1   128K    512      0    512    512  100.00  DSL directory child map
        41    1   128K   128K      0    512   128K    0.00  bpobj
        42    1   128K    512      0    512    512    0.00  DSL directory
        43    1   128K    512      0    512    512  100.00  DSL props
        44    1   128K    512      0    512    512  100.00  DSL directory child map
        45    1   128K    512      0    512    512    0.00  DSL dataset
        46    1   128K    512      0    512    512  100.00  DSL dataset snap map
        47    1   128K    512      0    512    512  100.00  DSL deadlist map
        48    1   128K    512      0    512    512    0.00  DSL dataset
        49    1   128K    512      0    512    512  100.00  DSL deadlist map
        50    1   128K   128K      0    512   128K    0.00  bpobj
        51    1   128K     1K    12K    512     1K  100.00  zap
        52    1   128K     1K    12K    512     1K  100.00  zap
        53    1   128K    16K    36K    512    32K  100.00  zap
        54    1   128K    512      0    512    512    0.00  DSL dataset
        55    1   128K    512      0    512    512  100.00  DSL dataset snap map
        56    1   128K    512      0    512    512  100.00  DSL deadlist map
        57    1   128K   128K      0    512   128K    0.00  bpobj
        58    1   128K    512      0    512    512  100.00  DSL dataset next clones
        59    1   128K    512      0    512    512  100.00  DSL dir clones
        60    1   128K    16K    12K    512    16K  100.00  packed nvlist
        61    1   128K    16K    12K    512    16K  100.00  bpobj (Z=uncompressed)
        62    1   128K   128K    12K    512   128K  100.00  SPA history
        63    1   128K  1.50K    12K    512  1.50K  100.00  zap
       128    1   128K    512      0    512    512  100.00  zap
       129    1   128K    512      0    512    512  100.00  zap
       130    1   128K    512      0    512    512  100.00  zap
       256    1   128K    512      0    512    512  100.00  object array
       257    1   128K     4K    12K    512     4K  100.00  SPA space map
       258    1   128K     4K    12K    512     4K  100.00  SPA space map
       259    1   128K     4K    12K    512     4K  100.00  SPA space map
       260    1   128K     4K    12K    512     4K  100.00  SPA space map
       261    1   128K     4K    12K    512     4K  100.00  SPA space map
       262    1   128K     4K    12K    512     4K  100.00  SPA space map
       263    1   128K     4K    12K    512     4K  100.00  SPA space map
       264    1   128K     4K    12K    512     4K  100.00  SPA space map
       265    1   128K     4K    12K    512     4K  100.00  SPA space map

    Dnode slots:
    Total used:            46
    Max used:             265
    Percent empty:  82.641509

Dataset nvme [ZPL], ID 54, cr_txg 1, 96K, 7 objects

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K    56K    512    32K   10.94  DMU dnode
18446744073709551615    1   128K    512      0    512    512  100.00  ZFS user/group used
18446744073709551614    1   128K    512      0    512    512  100.00  ZFS user/group used
         1    1   128K     1K     8K    512     1K  100.00  ZFS master node
        32    1   128K    512      0    512    512  100.00  SA master node
        33    1   128K    512      0    512    512  100.00  ZFS delete queue
        34    1   128K    512      0    512    512  100.00  ZFS directory
        35    1   128K  1.50K     8K    512  1.50K  100.00  SA attr registration
        36    1   128K    16K    16K    512    32K  100.00  SA attr layouts
        37    1   128K    512      0    512    512  100.00  ZFS directory

    Dnode slots:
    Total used:             7
    Max used:              37
    Percent empty:  81.081081

Verified large_blocks feature refcount of 0 is correct
Verified large_dnode feature refcount of 0 is correct
Verified sha512 feature refcount of 0 is correct
Verified skein feature refcount of 0 is correct
Verified edonr feature refcount of 0 is correct
Verified encryption feature refcount of 0 is correct
Verified bookmark_v2 feature refcount of 0 is correct
Verified device_removal feature refcount of 0 is correct
Verified indirect_refcount feature refcount of 0 is correct

Traversing all blocks to verify checksums and verify nothing leaked ...

loading concrete vdev 0, metaslab 115 of 116 ...

    No leaks (block sum matches space maps exactly)

    bp count:                    77
    ganged count:                 0
    bp logical:             1556992      avg:  20220
    bp physical:             196608      avg:   2553     compression:   7.92
    bp allocated:            602112      avg:   7819     compression:   2.59
    bp deduped:                   0    ref>1:      0   deduplication:   1.00
    Normal class:            602112     used:  0.00%

    additional, non-pointer bps of type 0:         25
    Dittoed blocks on same vdev: 52

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     24K     12K    4.00     4.08  object directory
     -      -       -       -       -       -        -  object array
     1    16K      4K     12K     12K    4.00     2.04  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     1    16K      4K     12K     12K    4.00     2.04  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
     9    36K     36K    108K     12K    1.00    18.37  SPA space map
     -      -       -       -       -       -        -  ZIL intent log
    12   864K     48K    116K   9.67K   18.00    19.73  DMU dnode
     2     4K      4K     20K     10K    1.00     3.40  DMU objset
     -      -       -       -       -       -        -  DSL directory
     -      -       -       -       -       -        -  DSL directory child map
     -      -       -       -       -       -        -  DSL dataset snap map
     -      -       -       -       -       -        -  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
     -      -       -       -       -       -        -  ZFS plain file
     -      -       -       -       -       -        -  ZFS directory
     1     1K      1K      8K      8K    1.00     1.36  ZFS master node
     -      -       -       -       -       -        -  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     1   128K      4K     12K     12K   32.00     2.04  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     -      -       -       -       -       -        -  DSL dataset next clones
     -      -       -       -       -       -        -  scan work queue
     -      -       -       -       -       -        -  ZFS user/group used
     -      -       -       -       -       -        -  ZFS user/group quota
     -      -       -       -       -       -        -  snapshot refcount tags
     -      -       -       -       -       -        -  DDT ZAP algorithm
     -      -       -       -       -       -        -  DDT statistics
     -      -       -       -       -       -        -  System attributes
     -      -       -       -       -       -        -  SA master node
     1  1.50K   1.50K      8K      8K    1.00     1.36  SA attr registration
     2    32K      8K     16K      8K    4.00     2.72  SA attr layouts
     -      -       -       -       -       -        -  scan translations
     -      -       -       -       -       -        -  deduplicated block
     -      -       -       -       -       -        -  DSL deadlist map
     -      -       -       -       -       -        -  DSL deadlist map hdr
     -      -       -       -       -       -        -  DSL dir clones
     -      -       -       -       -       -        -  bpobj subobj
    15   342K     58K    180K     12K    5.90    30.61  deferred free
     -      -       -       -       -       -        -  dedup ditto
     8    37K   15.5K     72K      9K    2.39    12.24  other
    77  1.48M    192K    588K   7.64K    7.92   100.00  Total

                            capacity   operations   bandwidth  ---- errors ----
description                used avail  read write  read write  read write cksum
nvme                       588K 1.81T   170     0 1.23M     0     0     0     0
  /dev/dsk/c2t5CD2E49575500100d0s0           588K 1.81T   170     0 1.23M     0     0     0     0

History:
unrecognized record:
  history internal str: 'pool version 5000; software version 5000/5; uts smartos 5.11 joyent_20190815T002608Z i86pc'
  internal_name: 'create'
  history txg: 5
  history time: 1568411412
  history hostname: 'smartos'
2019-09-13.21:50:12 zpool create nvme c2t5CD2E49575500100d0
  history command: 'zpool create nvme c2t5CD2E49575500100d0'
  history who: 0
  history time: 1568411412
  history hostname: 'smartos'
unrecognized record:
  ioctl: 'vdev_set_state pool: nvme cookie: 7 guid: 8e442dad064e6d68'
  history who: 0
  history time: 1568411412
  history hostname: 'smartos'
unrecognized record:
  ioctl: 'vdev_set_state pool: nvme cookie: 7 guid: 8e442dad064e6d68'
  history who: 0
  history time: 1568411412
  history hostname: 'smartos'
unrecognized record:
  ioctl: 'vdev_set_state pool: nvme cookie: 7 guid: 8e442dad064e6d68'
  history who: 0
  history time: 1568411412
  history hostname: 'smartos'
unrecognized record:
  ioctl: 'vdev_set_state pool: nvme cookie: 7 guid: 8e442dad064e6d68'
  history who: 0
  history time: 1568411417
  history hostname: 'smartos'

ZFS_DBGMSG(zdb):
spa_open_common: opening nvme
spa_load(nvme, config trusted): LOADING
disk vdev '/dev/dsk/c2t5CD2E49575500100d0s0': best uberblock found for spa nvme. txg 17
spa_load(nvme, config untrusted): using uberblock with txg=17
spa_load(nvme, config trusted): LOADED
noahmehl commented 4 years ago

From @mgerdts on IRC:

Does prstat -mLc -n 10 1 show any fio or zpool_* threads with usr+sys close to 100? Any with LAT > 1?

I have not tested this yet. As also discussed on IRC, the zpool is not raidz; I'm documenting this here for future reference when load testing ZFS.

noahmehl commented 4 years ago

I'm going to check whether a thread or set of threads is the bottleneck by running prstat -mLc -n 10 1 during the test.
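
A minimal sketch of what that looks like in practice (the job file name and output path are assumptions): start the sampler in the background, run the fio job, then stop it.

# sample per-thread microstate data once a second while the benchmark runs
prstat -mLc -n 10 1 > /var/tmp/prstat.out &
BS=4 fio fio-write.job
kill %1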

noahmehl commented 4 years ago

@mgerdts also recommended testing:

noahmehl commented 4 years ago

Also from @mgerdts, on checking what sector size an NVMe device really uses:

prtconf -vp gives all device properties. Hunt down your disk and look for device-blksize and device-pblksize. Those are the logical and physical block sizes advertised by the disk. The values are on the line that follows, in hex. device-pblksize may not exist, in which case the logical and physical sizes are the same.
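
A small sketch of pulling just those properties out of the (very long) prtconf -vp output; the awk filter is my own addition and simply prints each matching line plus the value line that follows it:

# show the blksize properties and the hex values on the lines after them
prtconf -vp | awk '/blksize/ {print; getline; print}'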

rmustacc commented 4 years ago

On 11/14/19 11:18 AM, noahmehl wrote:

> Also from @mgerdts, on checking what sector size an NVMe device really uses:
>
> prtconf -vp gives all device properties. Hunt down your disk and look for device-blksize and device-pblksize. Those are the logical and physical block sizes advertised by the disk. The values are on the line that follows, in hex. device-pblksize may not exist, in which case the logical and physical sizes are the same.

You may want to look at illumos#11827 (https://www.illumos.org/issues/11827) and https://www.illumos.org/rb/r/2401/.