
Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Slow ephemeral OS disk compared to regular VM #3787

Open skycaptain opened 12 months ago

skycaptain commented 12 months ago

Describe the bug

We are using an AKS cluster with KEDA to run our Azure Pipelines jobs. As our build jobs are quite IO-intensive, we chose to use Standard_D32ads_v5 machines with Ephemeral OS disks. This allows us to put all workload on the temporary disk of the machines and simplifies our setup.

While conducting benchmarks with the fio commands from the docs to verify IOPS performance, we observed unexpected behavior with the AKS nodes:

$ fio --name=test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
$ fio --name=test --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
$ fio --name=test --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting

On average, we only achieve 80k IOPS instead of the specified 150k IOPS. This result made us curious, so we re-ran the same commands while directly SSH-ing into one of our nodes to eliminate any pipeline/container/Kubernetes overhead, and we obtained the same results. We then repeated the same tests on a separate Azure VM with the same settings (Standard_D32ads_v5 on Ephemeral OS disk in the same region), and we actually achieved the specified 150k IOPS. It appears that the Ephemeral OS disks on AKS-managed VMs are not as fast as those on regular Azure VMs. Are we missing something?
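
For reference, the standalone comparison VM was created roughly like this (a sketch; the image, names, and disk size here are illustrative rather than our exact deployment):

# Standalone VM with an ephemeral OS disk placed on the resource disk
$ az vm create \
    --resource-group myResourceGroup \
    --name myBenchmarkVM \
    --image Ubuntu2204 \
    --size Standard_D32ads_v5 \
    --ephemeral-os-disk true \
    --ephemeral-os-disk-placement ResourceDisk \
    --os-disk-size-gb 1200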

To Reproduce

Steps to reproduce the behavior:

  1. Set up an AKS cluster with Standard_D32ads_v5 machines and ephemeral OS disks
  2. Run the fio commands above
  3. Check the average IOPS

Expected behavior

I expect to achieve the same level of performance on AKS nodes as on regular Azure VMs with ephemeral OS disks.

Screenshots

N/A.

Environment (please complete the following information):

Additional context

N/A.

alexeldeib commented 12 months ago

hmmm. you're not missing something.

what's the exact configuration when you run the test? are you writing into the pod root overlayfs, host mounting the disk, emptyDir, etc?

skycaptain commented 11 months ago

what's the exact configuration when you run the test? are you writing into the pod root overlayfs, host mounting the disk, emptyDir, etc?

We use emptyDir for our workload. However, since we initially suspected that the problem might be with overlayfs or other drivers, we used kubectl-exec to get a shell directly on the node to eliminate any overhead.
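
For completeness, the in-cluster runs mount an emptyDir volume roughly like this (a minimal sketch; the image and names are illustrative, not our actual pipeline pod):

# Run fio against an emptyDir-backed path inside a pod (illustrative pod spec)
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: fio-emptydir
spec:
  restartPolicy: Never
  containers:
  - name: fio
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["apt-get update && apt-get install -y fio && fio --name=test --filename=/scratch/test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}
EOF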

These are the commands for a minimal reproducible example:

# Create AKS cluster in West Europe
$ az aks create --name myAKSCluster --resource-group myResourceGroup -s Standard_D32ads_v5 --node-osdisk-type Ephemeral --node-osdisk-size 1200
# Get credentials for kubectl
$ az aks get-credentials --name myAKSCluster --resource-group myResourceGroup

Then, use kubectl-exec to get a shell directly on the node:

# Use kubectl-exec to get a shell on the node
$ ./kubectl-exec
root@aks-nodepool1-26530957-vmss000000:/# apt update && apt install fio
root@aks-nodepool1-26530957-vmss000000:/# fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.28
Starting 4 processes
Jobs: 4 (f=4): [m(4)][100.0%][r=160MiB/s,w=158MiB/s][r=40.9k,w=40.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=8515: Fri Jul 14 08:30:54 2023
  read: IOPS=40.8k, BW=159MiB/s (167MB/s)(4784MiB/30013msec)
    slat (nsec): min=1402, max=24327k, avg=35905.57, stdev=301878.58
    clat (usec): min=101, max=92076, avg=12429.66, stdev=9985.28
     lat (usec): min=108, max=92079, avg=12465.66, stdev=10006.77
    clat percentiles (usec):
     |  1.00th=[  955],  5.00th=[ 1631], 10.00th=[ 2147], 20.00th=[ 3458],
     | 30.00th=[ 5342], 40.00th=[ 7570], 50.00th=[10028], 60.00th=[12780],
     | 70.00th=[15795], 80.00th=[20055], 90.00th=[26346], 95.00th=[32113],
     | 99.00th=[43779], 99.50th=[48497], 99.90th=[58983], 99.95th=[64750],
     | 99.99th=[72877]
   bw (  KiB/s): min=96016, max=328424, per=100.00%, avg=163342.64, stdev=12166.36, samples=236
   iops        : min=24004, max=82106, avg=40835.66, stdev=3041.59, samples=236
  write: IOPS=40.8k, BW=159MiB/s (167MB/s)(4786MiB/30013msec); 0 zone resets
    slat (nsec): min=1502, max=25595k, avg=36374.11, stdev=303592.46
    clat (usec): min=62, max=92076, avg=12580.36, stdev=10015.75
     lat (usec): min=68, max=92079, avg=12616.83, stdev=10037.27
    clat percentiles (usec):
     |  1.00th=[  963],  5.00th=[ 1647], 10.00th=[ 2212], 20.00th=[ 3589],
     | 30.00th=[ 5538], 40.00th=[ 7767], 50.00th=[10290], 60.00th=[12911],
     | 70.00th=[16057], 80.00th=[20055], 90.00th=[26608], 95.00th=[32113],
     | 99.00th=[43779], 99.50th=[48497], 99.90th=[58983], 99.95th=[63701],
     | 99.99th=[72877]
   bw (  KiB/s): min=95680, max=326320, per=100.00%, avg=163413.15, stdev=12149.27, samples=236
   iops        : min=23920, max=81580, avg=40853.29, stdev=3037.32, samples=236
  lat (usec)   : 100=0.01%, 250=0.04%, 500=0.17%, 750=0.33%, 1000=0.57%
  lat (msec)   : 2=7.11%, 4=14.49%, 10=26.71%, 20=30.52%, 50=19.65%
  lat (msec)   : 100=0.40%
  cpu          : usr=2.32%, sys=8.72%, ctx=800877, majf=0, minf=70
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1224732,1225234,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=4784MiB (5017MB), run=30013-30013msec
  WRITE: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=4786MiB (5019MB), run=30013-30013msec

Disk stats (read/write):
  sda: ios=1219919/1220277, merge=2/87, ticks=11318790/11224205, in_queue=22543101, util=99.73%
daalse commented 11 months ago

AKS clusters with ephemeral OS disks use VMSS with Standard SSD disks. That could be the cause of the performance issue.

skycaptain commented 11 months ago

Hmm. Could you please provide a reference? I have not found any mention of this limitation in the documentation and was expecting that it would run on the same SSDs as regular Azure VMs. In fact, they mention this in their blog:

[...] In addition, the ephemeral OS disk will share the IOPS with the temporary storage disk as per the VM size you selected. Ephemeral disks also require that the VM size supports Premium storage. The sizes usually have an s in the name, like DSv2 and EsV3. For more information, see Azure VM sizes for details around which sizes support Premium storage.

[...] This VM Series supports both VM cache and temporary storage SSD. High Scale VMs like DSv2-series that leverage Azure Premium Storage have a multi-tier caching technology called BlobCache. BlobCache uses a combination of the host RAM and local SSD for caching. This cache is available for the Premium Storage persistent disks and VM local disks. The VM cache can be used for hosting an ephemeral OS disk. When a VM series supports the VM cache, its size depends on the VM series and VM size. The VM cache size is indicated in parentheses next to IO throughput ("cache size in GiB").
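
As a sanity check, the capabilities Azure advertises for this VM size (including the uncached disk IOPS limit) can be listed with the CLI, e.g.:

# List the advertised capabilities for the VM size (output omitted here)
$ az vm list-skus --location westeurope --size Standard_D32ads_v5 --query "[0].capabilities" --output table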

daalse commented 11 months ago

I'm not sure if it is explained somewhere, but I have two different AKS clusters deployed, one with an ephemeral OS disk. Checking its VMSS configuration, I can see that the attached OS disk type is: Standard HDD LRS

On the other hand, for the other AKS cluster, the attached OS disk type is: Premium SSD LRS.
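
For anyone who wants to check the same thing, the OS disk configuration can be read from the node resource group via the CLI (the resource group and VMSS names below are placeholders following the usual MC_ naming):

# Inspect the OS disk settings of the AKS node pool VMSS (names are placeholders)
$ az vmss show \
    --resource-group MC_myResourceGroup_myAKSCluster_westeurope \
    --name aks-nodepool1-12345678-vmss \
    --query virtualMachineProfile.storageProfile.osDisk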

skycaptain commented 11 months ago

Unfortunately, I don't have access to the internal resource group created by AKS; our IT department limits access permissions, so I can only see the AKS resource in my own resource group. However, I believe this might just be a display issue, since I see the same thing for regular VMs: the portal also says "Standard HDD LRS" yet reports 150000 IOPS.

[Screenshot: Azure portal disk view showing "Standard HDD LRS" alongside 150000 IOPS]

alexeldeib commented 11 months ago

the ssd thing is an implementation detail, the backing image is Standard HDD pulled to local disk

this is on my to-do list to repro but dealing with cgroupv2 issues a bit first :)
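
For what it's worth, listing the block devices from a node shell shows which device the root filesystem lives on; as far as I understand, with ResourceDisk placement the ephemeral OS disk consumes the temp disk, so no separate temp device should appear:

# From a node shell: list block devices and mount points (illustrative check)
root@aks-nodepool1-26530957-vmss000000:/# lsblk --output NAME,SIZE,TYPE,MOUNTPOINT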

alexeldeib commented 11 months ago

specifically

We use emptyDir for our workload

this is surprising(ly bad performance for empty dir)

fsniper commented 4 months ago

We have some AKS clusters with even worse ephemeral disk performance.

# fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.1
Starting 4 processes
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=97.3MiB/s,w=97.5MiB/s][r=24.9k,w=24.0k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=3753471: Fri Feb  9 16:56:17 2024
   read: IOPS=27.0k, BW=109MiB/s (115MB/s)(3280MiB/30016msec)
    slat (nsec): min=1400, max=85393k, avg=36894.18, stdev=430688.70
    clat (usec): min=101, max=404282, avg=18366.00, stdev=17828.60
     lat (usec): min=113, max=404284, avg=18403.03, stdev=17831.82
    clat percentiles (usec):
     |  1.00th=[  1074],  5.00th=[  3163], 10.00th=[  5014], 20.00th=[  7832],
     | 30.00th=[  9896], 40.00th=[ 11731], 50.00th=[ 13829], 60.00th=[ 16319],
     | 70.00th=[ 19792], 80.00th=[ 25035], 90.00th=[ 35390], 95.00th=[ 47449],
     | 99.00th=[ 85459], 99.50th=[106431], 99.90th=[179307], 99.95th=[252707],
     | 99.99th=[371196]
   bw (  KiB/s): min=10024, max=38168, per=25.00%, avg=27975.48, stdev=4753.45, samples=240
   iops        : min= 2506, max= 9542, avg=6993.83, stdev=1188.36, samples=240
  write: IOPS=28.0k, BW=109MiB/s (115MB/s)(3286MiB/30016msec)
    slat (nsec): min=1600, max=80643k, avg=65559.22, stdev=532054.65
    clat (usec): min=37, max=404190, avg=18097.22, stdev=17720.12
     lat (usec): min=61, max=404201, avg=18162.95, stdev=17718.90
    clat percentiles (usec):
     |  1.00th=[   938],  5.00th=[  2933], 10.00th=[  4752], 20.00th=[  7570],
     | 30.00th=[  9634], 40.00th=[ 11469], 50.00th=[ 13566], 60.00th=[ 16057],
     | 70.00th=[ 19530], 80.00th=[ 24773], 90.00th=[ 34866], 95.00th=[ 47449],
     | 99.00th=[ 85459], 99.50th=[104334], 99.90th=[179307], 99.95th=[248513],
     | 99.99th=[371196]
   bw (  KiB/s): min=10008, max=37992, per=25.01%, avg=28029.28, stdev=4702.18, samples=240
   iops        : min= 2502, max= 9498, avg=7007.27, stdev=1175.56, samples=240
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.09%, 500=0.25%, 750=0.30%
  lat (usec)   : 1000=0.34%
  lat (msec)   : 2=1.70%, 4=4.71%, 10=23.97%, 20=39.31%, 50=24.95%
  lat (msec)   : 100=3.79%, 250=0.54%, 500=0.05%
  cpu          : usr=2.26%, sys=18.58%, ctx=1159484, majf=0, minf=8305
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwt: total=839618,841158,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=3280MiB (3439MB), run=30016-30016msec
  WRITE: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=3286MiB (3445MB), run=30016-30016msec

Disk stats (read/write):
  sda: ios=834701/836459, merge=2/156, ticks=10581685/10425060, in_queue=17674896, util=99.71%
mweibel commented 3 months ago

Thanks for opening this issue. We are experiencing the same problem and could reproduce it on standard VMs.

Azure Support Request ID: 2403070050000924

Windows

Windows VM (instance type Standard_D96ads_v5):

                "storageProfile": {
                    "osDisk": {
                        "createOption": "fromImage",
                        "diskSizeGB": 2040,
                        "managedDisk": {
                            "storageAccountType": "Standard_LRS"
                        },
                        "caching": "ReadOnly",
                        "diffDiskSettings": {
                            "option": "Local",
                            "placement": "ResourceDisk"
                        },
                        "deleteOption": "Delete"
                    },
                    "imageReference": {
                        "publisher": "MicrosoftWindowsServer",
                        "offer": "WindowsServer",
                        "sku": "2019-datacenter-gensecond",
                        "version": "latest"
                    }
                },

Result on OS disk drive (C:):

PS C:\Users\redacted>  fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=windowsaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=256
...
fio-3.36
Starting 4 threads
test: Laying out IO files (2 files / total 30720MiB)
Jobs: 4 (f=8): [m(4)][100.0%][r=159MiB/s,w=157MiB/s][r=40.7k,w=40.3k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=3460: Wed Mar 6 16:04:01 2024
  read: IOPS=37.5k, BW=147MiB/s (154MB/s)(4397MiB/30013msec)
    slat (nsec): min=3700, max=90600, avg=6934.13, stdev=1319.90
    clat (msec): min=2, max=836, avg=12.93, stdev= 2.65
     lat (msec): min=2, max=836, avg=12.93, stdev= 2.65
    clat percentiles (msec):
     |  1.00th=[   11],  5.00th=[   12], 10.00th=[   12], 20.00th=[   13],
     | 30.00th=[   13], 40.00th=[   13], 50.00th=[   13], 60.00th=[   14],
     | 70.00th=[   14], 80.00th=[   14], 90.00th=[   14], 95.00th=[   15],
     | 99.00th=[   15], 99.50th=[   16], 99.90th=[   17], 99.95th=[   22],
     | 99.99th=[  124]
   bw (  KiB/s): min=112901, max=165761, per=100.00%, avg=158036.82, stdev=2210.54, samples=224
   iops        : min=28223, max=41439, avg=39508.04, stdev=552.64, samples=224
  write: IOPS=37.6k, BW=147MiB/s (154MB/s)(4402MiB/30013msec); 0 zone resets
    slat (usec): min=4, max=472, avg= 7.32, stdev= 1.42
    clat (msec): min=2, max=835, avg=12.87, stdev= 2.74
     lat (msec): min=2, max=835, avg=12.88, stdev= 2.74
    clat percentiles (msec):
     |  1.00th=[   11],  5.00th=[   12], 10.00th=[   12], 20.00th=[   13],
     | 30.00th=[   13], 40.00th=[   13], 50.00th=[   13], 60.00th=[   13],
     | 70.00th=[   14], 80.00th=[   14], 90.00th=[   14], 95.00th=[   15],
     | 99.00th=[   15], 99.50th=[   16], 99.90th=[   18], 99.95th=[   24],
     | 99.99th=[  124]
   bw (  KiB/s): min=116413, max=165865, per=100.00%, avg=158249.00, stdev=2121.27, samples=224
   iops        : min=29102, max=41464, avg=39561.23, stdev=530.30, samples=224
  lat (msec)   : 4=0.01%, 10=0.10%, 20=99.83%, 50=0.01%, 250=0.05%
  lat (msec)   : 1000=0.01%
  cpu          : usr=0.00%, sys=12.50%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%
     issued rwts: total=1125614,1127004,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=147MiB/s (154MB/s), 147MiB/s-147MiB/s (154MB/s-154MB/s), io=4397MiB (4611MB), run=30013-30013msec
  WRITE: bw=147MiB/s (154MB/s), 147MiB/s-147MiB/s (154MB/s-154MB/s), io=4402MiB (4616MB), run=30013-30013msec

Result on tmp disk drive (D:):

PS D:\>  fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=windowsaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=256
...
fio-3.36
Starting 4 threads
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=659MiB/s,w=663MiB/s][r=169k,w=170k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=7160: Wed Mar 6 16:05:24 2024
  read: IOPS=138k, BW=538MiB/s (565MB/s)(15.8GiB/30003msec)
    slat (usec): min=2, max=5876, avg= 8.82, stdev=49.05
    clat (usec): min=76, max=155071, avg=3126.87, stdev=8496.17
     lat (usec): min=114, max=155490, avg=3135.70, stdev=8529.38
    clat percentiles (usec):
     |  1.00th=[   979],  5.00th=[  1385], 10.00th=[  1532], 20.00th=[  1811],
     | 30.00th=[  2057], 40.00th=[  2212], 50.00th=[  2409], 60.00th=[  2671],
     | 70.00th=[  2802], 80.00th=[  2933], 90.00th=[  3228], 95.00th=[  5211],
     | 99.00th=[ 10945], 99.50th=[ 15008], 99.90th=[149947], 99.95th=[149947],
     | 99.99th=[152044]
   bw (  KiB/s): min=  572, max=940577, per=100.00%, avg=569406.32, stdev=72308.44, samples=228
   iops        : min=  140, max=235144, avg=142350.51, stdev=18077.13, samples=228
  write: IOPS=138k, BW=538MiB/s (564MB/s)(15.8GiB/30003msec); 0 zone resets
    slat (usec): min=2, max=6530, avg= 9.46, stdev=50.19
    clat (usec): min=8, max=155215, avg=3079.78, stdev=8620.81
     lat (usec): min=42, max=156215, avg=3089.24, stdev=8654.57
    clat percentiles (usec):
     |  1.00th=[   914],  5.00th=[  1319], 10.00th=[  1483], 20.00th=[  1745],
     | 30.00th=[  1991], 40.00th=[  2147], 50.00th=[  2343], 60.00th=[  2606],
     | 70.00th=[  2737], 80.00th=[  2868], 90.00th=[  3163], 95.00th=[  5145],
     | 99.00th=[ 10945], 99.50th=[ 15270], 99.90th=[149947], 99.95th=[149947],
     | 99.99th=[152044]
   bw (  KiB/s): min=  708, max=942222, per=100.00%, avg=569302.63, stdev=72263.71, samples=228
   iops        : min=  174, max=235555, avg=142324.65, stdev=18065.98, samples=228
  lat (usec)   : 10=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.02%
  lat (usec)   : 750=0.22%, 1000=1.08%
  lat (msec)   : 2=27.76%, 4=64.10%, 10=5.53%, 20=0.91%, 50=0.02%
  lat (msec)   : 100=0.02%, 250=0.33%
  cpu          : usr=6.67%, sys=28.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.7%, >=64=99.1%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.2%, 8=0.5%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%
     issued rwts: total=4135353,4134254,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=538MiB/s (565MB/s), 538MiB/s-538MiB/s (565MB/s-565MB/s), io=15.8GiB (16.9GB), run=30003-30003msec
  WRITE: bw=538MiB/s (564MB/s), 538MiB/s-538MiB/s (564MB/s-564MB/s), io=15.8GiB (16.9GB), run=30003-30003msec

Linux

Same behaviour on Linux nodes with the same configuration (slightly higher throughput, but still well below spec).

Linux VM, /dev/sda:

$ fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.16
Starting 4 processes
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=165MiB/s,w=166MiB/s][r=42.2k,w=42.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=5083: Mon Mar 11 14:39:24 2024
  read: IOPS=41.3k, BW=161MiB/s (169MB/s)(4844MiB/30013msec)
    slat (nsec): min=1794, max=7176.0k, avg=3447.61, stdev=14679.02
    clat (usec): min=95, max=61056, avg=12404.20, stdev=5252.66
     lat (usec): min=99, max=61062, avg=12407.73, stdev=5252.56
    clat percentiles (usec):
     |  1.00th=[ 1942],  5.00th=[ 4113], 10.00th=[ 6063], 20.00th=[ 8586],
     | 30.00th=[ 9896], 40.00th=[10945], 50.00th=[11994], 60.00th=[12911],
     | 70.00th=[14091], 80.00th=[16188], 90.00th=[19268], 95.00th=[22152],
     | 99.00th=[27919], 99.50th=[30016], 99.90th=[34866], 99.95th=[36963],
     | 99.99th=[42206]
   bw (  KiB/s): min=126600, max=218720, per=99.99%, avg=165255.52, stdev=4775.38, samples=240
   iops        : min=31650, max=54680, avg=41313.80, stdev=1193.85, samples=240
  write: IOPS=41.3k, BW=162MiB/s (169MB/s)(4848MiB/30013msec); 0 zone resets
    slat (nsec): min=1864, max=10598k, avg=3637.47, stdev=15942.32
    clat (usec): min=51, max=67756, avg=12358.53, stdev=5258.65
     lat (usec): min=58, max=67765, avg=12362.26, stdev=5258.56
    clat percentiles (usec):
     |  1.00th=[ 1893],  5.00th=[ 4047], 10.00th=[ 5997], 20.00th=[ 8455],
     | 30.00th=[ 9765], 40.00th=[10945], 50.00th=[11994], 60.00th=[12780],
     | 70.00th=[14091], 80.00th=[16057], 90.00th=[19268], 95.00th=[22152],
     | 99.00th=[27657], 99.50th=[29754], 99.90th=[34866], 99.95th=[36439],
     | 99.99th=[41681]
   bw (  KiB/s): min=125984, max=218256, per=99.99%, avg=165388.25, stdev=4803.97, samples=240
   iops        : min=31496, max=54564, avg=41346.98, stdev=1200.99, samples=240
  lat (usec)   : 100=0.01%, 250=0.03%, 500=0.08%, 750=0.10%, 1000=0.13%
  lat (msec)   : 2=0.76%, 4=3.70%, 10=26.61%, 20=60.16%, 50=8.42%
  lat (msec)   : 100=0.01%
  cpu          : usr=2.56%, sys=9.09%, ctx=855077, majf=0, minf=362
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1240047,1241056,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=161MiB/s (169MB/s), 161MiB/s-161MiB/s (169MB/s-169MB/s), io=4844MiB (5079MB), run=30013-30013msec
  WRITE: bw=162MiB/s (169MB/s), 162MiB/s-162MiB/s (169MB/s-169MB/s), io=4848MiB (5083MB), run=30013-30013msec

Disk stats (read/write):
  sda: ios=1230258/1233558, merge=0/59, ticks=15245610/15213443, in_queue=30459065, util=99.74%

Linux VM, /dev/sdb:

$ sudo fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.16
Starting 4 processes
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=884MiB/s,w=884MiB/s][r=226k,w=226k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=5315: Mon Mar 11 14:44:04 2024
  read: IOPS=227k, BW=888MiB/s (931MB/s)(26.0GiB/30006msec)
    slat (nsec): min=1903, max=165258, avg=3956.31, stdev=2428.61
    clat (usec): min=101, max=38970, avg=2278.11, stdev=1649.00
     lat (usec): min=106, max=38974, avg=2282.14, stdev=1648.94
    clat percentiles (usec):
     |  1.00th=[ 1020],  5.00th=[ 1647], 10.00th=[ 1729], 20.00th=[ 1827],
     | 30.00th=[ 1893], 40.00th=[ 1958], 50.00th=[ 2008], 60.00th=[ 2073],
     | 70.00th=[ 2147], 80.00th=[ 2245], 90.00th=[ 2474], 95.00th=[ 3523],
     | 99.00th=[ 7701], 99.50th=[15926], 99.90th=[22938], 99.95th=[23462],
     | 99.99th=[24249]
   bw (  KiB/s): min=678033, max=1066016, per=100.00%, avg=909662.47, stdev=17331.74, samples=240
   iops        : min=169508, max=266504, avg=227415.52, stdev=4332.94, samples=240
  write: IOPS=227k, BW=888MiB/s (931MB/s)(26.0GiB/30006msec); 0 zone resets
    slat (nsec): min=1993, max=297325, avg=4342.53, stdev=2561.83
    clat (usec): min=48, max=38642, avg=2216.51, stdev=1645.97
     lat (usec): min=52, max=38645, avg=2220.93, stdev=1645.91
    clat percentiles (usec):
     |  1.00th=[  938],  5.00th=[ 1582], 10.00th=[ 1680], 20.00th=[ 1778],
     | 30.00th=[ 1844], 40.00th=[ 1893], 50.00th=[ 1942], 60.00th=[ 2008],
     | 70.00th=[ 2089], 80.00th=[ 2180], 90.00th=[ 2409], 95.00th=[ 3458],
     | 99.00th=[ 7635], 99.50th=[15795], 99.90th=[22938], 99.95th=[23462],
     | 99.99th=[24249]
   bw (  KiB/s): min=680023, max=1069920, per=100.00%, avg=909058.05, stdev=17212.36, samples=240
   iops        : min=170005, max=267480, avg=227264.42, stdev=4303.11, samples=240
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.10%, 500=0.22%, 750=0.31%
  lat (usec)   : 1000=0.41%
  lat (msec)   : 2=52.39%, 4=42.75%, 10=3.12%, 20=0.34%, 50=0.36%
  cpu          : usr=9.50%, sys=50.41%, ctx=2411385, majf=0, minf=408
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=6823830,6819254,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=888MiB/s (931MB/s), 888MiB/s-888MiB/s (931MB/s-931MB/s), io=26.0GiB (27.9GB), run=30006-30006msec
  WRITE: bw=888MiB/s (931MB/s), 888MiB/s-888MiB/s (931MB/s-931MB/s), io=26.0GiB (27.9GB), run=30006-30006msec

Disk stats (read/write):
  sdb: ios=6820239/6815740, merge=0/236, ticks=14524924/14052098, in_queue=28577021, util=99.73%

Our assumption would be that the performance of those two disks is similar, if not identical, since the OS disk is placed on the temp disk.

Attached is an ARM deployment template which can be used to reproduce this. I attached the Linux one; if you like, I can also attach the Windows one. Make sure to look at parameters.json, adjust it, and then apply with:

# make sure to adjust resource group and use an existing one.
az deployment group create --resource-group ephemeraltest --template-file template.json --parameters @parameters.json

Attachments: parameters.json, template.json
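
After deployment, the effective OS disk settings of the resulting VM can be double-checked with (the resource group matches the command above; the VM name is whatever the template produces):

# Verify the ephemeral OS disk placement on the deployed VM
$ az vm show --resource-group ephemeraltest --name <vm-name> --query storageProfile.osDisk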

mweibel commented 2 months ago

FYI, we've been in contact with Azure support for a few weeks about this issue. No update yet; we haven't had the easiest time convincing support that it actually is an issue.

microsoft-github-policy-service[bot] commented 2 months ago

Issue needing attention of @Azure/aks-leads
