axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

version 3.18 sets stonewall incorrectly on s390x #1065

Closed: tdi2004 closed this issue 4 years ago

tdi2004 commented 4 years ago

Hi @hannesweisbach, assume the following jobfile:

[global]
blocksize_range=4k-32k
rw=randrw
percentage_random=50
rwmixread=50
iodepth=4
ioengine=libaio
verify=crc32c
verify_fatal=1
verify_dump=1
verify_backlog=100000
time_based
runtime=14d
exitall_on_error=1
numjobs=1

[LV1]
size=133722k
directory=/mnt1/LV1 

[LV2]
size=124315k
directory=/mnt1/LV2

[LV3]
size=124873k
directory=/mnt1/LV3

[LV4]
size=133380k
directory=/mnt1/LV4

[LV5]
size=168411k
directory=/mnt1/LV5

with commit 64402a8a7b158c1b7a778b5f2508e625a6fc6317 a change in behaviour was introduced. While older fio versions started the fio parent PID and one fio child PID per each directory to run fio on several mountpoints in parallel, this is now broken. fio runs only on the first directory specified:

root@t83lp32:~/fio-fio-3.18# ./fio jobfile.fio 
LV1: (g=0): rw=randrw, bs=(R) 4096B-32.0KiB, (W) 4096B-32.0KiB, (T) 4096B-32.0KiB, ioengine=libaio, iodepth=4
LV2: (g=1): rw=randrw, bs=(R) 4096B-32.0KiB, (W) 4096B-32.0KiB, (T) 4096B-32.0KiB, ioengine=libaio, iodepth=4
LV3: (g=2): rw=randrw, bs=(R) 4096B-32.0KiB, (W) 4096B-32.0KiB, (T) 4096B-32.0KiB, ioengine=libaio, iodepth=4
LV4: (g=3): rw=randrw, bs=(R) 4096B-32.0KiB, (W) 4096B-32.0KiB, (T) 4096B-32.0KiB, ioengine=libaio, iodepth=4
LV5: (g=4): rw=randrw, bs=(R) 4096B-32.0KiB, (W) 4096B-32.0KiB, (T) 4096B-32.0KiB, ioengine=libaio, iodepth=4
fio-3.18
Starting 5 processes
Jobs: 1 (f=1): [m(1),P(4)][0.0%][r=95.0MiB/s,w=48.1MiB/s][r=5726,w=2854 IOPS][eta 69d:23h:59m:46s]    

It shows only one job! Previous versions showed five in my case. The same goes for pidof: pidof fio shows only 2 PIDs (the parent and one child) instead of 6 (the parent and five children). Killing the parent PID shows in fio's output that only LV1 was used:

fio: terminating on signal 15
Jobs: 0 (f=0)
LV1: (groupid=0, jobs=1): err= 0: pid=647846: Thu Aug 13 17:47:45 2020
  read: IOPS=3206, BW=53.7MiB/s (56.3MB/s)(10.3GiB/196577msec)
    slat (nsec): min=0, max=42050M, avg=257430.22, stdev=73773012.21
    clat (usec): min=6, max=42097k, avg=787.08, stdev=127816.71
     lat (usec): min=18, max=42098k, avg=1044.69, stdev=147600.51
    clat percentiles (usec):
     |  1.00th=[   59],  5.00th=[  135], 10.00th=[  178], 20.00th=[  231],
     | 30.00th=[  277], 40.00th=[  322], 50.00th=[  388], 60.00th=[  461],
     | 70.00th=[  498], 80.00th=[  529], 90.00th=[  562], 95.00th=[  586],
     | 99.00th=[  644], 99.50th=[  668], 99.90th=[  775], 99.95th=[ 6194],
     | 99.99th=[25560]
   bw (  KiB/s): min=10717, max=55065, per=85.28%, avg=46924.46, stdev=5540.73, samples=223
   iops        : min=  647, max= 3191, avg=2741.35, stdev=323.20, samples=223
  write: IOPS=5899, BW=99.1MiB/s (104MB/s)(5422MiB/54706msec); 0 zone resets
    slat (usec): min=11, max=248, avg=51.57, stdev=26.86
    clat (nsec): min=1000, max=30439k, avg=259321.41, stdev=311162.57
     lat (usec): min=14, max=30486, avg=310.93, stdev=312.50
    clat percentiles (usec):
     |  1.00th=[   40],  5.00th=[   95], 10.00th=[  131], 20.00th=[  176],
     | 30.00th=[  210], 40.00th=[  237], 50.00th=[  260], 60.00th=[  281],
     | 70.00th=[  306], 80.00th=[  334], 90.00th=[  371], 95.00th=[  396],
     | 99.00th=[  445], 99.50th=[  465], 99.90th=[  553], 99.95th=[  766],
     | 99.99th=[18220]
   bw (  KiB/s): min=11713, max=57782, per=48.67%, avg=49396.36, stdev=5828.41, samples=223
   iops        : min=  683, max= 3342, avg=2870.70, stdev=339.23, samples=223
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.04%, 20=0.21%, 50=0.63%
  lat (usec)   : 100=2.76%, 250=27.72%, 500=49.29%, 750=19.22%, 1000=0.04%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.02%
  lat (msec)   : 2000=0.01%, >=2000=0.01%
  cpu          : usr=14.74%, sys=6.26%, ctx=652295, majf=0, minf=57
  IO depths    : 1=0.1%, 2=0.1%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=630291,322712,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: bw=53.7MiB/s (56.3MB/s), 53.7MiB/s-53.7MiB/s (56.3MB/s-56.3MB/s), io=10.3GiB (11.1GB), run=196577-196577msec
  WRITE: bw=99.1MiB/s (104MB/s), 99.1MiB/s-99.1MiB/s (104MB/s-104MB/s), io=5422MiB (5685MB), run=54706-54706msec

Run status group 1 (all jobs):

Run status group 2 (all jobs):

Run status group 3 (all jobs):

Run status group 4 (all jobs):

Disk stats (read/write):
    dm-42: ios=1501888/711229, merge=0/0, ticks=298820/328030, in_queue=626850, util=90.00%, aggrios=752390/356614, aggrmerge=0/0, aggrticks=149070/164570, aggrin_queue=313640, aggrutil=89.56%
    dm-1: ios=752172/355222, merge=0/0, ticks=86620/193730, in_queue=280350, util=89.56%, aggrios=510346/127313, aggrmerge=241826/227891, aggrticks=60705/108070, aggrin_queue=64310, aggrutil=89.04%
    dm-0: ios=510346/127313, merge=241826/227891, ticks=60705/108070, in_queue=64310, util=89.04%, aggrios=127608/31826, aggrmerge=0/0, aggrticks=60663/10801, aggrin_queue=47185, aggrutil=75.46%
  sdag: ios=128399/31784, merge=0/0, ticks=40480/10901, in_queue=26670, util=69.87%
  sdq: ios=126825/31380, merge=0/0, ticks=50803/10817, in_queue=37360, util=75.46%
  sda: ios=126817/31945, merge=0/0, ticks=76774/10718, in_queue=63620, util=69.88%
  sdaw: ios=128392/32198, merge=0/0, ticks=74598/10768, in_queue=61090, util=69.30%
    dm-3: ios=752609/358007, merge=0/0, ticks=211520/135410, in_queue=346930, util=89.31%, aggrios=516773/124292, aggrmerge=235836/233689, aggrticks=144446/44814, aggrin_queue=84340, aggrutil=89.31%
    dm-2: ios=516773/124292, merge=235836/233689, ticks=144446/44814, in_queue=84340, util=89.31%, aggrios=129215/31071, aggrmerge=0/0, aggrticks=65874/10714, aggrin_queue=52195, aggrutil=75.55%
  sdb: ios=128423/31267, merge=0/0, ticks=93050/10612, in_queue=79690, util=75.55%
  sdax: ios=130008/31426, merge=0/0, ticks=79287/10692, in_queue=65580, util=69.38%
  sdah: ios=130003/30966, merge=0/0, ticks=39890/10773, in_queue=25880, util=69.42%
  sdr: ios=128426/30627, merge=0/0, ticks=51272/10781, in_queue=37630, util=75.50%
root@t83lp32:~/fio-fio-3.18# 
sitsofe commented 4 years ago

Hi @tdi2004,

Could you:

I tried with 84abeef7d9f2f5cded36dcfc127b3f33db89ea57 and the following but could not reproduce your issue:

fio --rwmixread=50 --numjobs=1 --rw=randrw --exitall_on_error=1 --runtime=1d --time_based=1 --size=10M --name=1 --directory=1 --name=2 --directory=2
tdi2004 commented 4 years ago

Hi @sitsofe ,

  1. yes, I tried that with 3.19 and 3.21
  2. no impact:
    # ~/fio-fio-3.21/fio --rwmixread=50 --numjobs=1 --exitall_on_error=1 --runtime=30s --time_based=1 --size=10M --name=LV1 --directory=/mnt1/LV1 --name=LV2 --directory=/mnt1/LV2
    LV1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
    LV2: (g=1): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
    fio-3.21
    Starting 2 processes

The following output line led me to the conclusion that fio now runs the jobs on the various directories sequentially (instead of in parallel, as it did before):

    Jobs: 1 (f=1): [R(1),P(1)][4.9%][r=1270MiB/s][r=325k IOPS][eta 00m:58s]

=> almost 60s = 2*30s, and after 60s both directories had been exercised by fio for 30 seconds each. So parallel execution doesn't work with a command-line invocation either.
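For reference, the serialized timing matches what fio's stonewall option produces by design: a job with stonewall set waits for all preceding jobs to finish, and it also starts a new reporting group (which would explain jobs landing in g=0, g=1, ...). A minimal jobfile that deliberately produces the same 2*30s pattern (paths are illustrative):

```ini
[global]
size=10M
time_based
runtime=30

[first]
directory=/tmp/1

[second]
; waits for [first] to finish, so total wall time is roughly 60s
stonewall
directory=/tmp/2
```

With the stonewall line removed, both jobs start together and the run completes in about 30s.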

sitsofe commented 4 years ago

@tdi2004 ,

I still can't reproduce what you're seeing but I think the job can be minimised further. Let's check a few things:

hannesweisbach commented 4 years ago

Hi @tdi2004,

unfortunately I'm also not able to reproduce the behaviour you're describing. Using git tag fio-3.21 I get the following output:

time fio --rwmixread=50 --numjobs=1 --exitall_on_error=1 --runtime=30s --time_based=1 --size=10M --name=1   --directory=/tmp/1   --name=2 --directory=/tmp/2 --rw=randrw
1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
2: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1

Runtime is, as you expect, roughly 30s, so the jobs do in fact run in parallel. What I noticed is that in your output the jobs run in different groups (notice the g= in parentheses at the beginning of each line). I have no idea why that is … yet :)

Edit: I also tested current master g5c79. Result: "Works for me". (Although that probably doesn't help much.)

tdi2004 commented 4 years ago

# ~/fio-fio-3.21/fio --runtime=5 --time_based=1 --size=10M --name=1 --directory=/tmp/1 --name=2 --directory=/tmp/2

ran a job on /tmp/1 for 5s; when it finished, it ran another job on /tmp/2 for 5s. The jobs were not executed in parallel.

tdi2004 commented 4 years ago

# ~/fio-fio-3.21/fio --debug=process --output-format=json --runtime=5 --time_based=1 --size=10M --name=1 --directory=/tmp/1 --name=2 --directory=/tmp/2 > fio.log

Attaching the resulting log: fio.log

tdi2004 commented 4 years ago

Runtime is, as you expect, roughly 30s,

Not here. That's the culprit. For me it is 60s. My architecture is s390x; I can't imagine that the forking code is architecture dependent.

time ~/fio-fio-3.21/fio --rwmixread=50 --numjobs=1 --exitall_on_error=1 --runtime=30s --time_based=1 --size=10M --name=1   --directory=/tmp/1   --name=2 --directory=/tmp/2 --rw=randrw --output=/dev/null
1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
Jobs: 1 (f=1): [_(1),m(1)][100.0%][r=20.1MiB/s,w=20.4MiB/s][r=5152,w=5222 IOPS][eta 00m:00s]
real    1m1.059s
user    0m1.391s
sys 0m5.880s
hannesweisbach commented 4 years ago

Ah, I was about to ask about your compiler/architecture. I was expecting Linux/x86_64.

@sitsofe The log showed that stonewall is set, which shouldn't be the case. I suspect the issue has to do with cconv, because the value is only 16 bits wide; there might be something funky going on. 64402a8 changed the offset/alignment of top->stonewall.

sitsofe commented 4 years ago

@hannesweisbach good sleuthing. The type of stonewall is FIO_OPT_STR_SET and the type of exit_what is FIO_OPT_STR. I wonder if the option parsing is writing over the adjacent variable?

sitsofe commented 4 years ago

Compiling with

CC=clang ./configure --disable-optimizations --extra-cflags="-fsanitize=memory -fno-omit-frame-pointer -g" && make -j 8

then running

./fio --debug=all --runtime=5 --time_based=1 --size=10M --name=1 --directory=/tmp/1 --name=2 --directory=/tmp/2

stopped with the following error:

fio: set all debug options
==9921==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x54212c in parse_cmd_line /fio/init.c:2659:35
    #1 0x549cfa in parse_options /fio/init.c:2975:14
    #2 0x8b9b9d in main /fio/fio.c:42:6
    #3 0x7ff3af8cb0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
    #4 0x42698d in _start (/fio/fio+0x42698d)

SUMMARY: MemorySanitizer: use-of-uninitialized-value /fio/init.c:2659:35 in parse_cmd_line
Exiting

So that's a starting point... which turned out to be a dead end.

XeS0r commented 4 years ago

The issue should be located in int __handle_option(). It writes through an int pointer when it handles FIO_OPT_STR for the exit_what default value. It could also be that FIO_OPT_STR_SET for stonewall now writes to other fields as well, because it also uses an int pointer. If you agree that this is the issue, I can send you the patch I've already written and tested for it.

hannesweisbach commented 4 years ago

@sitsofe @XeS0r

I agree. Call stack is:

__handle_option()
val_store()
td_var()

The latter creates an int * which points to an unsigned short; val_store then writes through this pointer with the wrong size. I have no idea why this isn't an issue on x86/x86_64.

XeS0r commented 4 years ago

I guess x86 has a different member alignment, or the write has less impact because it's little endian. Do you want me to open a pull request?

hannesweisbach commented 4 years ago

Sure, go ahead. Although I'm not a maintainer.

sitsofe commented 4 years ago

@hannesweisbach x86/amd64 is notorious for quietly fixing up alignment issues at a minor speed penalty (https://stackoverflow.com/a/7517370).

hannesweisbach commented 4 years ago

@sitsofe Sure, int has no alignment requirement on x86, but it's also a write of the wrong size.

sitsofe commented 4 years ago

@tdi2004 thanks for persevering, @hannesweisbach, @XeS0r thanks for sorting this one!

Out of idle curiosity I checked what happened before this fix by running the following

fio --exit_what=all --ioengine=null --size=1T --name=1 --name=2

This showed the jobs going into different groups, and --debug=process showed stonewall was active. To me this proves @XeS0r's endianness theory from https://github.com/axboe/fio/issues/1065#issuecomment-674069311, because a setting of all inserts a value of -1 (which sets all bits to 1 in an unsigned int) and causes stonewall to become set too due to the overrun. Normally exit_what defaults to 1, and on a big-endian machine like s390x (where the least significant byte ends up at the highest address) the overrun would set exit_what to 0 and stonewall to 1.

After the fix in fd56c235caa42870e6dc33d661514375ea95ffc5 the above no longer causes two different groups to be made and there's no stonewall output when looking through the debug.

I had hoped that additional tooling could help detect this case but it doesn't look like it:

#include <stdio.h>
#include <stddef.h>

struct st {
        unsigned long long l1;
        unsigned short s1;
        unsigned short s2;
        unsigned int i1;
} comp = {.l1 = 0, .s1 = 1, .s2 = 2, .i1 = 3};

int main(void) {
        unsigned int *ip;
        void *p;

        /* Aim an int-sized pointer at the 16-bit member s1... */
        p = (char *) &comp + offsetof(struct st, s1);
        ip = p;
        /* ...and write 32 bits through it, overrunning into s2. */
        *ip = 444 << 16 | 777;
        printf("Address of .s1: %p .s2: %p\n", &(comp.s1), &(comp.s2));
        printf("Value of .l1: %lld .s1: %d .s2: %d i1: %d\n", comp.l1, comp.s1,
               comp.s2, comp.i1);
        return 0;
}

https://github.com/google/sanitizers/wiki/AddressSanitizerIntraObjectOverflow suggests only limited cases are catered for when it comes to intra object overflow (e.g. within a struct). Because it also defeats the type system I suspect this would be even tougher to detect statically.

hannesweisbach commented 4 years ago

Yup. I thought I was being smart using uint16_ts … going 32-bit for those options is probably the right call.