axboe / fio

Flexible I/O Tester
GNU General Public License v2.0
5.26k stars 1.26k forks source link

FIO throw out error "clock setaffinity failed: Invalid argument" before it starts to run on Solaris 11.4 #796

Open lic34 opened 5 years ago

lic34 commented 5 years ago

FIO profile

[global] ioengine=solarisaio iodepth=16 direct=1 bs_unaligned=0 time_based=1 rwmixwrite=50 rwmixread=50 do_verify=1 bsrange=4k-4k refill_buffers=0 thread runtime=2808 fill_device=1 numjobs=1 readwrite=randrw [public_lg_src_remote_23] filename=/dev/dsk/emcpower4c size=86% [public_lg_src_remote_24] filename=/dev/dsk/emcpower0c size=86%

Error:

Assertion failed: io_u->flags & IO_U_F_FLIGHT, file io_u.c, line 1374, function io_completed Abort (core dumped)

However, when it performs IO to one LUN, it works properly.

sitsofe commented 5 years ago

@lic34 I'm afraid we would need at least the full backtrace to be able to take this one further. If it's somewhat deterministic then we might be able to turn debug=io on too for some more clues...

sitsofe commented 5 years ago

Get a backtrace of it (I don't have a solaris system so the core itself wouldn't be too useful). Something like

gdb --core=path/to/core path/to/fio/binary
thread apply all bt
quit

And copy pasting the backtrace here would do (note those commands are a rough guess).

lic34 commented 5 years ago

The backtrace is as following:(Moreover, I'm using FIO version with 2.0.14)

(gdb) bt

0 0xfdf5db85 in __nanosleep () from /lib/libc.so.1

1 0xfdf48943 in nanosleep () from /lib/libc.so.1

2 0xfdf48ce3 in usleep () from /lib/libc.so.1

3 0x0808edf3 in ?? ()

4 0x0808f1ab in fio_backend ()

5 0x08094ac6 in main ()

sitsofe commented 5 years ago

@lic34:

Moreover, I'm using FIO version with 2.0.14

Ah that is important! I'm afraid we (upstream fio) don't support any version of fio other than the latest (at this stage your version of fio is over six years old). I would strongly suggest seeing if the problem happens in newer version of fio (see https://github.com/axboe/fio/releases for which release we're up to now).

lic34 commented 5 years ago

I got FIO with version 2.0.14 from this page [https://www.opencsw.org/packages/fio/] for my Solaris 11.4 host, but you recommend me to use the latest version. So, what's I'm concerned is that, does the FIO with the latest version support Solaris 11.4?

sitsofe commented 5 years ago

@lic34 Theoretically the latest one if you apply the fixes mentioned in #714 ...

szaydel commented 5 years ago

@lic34, I have not had issues building recent checkouts from master on Illumos, and while that does not guarantee anything, I expect that it should build on Solaris as well. If it does not, perhaps it is worth logging an issue with details from failure and I or someone else may be able to help.

lic34 commented 5 years ago

@sitsofe @szaydel thank you for your help~!!

I tried to use the latest version(3.14) on my Solaris host, and I hit the same issue as #714,

I changed the code as @sitsofe's suggestion:

changing

#define os_ctime_r(x, y, z) ctime_r((x), (y), (z))

to

#define os_ctime_r(x, y, z) ctime_r((x), (y))

It installs successfully.

However, when I run the FIO process, it reported an error as

clock setaffinity failed: Invalid argument

sitsofe commented 5 years ago

@lic34 Is there any more context to go along with that? I'm going to guess it's something to do with https://github.com/axboe/fio/blob/a0318e176ace1e53b0049965f63b0c60b55ff8cd/gettime.c#L588 so by extension https://github.com/axboe/fio/blob/de5ed0e4d398bc9d4576f9b2b82d7686989c27e1/os/os-solaris.h#L108 but we'd have to know which argument was bad and what the correct thing looks like...

lic34 commented 5 years ago

It seems the FIO keep running after the error throwed out, however there is only one process(job) was running, even there are more than two Luns in the profile.

root@ncvm9084105:/opt/csw/bin/fio-log# fio --output /opt/csw/bin/fio-log/test.log /opt/csw/bin/fio-log/test.fio clock setaffinity failed: Invalid argument Jobs: 1 (f=1): [X(1),m(1)][7.4%][r=16.8MiB/s,w=16.7MiB/s][r=4305,w=4280 IOPS][eta 43m:22s]

sitsofe commented 5 years ago

@lic34 Hmm... looking through https://docs.oracle.com/cd/E88353_01/html/E37841/pset-bind-2.html I wonder if fio is trying to bind to a cpu that your system has disallowed (I'm guessing that call is returning EINVAL). Could you simplify your job file further (i.e. remove as much as possible) so we have the bare minimum that still makes the current failing job issue happen?

(I'm likely busy for the rest of the day but hopefully others will chip in if they can)

sitsofe commented 5 years ago

(Mentioning @alanc as a long shot)

lic34 commented 5 years ago

I simplified the profile likes:

_ [global] ioengine=solarisaio iodepth=16 direct=0 bs_unaligned=0 time_based=1 rwmixwrite=50 rwmixread=50 do_verify=1 bsrange=4k-4k refill_buffers=0 thread runtime=2808 fill_device=1 numjobs=1 readwrite=randrw [public_lg_src_remote20] filename=/dev/dsk/emcpower1c size=86%

However, it seems it has the same issue:

clock setaffinity failed: Invalid argument Jobs: 1 (f=1): [m(1)][0.2%][r=12.9MiB/s,w=12.9MiB/s][r=3293,w=3296 IOPS][eta 46m:43s]

sitsofe commented 5 years ago

@lic34 if you keep both jobs and remove options like direct, fill_device, refill_buffers, bs_unaligned, thread, numjobs (basically as many options as possible) etc. what's the smallest job that results in the premature death of one of the threads?

lic34 commented 5 years ago

I removed all options as following:

_[global] ioengine=solarisaio [public_lg_src_remote_20] filename=/dev/dsk/emcpower1c size=86% [public_lg_src_remote_21] filename=/dev/dsk/emcpower2c size=86% [public_lg_src_remote_22] filename=/dev/dsk/emcpower3c size=86% [public_lg_src_remote_23] size=86% filename=/dev/dsk/emcpower4c size=86% [public_lg_src_remote24] filename=/dev/dsk/emcpower0c size=86%

But the error is still here:

_# fio --output /opt/csw/bin/fio-log/id_1562061394_912506_4191_25.log /opt/csw/bin/fio-log/id_1562061394_107510_419125.fio clock setaffinity failed: Invalid argument Jobs: 5 (f=5): [R(5)][3.0%][r=50.9MiB/s][r=13.0k IOPS][eta 01h:12m:51s]

szaydel commented 5 years ago

I recall seeing something to this end. I will see if it happens on illumos with multiple cores, where I think I observed it also, and if I have some spare time I will see if I can get Solaris going also.

lic34 commented 5 years ago

Thanks @szaydel for your support! Moreover, I find another issue when I was using the FIO on my Solaris 11.4, please refer to #798 for more details.

szaydel commented 5 years ago

Segfault is not something I expected to see, but maybe it is somehow related to this issue. I will try to mess with it this weekend.

szaydel commented 5 years ago

What happens if you change to say posixaio, and still use threads?

lic34 commented 5 years ago

Sorry for my late reply.

I change the IO engine from solarisaio to posixaio and add thread, it seems the fio is working property after throwing an error as "clock setaffinity failed: Invalid argument".

The output when the process begins run:

root@ncvm9084105:/opt/csw/bin/fio-log# fio --output debug.log debug.fio clock setaffinity failed: Invalid argument Jobs: 5 (f=5): [m(5)][24.4%][r=6782KiB/s,w=6898KiB/s][r=1695,w=1724 IOPS][eta 35m:22s]

FIO parameters:

_root@ncvm9084105:/opt/csw/bin/fio-log# cat debug.fio [global] ioengine=posixaio iodepth=16 direct=0 thread bs_unaligned=0 time_based=1 rwmixwrite=50 rwmixread=50 do_verify=1 bsrange=4k-4k refill_buffers=0 runtime=2808 fill_device=1 numjobs=1 readwrite=randrw [public_lg_src_remote_20] filename=/dev/dsk/emcpower1c size=86% [public_lg_src_remote_21] filename=/dev/dsk/emcpower2c size=86% [public_lg_src_remote_22] filename=/dev/dsk/emcpower3c size=86% [public_lg_src_remote_23] filename=/dev/dsk/emcpower4c size=86% [public_lg_src_remote24] filename=/dev/dsk/emcpower0c size=86%

szaydel commented 5 years ago

I apologize, myself. I have been really busy over these last few days. I am not reproducing this issue on illumos, which means, I think the problem is specific to Solaris. If I have a little bit of time, I will try with Solaris 11.4. I think the clock affinity error and the crash you were seeing are two separate things, and it would be useful to figure out what is really causing that crash.