Open lic34 opened 5 years ago
@lic34 I'm afraid we would need at least the full backtrace to be able to take this one further. If it's somewhat deterministic then we might be able to turn debug=io on too for some more clues...
Get a backtrace of it (I don't have a solaris system so the core itself wouldn't be too useful). Something like
gdb --core=path/to/core path/to/fio/binary
thread apply all bt
quit
And copy pasting the backtrace here would do (note those commands are a rough guess).
The backtrace is as following:(Moreover, I'm using FIO version with 2.0.14)
(gdb) bt
@lic34:
Moreover, I'm using FIO version with 2.0.14
Ah that is important! I'm afraid we (upstream fio) don't support any version of fio other than the latest (at this stage your version of fio is over six years old). I would strongly suggest seeing if the problem happens in newer version of fio (see https://github.com/axboe/fio/releases for which release we're up to now).
I got FIO with version 2.0.14 from this page [https://www.opencsw.org/packages/fio/] for my Solaris 11.4 host, but you recommend me to use the latest version. So, what's I'm concerned is that, does the FIO with the latest version support Solaris 11.4?
@lic34 Theoretically the latest one if you apply the fixes mentioned in #714 ...
@lic34, I have not had issues building recent checkouts from master on Illumos, and while that does not guarantee anything, I expect that it should build on Solaris as well. If it does not, perhaps it is worth logging an issue with details from failure and I or someone else may be able to help.
@sitsofe @szaydel thank you for your help~!!
I tried to use the latest version(3.14) on my Solaris host, and I hit the same issue as #714,
I changed the code as @sitsofe's suggestion:
changing
#define os_ctime_r(x, y, z) ctime_r((x), (y), (z))
to
#define os_ctime_r(x, y, z) ctime_r((x), (y))
It installs successfully.
However, when I run the FIO process, it reported an error as
clock setaffinity failed: Invalid argument
@lic34 Is there any more context to go along with that? I'm going to guess it's something to do with https://github.com/axboe/fio/blob/a0318e176ace1e53b0049965f63b0c60b55ff8cd/gettime.c#L588 so by extension https://github.com/axboe/fio/blob/de5ed0e4d398bc9d4576f9b2b82d7686989c27e1/os/os-solaris.h#L108 but we'd have to know which argument was bad and what the correct thing looks like...
It seems the FIO keep running after the error throwed out, however there is only one process(job) was running, even there are more than two Luns in the profile.
root@ncvm9084105:/opt/csw/bin/fio-log# fio --output /opt/csw/bin/fio-log/test.log /opt/csw/bin/fio-log/test.fio clock setaffinity failed: Invalid argument Jobs: 1 (f=1): [X(1),m(1)][7.4%][r=16.8MiB/s,w=16.7MiB/s][r=4305,w=4280 IOPS][eta 43m:22s]
@lic34 Hmm... looking through https://docs.oracle.com/cd/E88353_01/html/E37841/pset-bind-2.html I wonder if fio is trying to bind to a cpu that your system has disallowed (I'm guessing that call is returning EINVAL
). Could you simplify your job file further (i.e. remove as much as possible) so we have the bare minimum that still makes the current failing job issue happen?
(I'm likely busy for the rest of the day but hopefully others will chip in if they can)
(Mentioning @alanc as a long shot)
I simplified the profile likes:
_ [global] ioengine=solarisaio iodepth=16 direct=0 bs_unaligned=0 time_based=1 rwmixwrite=50 rwmixread=50 do_verify=1 bsrange=4k-4k refill_buffers=0 thread runtime=2808 fill_device=1 numjobs=1 readwrite=randrw [public_lg_src_remote20] filename=/dev/dsk/emcpower1c size=86%
However, it seems it has the same issue:
clock setaffinity failed: Invalid argument Jobs: 1 (f=1): [m(1)][0.2%][r=12.9MiB/s,w=12.9MiB/s][r=3293,w=3296 IOPS][eta 46m:43s]
@lic34 if you keep both jobs and remove options like direct, fill_device, refill_buffers, bs_unaligned, thread, numjobs (basically as many options as possible) etc. what's the smallest job that results in the premature death of one of the threads?
I removed all options as following:
_[global] ioengine=solarisaio [public_lg_src_remote_20] filename=/dev/dsk/emcpower1c size=86% [public_lg_src_remote_21] filename=/dev/dsk/emcpower2c size=86% [public_lg_src_remote_22] filename=/dev/dsk/emcpower3c size=86% [public_lg_src_remote_23] size=86% filename=/dev/dsk/emcpower4c size=86% [public_lg_src_remote24] filename=/dev/dsk/emcpower0c size=86%
But the error is still here:
_# fio --output /opt/csw/bin/fio-log/id_1562061394_912506_4191_25.log /opt/csw/bin/fio-log/id_1562061394_107510_419125.fio clock setaffinity failed: Invalid argument Jobs: 5 (f=5): [R(5)][3.0%][r=50.9MiB/s][r=13.0k IOPS][eta 01h:12m:51s]
I recall seeing something to this end. I will see if it happens on illumos with multiple cores, where I think I observed it also, and if I have some spare time I will see if I can get Solaris going also.
Thanks @szaydel for your support! Moreover, I find another issue when I was using the FIO on my Solaris 11.4, please refer to #798 for more details.
Segfault is not something I expected to see, but maybe it is somehow related to this issue. I will try to mess with it this weekend.
What happens if you change to say posixaio
, and still use threads?
Sorry for my late reply.
I change the IO engine from solarisaio to posixaio and add thread, it seems the fio is working property after throwing an error as "clock setaffinity failed: Invalid argument".
The output when the process begins run:
root@ncvm9084105:/opt/csw/bin/fio-log# fio --output debug.log debug.fio clock setaffinity failed: Invalid argument Jobs: 5 (f=5): [m(5)][24.4%][r=6782KiB/s,w=6898KiB/s][r=1695,w=1724 IOPS][eta 35m:22s]
FIO parameters:
_root@ncvm9084105:/opt/csw/bin/fio-log# cat debug.fio [global] ioengine=posixaio iodepth=16 direct=0 thread bs_unaligned=0 time_based=1 rwmixwrite=50 rwmixread=50 do_verify=1 bsrange=4k-4k refill_buffers=0 runtime=2808 fill_device=1 numjobs=1 readwrite=randrw [public_lg_src_remote_20] filename=/dev/dsk/emcpower1c size=86% [public_lg_src_remote_21] filename=/dev/dsk/emcpower2c size=86% [public_lg_src_remote_22] filename=/dev/dsk/emcpower3c size=86% [public_lg_src_remote_23] filename=/dev/dsk/emcpower4c size=86% [public_lg_src_remote24] filename=/dev/dsk/emcpower0c size=86%
I apologize, myself. I have been really busy over these last few days. I am not reproducing this issue on illumos, which means, I think the problem is specific to Solaris. If I have a little bit of time, I will try with Solaris 11.4. I think the clock affinity error and the crash you were seeing are two separate things, and it would be useful to figure out what is really causing that crash.
FIO profile
[global] ioengine=solarisaio iodepth=16 direct=1 bs_unaligned=0 time_based=1 rwmixwrite=50 rwmixread=50 do_verify=1 bsrange=4k-4k refill_buffers=0 thread runtime=2808 fill_device=1 numjobs=1 readwrite=randrw [public_lg_src_remote_23] filename=/dev/dsk/emcpower4c size=86% [public_lg_src_remote_24] filename=/dev/dsk/emcpower0c size=86%
Error:
Assertion failed: io_u->flags & IO_U_F_FLIGHT, file io_u.c, line 1374, function io_completed Abort (core dumped)
However, when it performs IO to one LUN, it works properly.