axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

clat_ns showing way higher values for the 9s (high percentiles) when moving from ver 3.14 to ver 3.19 #955

Open akoundal opened 4 years ago

akoundal commented 4 years ago

While running a given workload (the example below is just one case), I am seeing much higher clat_ns values. The system, drive, and OS all remain the same; the only delta is the fio version.

Example: fio v3.5 results (screenshot attached)

fio v3.19 / v3.15 results (screenshot attached)

This is the input file

[global]
name=test
ioengine=sync
log_hist_msec=1000
log_hist_coarseness=0
direct=1
thread=1
buffered=0
allow_mounted_write=1
size=100%
randrepeat=0
time_based
ramp_time=300
norandommap
refill_buffers
group_reporting
percentile_list=1.0:25.0:50.0:75.0:90.0:99.0:99.9:99.99:99.999:99.9999:99.99999:99.999999:100.0
filename=/dev/nvme0n1

[rd_rnd_qd_1_4k_1w]
stonewall
bs=4k
iodepth=1
numjobs=1
rw=randread
runtime=3300
write_hist_log=rd_rnd_qd_1_4k_1w_clat_hist.log

My run command: fio --output-format=json+ --output rd_rnd_qd_1_4k_1w_summary.log (plus the job file above)
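
For completeness, the full invocation is along these lines (the job file name here is only illustrative, assuming the config above is saved to a file):

fio --output-format=json+ --output rd_rnd_qd_1_4k_1w_summary.log rd_rnd_qd_1_4k_1w.fio    # job file name is a placeholder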

Running on a Linux machine: CentOS 8, kernel 5.5-4-1.

Let me know if I can provide any more information or help with testing out patches.

I tested with v3.10, v3.12, and v3.14, and they do not show the higher latency.

sitsofe commented 4 years ago

@akoundal do you think you can bisect this down to a commit?

akoundal commented 4 years ago

@sitsofe can you tell me the process for bisecting down to a commit? I have not done it before but can surely help.

sitsofe commented 4 years ago

@akoundal Re git bisection: see https://github.com/axboe/fio/issues/587#issuecomment-383567687 and below (also see https://github.com/axboe/fio/issues/805#issuecomment-514135835 and below).
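
In short, the flow looks roughly like this (a sketch; fio releases are tagged fio-X.Y, and the "run your workload" step is whatever you use to judge the latencies):

git clone https://github.com/axboe/fio.git
cd fio
git bisect start
git bisect bad fio-3.15      # a version that shows the problem
git bisect good fio-3.14     # a version that does not
# git now checks out a commit in between; at each step:
make clean && make
# ...run your workload with the freshly built ./fio and check clat_ns...
git bisect good              # or "git bisect bad", depending on the result
# repeat until git names the first bad commit, then clean up with:
git bisect reset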

akoundal commented 4 years ago

@sitsofe I have started the bisect between fio-3.14 and fio-3.15 (screenshot attached).

I will keep adding the results here.

akoundal commented 4 years ago

The first bisect step (HEAD) is coming up as good; moving to the next one.

akoundal commented 4 years ago

@sitsofe here you go (screenshot attached).

Here is the log (screenshot attached).

sitsofe commented 4 years ago

@akoundal Hmm I don't think the issue is commit 01bf5128 - that's only changing a version number. One tip is to always do a make clean whenever you are testing a new commit to rule out the build getting confused. Could you try again?
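
i.e. for each commit under test, something like:

make clean && make
./fio --version    # for a git build this should print a git-describe style version matching the commit under test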

sitsofe commented 4 years ago

I've looked at the commits between 3.14 and 3.15 and nothing really stands out...

One thing to check - does the configure run of 3.15 say it is linking against/using tcmalloc?
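
(On builds I've looked at, the configure summary prints a tcmalloc line, so something along these lines should show it; adjust if your output differs:)

./configure | grep -i tcmalloc
grep -i tcmalloc config-host.mak    # config-host.mak is the file configure writes its results to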

akoundal commented 4 years ago

@sitsofe 3.15 does not report tcmalloc support as yes in the ./configure output (screenshot attached).

akoundal commented 4 years ago

@sitsofe Trying the bisect again and will update if something changes. I suspect a change in the sync engine may be the cause, since this issue hits the QD1 case, which uses ioengine=sync.

akoundal commented 4 years ago

@sitsofe It resulted in the same data after rerunning the complete bisect. I am now compiling 3.15 again and running the test to see if this can be reproduced, as it was consistently reproducible with both the 3.15 and 3.19 versions.

akoundal commented 4 years ago

Ran again with 3.15 and am now able to generate the same results. Kicked off a longer run and will update with the results by tomorrow.

akoundal commented 4 years ago

@sitsofe It looks like I ran the complete set with 3.19 and I can replicate the higher latency. It looks like I may have to do the full run with 3.15, or do the git bisect with 3.14 vs 3.19.

sitsofe commented 4 years ago

@akoundal that's fine. Even in the worst case, the new bisect won't take much more than 9 additional steps over the old one (9 steps are enough to decide between another 512 commits, since 2^9 = 512).

akoundal commented 4 years ago

@sitsofe Yes, already started; I should have some data by end of day. Each of my runs is 1 hour, so it is taking time.

akoundal commented 4 years ago

@sitsofe Unfortunately I can't reproduce the issue with the 3.14 -> 3.19 git bisect. I have tested all 9 bisect steps and ensured that the fio logs were generated by the build under test (checked via the version header). I don't understand why the same issue is seen consistently when I run the complete workset with 3.19, yet the bisect doesn't reproduce it :(.

akoundal commented 4 years ago

@sitsofe I ran the complete test with 3.15 and the data is coming out good, so 3.19 is definitely broken for the runs the test is doing. Let me know if you have any thoughts; otherwise my only hope is to test again with the 3.20 version.

axboe commented 4 years ago

If 3.15 is known (for a fact) good and 3.19 is known broken, then a bisect will show the issue. There's no point in waiting for 3.20, we absolutely need a good bisect run to pin-point where the problem was introduced and then offer a fix for that. If we have that, then there's a very high likelihood that 3.20 will indeed fix it. Without that, the chance is basically zero...

So if at all possible, I'd encourage you to re-do the 3.15..3.19 bisect run. It's the only way we'll get to the bottom of this.

akoundal commented 4 years ago

@axboe I will try it again and ensure that I follow the same steps. Let me make a fresh clone of the repo and go from there; I hope this gives something better.

axboe commented 4 years ago

Also make sure you do a make clean before the make each time, just in case there's something wonky there!

akoundal commented 4 years ago

@axboe Yes, I am doing that: make clean, make, make install, then run the test.

sitsofe commented 4 years ago
[...]
make clean
make

If you can arrange for the test to use the fio binary from the directory you just built it in, you can skip the make install step.
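
e.g. run the binary straight from the build tree (a sketch; paths and the job file name are placeholders):

cd /path/to/fio            # the git checkout you build in
make clean && make
./fio --output-format=json+ --output rd_rnd_qd_1_4k_1w_summary.log /path/to/jobfile.fio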

axboe commented 4 years ago

Probably safer to install to avoid the added hurdle of maybe then not running what you just built...

akoundal commented 4 years ago

@sitsofe @axboe I did all 9 git bisect steps again and still couldn't repeat it. I will run fio 3.18 with my longer run and see if that is also broken; it will take a day to complete.

Also, I am making sure that the fio log files match the commit checked out for the bisect, to ensure that what I built is what is being run for the test.

axboe commented 4 years ago

If the bisect isn't reliable, your test case likely isn't 100% accurate. If you get false positives for "good" (e.g. you have a run that has the "right" latencies, but if you did 5 runs on that version, maybe that wouldn't hold), then the bisection isn't going to be reliable at all. It's absolutely key that you can say with 100% accuracy whether a certain version is good or not, and it doesn't sound like that's the case?

You can try the poor man's version and just try individual releases, but even for that you need a level of confidence in your ability to mark a given version good or bad.

Do keep testing though, hope we'll get there!

akoundal commented 4 years ago

@axboe
3.5 has more than 50 runs of the same test (1 day each) with consistent data, which gives the highest confidence that the test itself is not the issue. 3.19: multiple back-to-back runs give the same bad data. 3.14: back-to-back runs give good data. 3.15: back-to-back runs give good data.

Hence I am going the poor man's way to pin down the last version that is definitely good, i.e. starting with 3.18 now.
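
Roughly this, one full 1-day run per tag (sketch):

for tag in fio-3.18 fio-3.17 fio-3.16; do
    git checkout "$tag"
    make clean && make
    # ...run the 1-day workload with this ./fio and record whether clat_ns looks good or bad...
done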

sitsofe commented 4 years ago

Looking back at the screenshots attached, there's something else of note besides the percentiles:

fio-3.15
"clat_ns": {
    "min" : 48556,
    "max" : 1912808,
    "mean", " 77815.674575,
    "stddev" : 24540.472938,

fio-3.19
"clat_ns": {
    "min" : 12299,
    "max" : 27050538,
    "mean", " 79695.065751,
    "stddev" : 140594.520868,

It's curious that fio-3.19 has a much lower minimum as well as a higher maximum (which obviously makes the standard deviation much worse)...
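
(If it helps when comparing runs, and assuming jq is installed, the relevant numbers can be pulled straight out of the json+ summary, e.g.:)

jq '.jobs[].read.clat_ns | {min, max, mean, stddev}' rd_rnd_qd_1_4k_1w_summary.log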

sitsofe commented 4 years ago

@akoundal Can you ensure you set aside a 3.15 and a 3.19 build (so you can switch between them at will). Then can you also set a fixed seed value (e.g. --seed=1 --allrandrepeat=1) so things remain deterministic when you compare 3.19 to 3.15?
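
For example, something along these lines (paths are placeholders; randseed/allrandrepeat are the job option names, and job options can also be given on the command line):

/path/to/fio-3.15-build/fio --randseed=1 --allrandrepeat=1 --output-format=json+ --output v3.15.json /path/to/jobfile.fio
/path/to/fio-3.19-build/fio --randseed=1 --allrandrepeat=1 --output-format=json+ --output v3.19.json /path/to/jobfile.fio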

akoundal commented 4 years ago

@sitsofe @axboe Confirmed that v3.18 shows the same issue as v3.19; it shows the same high standard deviation. v3.18 clat_ns (screenshot attached).

akoundal commented 4 years ago

@sitsofe Yes, I can set it aside; it is on my dev machine currently. @axboe I have started the run with v3.17, which will be followed by v3.16; then we are done with the poor man's way of confirming which was the first version to show the issue. Hopefully we will have the answer by Sunday :)

akoundal commented 4 years ago

@sitsofe @axboe 3.17 shows the same latency issue. I am starting the 3.16 one-day run and will report back the numbers. Once we have that, we will know for sure the first version that shows the issue.

akoundal commented 4 years ago

@sitsofe @axboe I have completed the longer runs and confirmed that v3.16 works as expected. So, as per my week of effort and runs, I can say that v3.17 is the first version that shows the unexpected latency numbers. I will try to do the git bisect on v3.16 vs v3.17.

sitsofe commented 4 years ago

@akoundal I've browsed the commit titles between 3.16 and 3.17 and nothing immediately stands out... Let's see what your git bisect narrows the problem commit to!

akoundal commented 4 years ago

@sitsofe Did the bisect again, and unfortunately it doesn't give me anything with the quick test runs :(. I agree the commit titles don't show anything that might be causing it, but it is certain that 3.17, 3.18, and 3.19 show the same issue consistently at my end.

akoundal commented 4 years ago

@sitsofe what else is required from my side?

akoundal commented 4 years ago

@sitsofe I am also seeing that using the json+ output format together with histogram logging results in zero values being dumped into the histogram log files. Could this be a sign of something going wrong in the QoS reporting, since it uses the I/O count in each bin for reporting QoS?
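
(For reference, a quick way to count all-zero histogram samples, assuming the usual hist log layout of "msec, direction, block size" followed by the per-bin counts; the file glob is a placeholder for whatever fio named the log:)

awk -F',' '{ s = 0; for (i = 4; i <= NF; i++) s += $i; if (s == 0) z++ } END { print z + 0, "all-zero samples out of", NR }' *_clat_hist*.log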

sitsofe commented 4 years ago

@akoundal We would need either a successful bisection that points to a single commit, or steps that others can run to quickly and reliably reproduce the issue without a huge setup (e.g. something that makes the issue happen in a minute even on boring slow hard disks).

swetharajendiran commented 4 years ago

@sitsofe - Any update on this issue, based on the debugging done in #974?

sitsofe commented 4 years ago

@swetharajendiran did you see https://github.com/axboe/fio/issues/974#issuecomment-627208905 - did checking out the latest fio and reverting just that commit (or checking out that commit and reverting it) definitely solve your issue (i.e. with it you see the issue and without you don't)? Eyeballing, it seems unlikely...
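
i.e. roughly (the sha is a placeholder for the commit being discussed in #974):

git checkout master
git revert --no-edit <suspect-sha>
make clean && make
# ...rerun the workload and compare clat_ns with and without the revert...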

sitsofe commented 4 years ago

@swetharajendiran ping?

sitsofe commented 4 years ago

@akoundal Something that has occurred to me regarding the higher values: the histogram values could be corrupt before fio 3.17 (see #827 for the bug report; the problem was introduced by https://github.com/axboe/fio/commit/6cc0e5aa9eddd2487dfa9ac80be1264151058409 in fio-3.6 and was fixed by https://github.com/axboe/fio/commit/1fb9250b81aa7c655cb62ab26647a6ad54a279d5 in fio-3.17). The corruption was based on unwritten data, so the impact would appear random. Maybe if the corruption was favourable then the values would have looked abnormally low...
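
(As an aside, whether a given checkout already contains that fix can be checked with something like:)

# exits 0 and prints "fix present" if the fix commit is an ancestor of the current checkout
git merge-base --is-ancestor 1fb9250b81aa7c655cb62ab26647a6ad54a279d5 HEAD && echo "fix present" || echo "fix missing"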

akoundal commented 4 years ago

@sitsofe Yes, it is certain that it is before 3.17, as I saw that it was busted from 3.14 onwards for the runs I was doing. Unfortunately, even if that fix is already in, I am still able to hit/see the issue in the newer releases. @swetharajendiran will be updating her findings soon, and we should have some more data for your analysis. We really want to move to the new version, but this is blocking us from moving and taking advantage of the other goodness the new fio version offers.

swetharajendiran commented 4 years ago

@sitsofe - It was pretty evident with that particular commit, since I did 3 back-to-back runs and still hit the issue. I can do a few more runs and will post the data in a couple of days.

swetharajendiran commented 4 years ago

@sitsofe I tried to run the same workload again with the suspected bad commit and with the good version. I am not able to reproduce the issue just by switching to the suspected bad commit.