akoundal opened this issue 4 years ago
@akoundal do you think you can bisect this down to a commit?
@sitsofe can you tell me the process for bisecting a commit? I have not done it before but can surely help.
@akoundal Re git bisection: see https://github.com/axboe/fio/issues/587#issuecomment-383567687 and below (also see https://github.com/axboe/fio/issues/805#issuecomment-514135835 and below).
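For reference, the bisect mechanics look like this. This is a self-contained sketch on a throwaway repo (commit names and the failing condition are made up); the real run would happen inside the fio clone, marking each checked-out commit good or bad after rebuilding and re-running the workload:

```shell
# Demo of git bisect on a throwaway repo. In the real fio tree you would run
# `git bisect start; git bisect bad fio-3.15; git bisect good fio-3.14` and
# decide good/bad per step from the test results instead of `bisect run`.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
for i in 1 2 3 4 5; do
    echo "$i" > state.txt                 # pretend commit 4 introduced the bug
    git add state.txt
    git commit -q -m "commit $i"
done
git bisect start
git bisect bad HEAD                       # newest commit is known bad
git bisect good HEAD~4                    # oldest commit is known good
# Automate the decision: exit status 0 marks a commit good, non-zero bad.
git bisect run sh -c 'test "$(cat state.txt)" -lt 4' > /dev/null
git show -s --format=%s refs/bisect/bad   # prints the first bad commit's subject
```

With a known-good and known-bad endpoint, bisect needs only about log2(N) build-and-test steps to isolate the first bad commit.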
@sitsofe I have started the bisect between fio 3.14 and fio 3.15.
I will keep adding the results here.
The first one (identified by its fio version header) is coming back good; moving to the next one.
@sitsofe here you go :
here is the log
@akoundal Hmm I don't think the issue is commit 01bf5128 - that's only changing a version number. One tip is to always do a make clean whenever you are testing a new commit to rule out the build getting confused. Could you try again?
I've looked at the commits between 3.14 and 3.15 and nothing really stands out...
One thing to check - does the configure run of 3.15 say it is linking against/using tcmalloc?
@sitsofe 3.15 doesn't report tcmalloc support as yes from ./configure
@sitsofe Trying the bisect again and will update if something changes. I feel the change to the fio sync engine may be the one, as this issue hits at QD1, where ioengine=sync is used.
@sitsofe It resulted in the same data after a rerun of the complete bisect. I am now again compiling 3.15 and running the test to see if this can be reproduced, as it was consistently reproducible with both the 3.15 and 3.19 versions.
Ran again with 3.15 and am now able to generate the same results. Kicked off a longer run and will update the results by tomorrow.
@sitsofe looks like I ran the complete set with 3.19 and I can replicate the higher latency. Looks like I may have to run the full set with 3.15, or do the git bisect with 3.14 vs 3.19.
@akoundal that's fine. Even in the worst case, that new bisect won't take more than an additional 9 steps (which would choose between another 512 commits) over the old one.
@sitsofe yes, already started; should have some data by end of day. Each of my runs is 1hr, so it's taking time.
@sitsofe unfortunately I can't reproduce the issue with the 3.14 -> 3.19 git bisect. I have tested all 9 versions and ensured that the fio logs were generated with the matching header version. I am confused as to why the same issue is seen consistently when I run the complete workset with 3.19, yet the bisect doesn't reproduce it :(.
@sitsofe I ran the complete test with 3.15 and the data is coming out good, so 3.19 is definitely broken for the runs the test is doing. Let me know if you have any thoughts; otherwise my only hope is to test again with the 3.20 version.
If 3.15 is known (for a fact) good and 3.19 is known broken, then a bisect will show the issue. There's no point in waiting for 3.20, we absolutely need a good bisect run to pin-point where the problem was introduced and then offer a fix for that. If we have that, then there's a very high likelihood that 3.20 will indeed fix it. Without that, the chance is basically zero...
So if at all possible, I'd encourage you to re-do the 3.15..3.19 bisect run. It's the only way we'll get to the bottom of this.
@axboe I will try it again and ensure that I follow the same steps. Let me make a fresh repo and go from there; I hope this gives something better.
Also make sure you do a make clean before the make each time, just in case there's something wonky there!
@axboe yes, I am doing that: make clean, make, make install, run test.
[...]
```
make clean
make
```

If you can arrange for the test to use the fio binary from the directory you just built it in, you can skip the make install step.
Probably safer to install to avoid the added hurdle of maybe then not running what you just built...
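The per-step loop being described can be sketched as below. This is demonstrated against a stand-in Makefile so it runs anywhere; in the real fio checkout the targets build the ./fio binary, and running `./fio --version` afterwards confirms the binary matches the commit you just checked out:

```shell
# Per-bisect-step build loop; the Makefile here is a stand-in for fio's.
set -e
work=$(mktemp -d)
cd "$work"
printf 'all:\n\t@echo building\n\t@touch fio.stamp\nclean:\n\t@rm -f fio.stamp\n' > Makefile
make clean              # start from scratch so stale objects can't leak in
make                    # rebuild everything for the commit under test
test -f fio.stamp && echo "build ok"   # in the real tree: ./fio --version
```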
@sitsofe @axboe I did all 9 git bisect steps again and still couldn't repeat it. I will run fio 3.18 and see if that is also broken with my longer run; it will take a day to complete.
Also, I am making sure that the fio log files match the commit being tested in the bisect, to ensure that what I built is what is being run for the test.
If the bisect isn't reliable, your test case likely isn't 100% accurate. If you get false positives for "good" (eg you have a run that has the "right" latencies, but if you did 5 runs on that version, maybe that wouldn't be the case), then the bisection isn't going to be reliable at all. It's absolutely key that you can say with 100% accuracy whether a certain version is good or not, and it doesn't sound like that's the case?
You can try the poor mans version and just try individual releases, but even for that you need a level of confidence in your ability to mark a given version good or bad.
Do keep testing though, hope we'll get there!
@axboe
- 3.5: more than 50 runs of the same test (1 day each) with consistent data => highest confidence that the test is not the issue.
- 3.19: multiple back-to-back runs give the same bad data.
- 3.14: back-to-back runs give good data.
- 3.15: back-to-back runs give good data.

Hence I am going the poor man's way to find the last version that is surely good, i.e. starting with 3.18 now.
Looking back at the screenshots attached, there's something else going on along with the percentiles:

fio-3.15:

```
"clat_ns": {
    "min" : 48556,
    "max" : 1912808,
    "mean" : 77815.674575,
    "stddev" : 24540.472938,
```

fio-3.19:

```
"clat_ns": {
    "min" : 12299,
    "max" : 27050538,
    "mean" : 79695.065751,
    "stddev" : 140594.520868,
```
It's curious that fio-3.19 has a much lower minimum as well as a higher maximum (which obviously makes the standard deviation much worse)...
@akoundal Can you ensure you set aside a 3.15 and a 3.19 build (so you can switch between them at will)? Then can you also set a fixed seed value (e.g. --seed=1 --allrandrepeat=1) so things remain deterministic when you compare 3.19 to 3.15?
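A sketch of such a pinned-seed job file (the section name, device path, and runtime are placeholders; the QD1 sync parameters mirror the workload described in this thread, and `randseed` is the job-file spelling of the seed option):

```ini
; Hypothetical deterministic QD1 job: randseed + allrandrepeat make the
; random offset sequence identical across the fio builds being compared.
[global]
ioengine=sync
rw=randread
bs=4k
iodepth=1
randseed=1
allrandrepeat=1
runtime=60
time_based

[qd1-randread]
; replace with the device under test
filename=/dev/nvme0n1
```

Running each set-aside build against this same file (with `--output-format=json+`) should then produce directly comparable clat_ns sections.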
@sitsofe @axboe confirmed that v3.18 shows the same issue as v3.19; it shows the same high standard deviation in the v3.18 clat_ns.
@sitsofe yes, I can set it aside; it's on my dev machine currently. @axboe I have started the run with v3.17, which will be followed by v3.16; then we are done with the poor man's way of confirming which version first shows the issue. Hopefully we will have the answer by Sunday :)
@sitsofe @axboe 3.17 shows the same latency issue. I am starting the 3.16 1-day run and will report back the numbers. Once we have that, we will know for sure the first version that shows the issue.
@sitsofe @axboe I have completed the longer runs and confirmed that v3.16 works as expected. So per my 1 week of effort and runs, I can say that v3.17 is the culprit version that shows the unexpected latency numbers. I will try to do the git bisect on v3.16 vs v3.17.
@akoundal I've browsed the commit titles between 3.16 and 3.17 and nothing immediately stands out... Let's see what your git bisect narrows the problem commit to!
@sitsofe Did the bisect again and unfortunately it doesn't give me anything for the quick test runs :(. I agree the commits don't show anything that may be causing it, but it is certain that 3.17, 3.18, and 3.19 show the same issue consistently at my end.
@sitsofe what else is required from my side?
@sitsofe I am also seeing that using the json+ output format with log_histogram results in zero values being dumped into the log_histogram files. Could this be the reason something is going weird in QoS reporting, since it uses the IOPS in each bin for reporting QoS?
@akoundal We would need either a successful bisection which pointed to a single commit or steps that others can run that quickly and reliably reproduce the issue without a huge setup (e.g. makes the issue happen in a minute on even boring slow hard disks).
@sitsofe - any update on this issue, based on the debugging done in #974?
@swetharajendiran did you see https://github.com/axboe/fio/issues/974#issuecomment-627208905 - did checking out the latest fio and reverting just that commit (or checking out that commit and reverting it) definitely solve your issue (i.e. with it you see the issue and without you don't)? Eyeballing, it seems unlikely...
@swetharajendiran ping?
@akoundal Something that has occurred to me regarding the higher values is that the histogram values could be corrupt before fio 3.17 (see #827 for the bug report; the problem was introduced by https://github.com/axboe/fio/commit/6cc0e5aa9eddd2487dfa9ac80be1264151058409 in fio-3.6 and was fixed by https://github.com/axboe/fio/commit/1fb9250b81aa7c655cb62ab26647a6ad54a279d5 in fio 3.17). The corruption was based on unwritten data, so the impact would appear to be random. Maybe if the corruption was favourable then values would have looked abnormally low...
@sitsofe yes, it is for sure before 3.17, as I saw that from 3.14 onwards it was busted for the runs I was doing. Unfortunately, even if we say that the fix is already there, I am still able to hit the issue in the newer releases. @swetharajendiran will be updating her story soon and we should have some more data for your analysis. We really want to move to the new version, but this is blocking us from moving and utilizing the other goodness the new fio version offers.
@sitsofe - It was pretty evident with that particular commit, since I did 3 back-to-back runs and still hit the issue. I can do a few more runs and will post the data in a couple of days.
@sitsofe I tried to run the same workload again with the suspected bad commit and the good version. I am not able to reproduce the issue by just switching to the suspected bad commit.
While running a given workload (this is just an example), I am seeing much higher clat_ns values. The system, drive, and OS all remain the same; the only delta is the fio version.
e.g. FIO ver3.5
FIO ver3.19 / 3.15
This is the input file
my run command : fio --output-format=json+ --output rd_rnd_qd_1_4k_1w_summary.log
Running on a Linux machine: CentOS 8, kernel 5.5-4-1.
Let me know if I can provide any information and help in testing out some of the patches.
I tested with v3.10, v3.12, v3.14 and they do not show the higher latency.