axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

Suggestions for improvement of precision of wall-clock time measurements #695

Open AndreyNevolin opened 5 years ago

AndreyNevolin commented 5 years ago

Hi,

I borrowed some ideas (but not the code) from fio for my own benchmarking tools. Namely, I was interested in measuring wall-clock time using time-stamp counters (or time-base registers, or whatever they're called on different architectures).

I'd like to give back some observations that, I believe, could improve the precision of the time measurements fio makes.

1) I believe fio currently measures time intervals of several seconds with a precision of several hundred nanoseconds (about the same as, and in many cases much worse than, clock_gettime(), although fio's overhead is indeed much lower).
2) The loss of precision in fio stems from the following facts:

Taking all of the above into account, it's possible to achieve a precision of tens of nanoseconds over second-long intervals. This is still not true nanosecond precision, but it is much better than several hundred nanoseconds per second.

One additional observation: fio's on-the-fly conversion of ticks to nanoseconds was designed to avoid integer division (which is indeed really costly). But one division is still used to store the calculated nanoseconds in timespec format:

tp->tv_sec = nsecs / 1000000000ULL;

Not sure whether anything can be done about that.
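For context, a common division-free way to turn ticks into nanoseconds is a precomputed multiplier and shift, the same idea as the Linux kernel's clocksource mult/shift pair. The sketch below assumes a calibrated ticks-per-second value is already known, uses hypothetical names (`tick_conv`, `tick_conv_init`, `ticks_to_ns`), and is not fio's actual code; the only division sits in the one-time setup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical conversion state: ns = (ticks * mult) >> shift. */
struct tick_conv {
	uint64_t mult;
	unsigned int shift;
};

/* One-time setup after calibration; the only division happens here. */
void tick_conv_init(struct tick_conv *c, uint64_t ticks_per_sec,
		    unsigned int shift)
{
	/* 1e9 < 2^30, so 1e9 << shift fits in 64 bits for shift <= 34. */
	assert(ticks_per_sec > 0 && shift <= 34);
	c->shift = shift;
	c->mult = (1000000000ULL << shift) / ticks_per_sec; /* truncated */
}

/*
 * Per-sample path: one multiply and one shift, no division.
 * The caller must keep ticks * mult below 2^64; a larger shift gives a
 * more precise mult but a smaller safe range of ticks.
 */
uint64_t ticks_to_ns(const struct tick_conv *c, uint64_t ticks)
{
	return (ticks * c->mult) >> c->shift;
}
```

With an assumed 3 GHz counter and shift = 24, mult truncates from 5592405.33 to 5592405, so 3,000,000,000 ticks (one second) convert to 999,999,940 ns and 4,500,000,000 ticks to 1,499,999,910 ns: about 60 ns low per second from the truncation of mult alone. For shift = 24, ticks * mult stays below 2^64 up to about 3.3e12 ticks (roughly 1,100 s at 3 GHz).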

I'm sorry for not proposing a patch. Unfortunately, I'm not a fio developer, or even a fio user. Also, I did all the precision evaluations with my own code; for the purposes of fio development, evaluations based on fio itself would be more trustworthy.

Thank you for the great tool! Although I don't use it myself, the performance engineers at my company use it very heavily.

I hope at least some of my observations will be useful.

sitsofe commented 5 years ago

Mentioning @vincentkfu for comment as he was the inventor/progenitor of the current fio timing work...

sitsofe commented 5 years ago

@AndreyNevolin it's interesting to know that (Dell) EMC use fio internally. I'm always curious as to which tools different people are using at their work - if you're allowed can you name any other performance tools used?

(I know Andrey already knows about this but I'm going to link to the original discussion over in https://github.com/axboe/fio/pull/387 for the sake of others who may land on this discussion)

vincentkfu commented 5 years ago

@AndreyNevolin, thank you for these suggestions.

The proposed change to get_cycles_per_msec() definitely seems like an improvement.

The fio nanosecond patches actually included a change from measuring ticks-per-usec to measuring ticks-per-msec for improved accuracy. So it makes sense that using ticks-per-sec would be even better.
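To see why a coarser calibration unit costs accuracy, here is a small arithmetic sketch. It assumes an idealized calibration that knows the exact counter rate and loses precision only by truncating it to an integer count per unit; it is not fio's get_cycles_per_msec(), and `ns_from_calibration` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert `ticks` to nanoseconds using a calibration
 * stored as a truncated integer count of ticks per `unit_ns` nanoseconds
 * (1000 = ticks-per-usec, 1000000 = ticks-per-msec,
 * 1000000000 = ticks-per-sec). The divisions are for clarity, not speed.
 * true_hz * unit_ns and ticks * unit_ns must both fit in 64 bits.
 */
uint64_t ns_from_calibration(uint64_t true_hz, uint64_t unit_ns,
			     uint64_t ticks)
{
	/* Idealized calibration: exact rate, fractional part dropped. */
	uint64_t ticks_per_unit = true_hz * unit_ns / 1000000000ULL;

	assert(ticks_per_unit > 0);
	return ticks * unit_ns / ticks_per_unit;
}
```

With an assumed counter rate of 2,999,999,999 Hz (chosen so that truncation drops almost a whole tick per unit), one second's worth of ticks converts to 1,000,333,444 ns with ticks-per-usec, 1,000,000,333 ns with ticks-per-msec and exactly 1,000,000,000 ns with ticks-per-sec: errors of roughly 333 µs, 333 ns and 0 ns per second. The ticks-per-msec case shows how truncation alone can produce errors of several hundred nanoseconds per second.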

If no one beats me to it, I will investigate implementing your suggestions in fio.

axboe commented 5 years ago

Vincent, please do, seems useful to me.

FWIW, I also agree that the division by 1000000000 is troublesome; it's a big cycle waster. I briefly looked into doing this more cleverly yesterday. For the seconds, I came up with this:

seconds = (nsecs * 2199) >> 41;

but I'm still missing a good way to get the remainder. Ideas welcome. The compiler is smart enough to fold the MOD and DIV into one operation, so we need to kill both to actually speed up this part.
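The constant can be checked by hand. Since 2^41 = 2,199,023,255,552, the factor 2199/2^41 is a slightly-too-small stand-in for 1/10^9, so the estimate never comes out high but can come out low, even at an exact second boundary:

```latex
\[
\mathit{seconds} = \left\lfloor \frac{2199 \cdot \mathit{nsecs}}{2^{41}} \right\rfloor,
\qquad
\frac{2^{41}}{2199} = \frac{2\,199\,023\,255\,552}{2199} \approx 1.0000106 \times 10^{9}
\]
\[
\mathit{nsecs} = 10^{9}:\quad
\left\lfloor \frac{2199 \cdot 10^{9}}{2^{41}} \right\rfloor
= \left\lfloor \frac{2\,199\,000\,000\,000}{2\,199\,023\,255\,552} \right\rfloor = 0
\]
```

So the remainder nsecs - 10^9 · seconds can be 10^9 or larger and needs a correction step. Because the estimate is low by a relative 1.06e-5, a single correction suffices while nsecs stays below roughly 9.4e13 (about 26 hours).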

axboe commented 5 years ago

Something like this would most likely be faster:

-               tp->tv_sec = nsecs / 1000000000ULL;
-               tp->tv_nsec = nsecs % 1000000000ULL;
+               secs = (nsecs * 0x897ULL) >> 41;
+               tp->tv_sec = secs;
+               tp->tv_nsec = nsecs - (secs * 1000000000ULL);
+               if (tp->tv_nsec >= 1000000000ULL) {
+                       tp->tv_sec++;
+                       tp->tv_nsec -= 1000000000ULL;
+               }
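Pulled out into a standalone function, the idea in the patch above might look like the sketch below. `ns_split` is a hypothetical name, and one change is deliberate: the single `if` becomes a `while`, because 2199/2^41 undershoots 1/10^9 by about 1.06e-5, so the estimate can be more than one second low once nsecs exceeds roughly 9.4e13 (about 26 hours). Inputs must also stay below about 8.39e15 ns (roughly 97 days) so that `nsecs * 0x897` fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper modelled on the patch above: split nsecs into
 * seconds and nanoseconds without '/' or '%'. 0x897 == 2199, and
 * 2199 / 2^41 is slightly below 1e-9, so the first estimate of secs is
 * never too high and secs * NSEC_PER_SEC never exceeds nsecs.
 */
void ns_split(uint64_t nsecs, struct timespec *tp)
{
	uint64_t secs, rem;

	/* nsecs * 0x897 must not overflow 64 bits. */
	assert(nsecs <= UINT64_MAX / 0x897ULL);

	secs = (nsecs * 0x897ULL) >> 41;
	rem = nsecs - secs * NSEC_PER_SEC;

	/*
	 * The patch uses a single 'if'; a loop also covers inputs where the
	 * estimate lags by more than one second (above roughly 9.4e13 ns).
	 */
	while (rem >= NSEC_PER_SEC) {
		secs++;
		rem -= NSEC_PER_SEC;
	}

	tp->tv_sec = (time_t)secs;
	tp->tv_nsec = (long)rem;
}
```

At nsecs = 10^9 the estimate is 0 and one correction step fixes it; at nsecs = 2 × 10^14 (about 55 hours) the estimate is 199,997 and three steps are needed, which a single `if` would miss.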
axboe commented 5 years ago

Looking at the generated code, there's actually no division in there. It's all shifts and multiplications, since it's a division by a constant. That said, I ran the above, and it does appear to be a little faster. It would be nice if others could test it too.
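Since the compiler already turns a constant division into multiplications and shifts, the same can be done by hand with a multiplier that is exact for every 64-bit input, which removes the correction step entirely. The sketch below is not from this thread: it applies the reciprocal-multiplication method from Granlund and Montgomery's "Division by Invariant Integers using Multiplication", relies on the GCC/Clang `unsigned __int128` extension, and `div_1e9`, `ns_split_exact` and `RECIP_1E9_M` are hypothetical names. Whether it is faster than the compiler's own sequence has not been measured here.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * 1e9 = 2^9 * 5^9. After shifting out 2^9 the dividend is below 2^55,
 * and with M = ceil(2^76 / 5^9) the product (nsecs >> 9) * M, shifted
 * right by 76, equals floor(nsecs / 1e9) for every 64-bit nsecs.
 */
static const uint64_t RECIP_1E9_M = 38685626227668134ULL;

uint64_t div_1e9(uint64_t nsecs)
{
	/* 128-bit product: below 2^55 * 2^56, so no overflow. */
	unsigned __int128 prod = (unsigned __int128)(nsecs >> 9) * RECIP_1E9_M;

	return (uint64_t)(prod >> 76);
}

/* Split without '/' or '%' and without any correction step. */
void ns_split_exact(uint64_t nsecs, struct timespec *tp)
{
	uint64_t secs = div_1e9(nsecs);

	assert(secs * 1000000000ULL <= nsecs);
	tp->tv_sec = (time_t)secs;
	tp->tv_nsec = (long)(nsecs - secs * 1000000000ULL);
}
```

The constant follows from ceil(2^76 / 5^9) = ceil(2^85 / 10^9) = 38685626227668134, and M · 5^9 - 2^76 = 799,614, which is below 2^21; that bound is the condition under which the multiply-and-shift quotient is exact for all dividends below 2^55.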

AndreyNevolin commented 5 years ago

> @AndreyNevolin it's interesting to know that (Dell) EMC use fio internally. I'm always curious as to which tools different people are using at their work - if you're allowed can you name any other performance tools used?

@sitsofe I'm not with DellEMC anymore. Sorry for the confusion; I need to fix my profile. I currently work for a small company that has its own server platform and also produces storage systems based on that platform. fio is indeed used widely in this company for production testing. But almost all big companies use home-grown tools for production testing. There are quite a few reasons for that. One of the biggest is reproducibility; a lot of effort is put into that. Also, every big company has its own performance evaluation methodology, and the tools are closely tied to that methodology.

But R&D departments in big companies DO use open source tools widely and rarely use home-grown tools. fio is a popular choice for block workloads. IOR and Mdtest are popular choices for distributed file systems. Mongoose (https://github.com/emc-mongoose/mongoose) is a good choice for object stores and conventional NAS storage.

sitsofe commented 5 years ago

@AndreyNevolin Thanks for letting me know! I hope to write a tool survey one day, so I'll stash your comments away (for example, mongoose is new to me).