axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

Same timestamps for multiple entries in 'latency logs' #739

Closed nasirkamal closed 5 years ago

nasirkamal commented 5 years ago

While working with latency logs, I observed multiple entries with the same timestamp. The number of entries sharing a timestamp varies.

309725, 3612298256, 1, 0
310587, 3538155516, 1, 0
311182, 835696049, 1, 0
312291, 593198713, 1, 0
314225, 1116835033, 1, 0
314225, 1902415217, 1, 0
315700, 1903528133, 1, 0
317411, 1421335648, 1, 0
317411, 1676809984, 1, 0
318621, 1578017461, 1, 0
320116, 1200123359, 1, 0
320116, 1462259580, 1, 0
327024, 1653066442, 1, 0
327024, 6876324255, 1, 0
327024, 6877526343, 1, 0
327024, 6878650021, 1, 0
327024, 6879727089, 1, 0
327024, 6880711130, 1, 0
327024, 6881841308, 1, 0
328395, 6681780841, 1, 0
331114, 1391743820, 1, 0
331114, 2687151087, 1, 0
331114, 2688291232, 1, 0
332402, 2656141389, 1, 0
334096, 1245083946, 1, 0
334096, 1656856163, 1, 0
335398, 1661015181, 1, 0
336640, 1246076285, 1, 0
338329, 1236925346, 1, 0
338329, 1648589305, 1, 0
340386, 1680996867, 1, 0
340386, 2022739236, 1, 0
341779, 1956474643, 1, 0
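
One way to quantify this is to count how many samples share each timestamp. Below is a minimal Python sketch; the file name client0_lat.1.log and the column order (time in msec, latency value, direction, block size) are assumptions based on the excerpt above, not something the reporter posted:

from collections import Counter

# Count how many samples in a per-I/O latency log share each millisecond timestamp.
counts = Counter()
with open("client0_lat.1.log") as f:
    for line in f:
        fields = line.split(",")
        if fields and fields[0].strip():
            counts[int(fields[0])] += 1

for ts, n in sorted(counts.items()):
    if n > 1:
        print(f"timestamp {ts} ms has {n} entries")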

For workload generation I used CBT (the Ceph Benchmarking Tool), which uses parallel ssh to create and mount Ceph block devices and then spawns multiple instances of fio to write to or read from those block devices. In this specific case I used 160 clients for load generation. The fio commands being sent looked like:

sudo fio --ioengine=rbd --clientname=admin --pool=cbt-librbdfio --rbdname=cbt-librbdfio-client$i --invalidate=0 --rw=write --output-format=json --runtime=600 --ramp_time=90 --numjobs=1 --direct=1 --bs=4096B --iodepth=32 --write_iops_log=client$i --write_bw_log=client$i --write_lat_log=client$i --log_avg_msec=1000

Here $i is a variable (the client number).

I'm not sure how to interpret those multiple (same timestamped) entries.

CBT workload profile is given below:

cluster:
  user: 'cbt'
  head: "ceph-admin"
  clients: ["loadgen0", "loadgen1", "loadgen2", "loadgen3", "loadgen4", "loadgen5", "loadgen6", "loadgen7"]
  conf_file: "/etc/ceph/ceph.conf"
  osds: ["hstor0", "hstor1", "hstor2", "hstor3"]
  mon: ["hstor0", "hstor1", "hstor2"]
  use_existing: True
  osds_per_node: 2
  iterations: 1
  rebuild_every_test: True
  tmp_dir: "/home/cbt/cbt"
  clusterid: "ceph"
  pool_profiles:
    rbd2rep:
      pg_size: 2048
      pgp_size: 2048
      replication: 2

benchmarks:
  librbdfio:
    iterations: 1
    clientname: admin
    time: 600
    ramp: 90
    iodepth: [32]
    numjobs: 1
    mode: [randread, randwrite, randrw]
    ioengine: libaio
    rwmixread: 70
    op_size: [4096]
    vol_size: 10240
    direct: 1
    buffered: 0
    use_dir: True
    output-format: json
    volumes_per_client: 20
    fio_cmd: '/usr/bin/fio'
    pool_profile: rbd2rep
    bwavgtime: 1000
    log_avg_msec: 1000

sitsofe commented 5 years ago

I think that timestamp corresponds to when the sample was added and by definition that will be after the latency is calculated...

https://github.com/axboe/fio/blob/3a294b8704a4125f12e3c3dec36667e68d821be0/stat.c#L2634 leads to https://github.com/axboe/fio/blob/3a294b8704a4125f12e3c3dec36667e68d821be0/stat.c#L2481 which leads to https://github.com/axboe/fio/blob/3a294b8704a4125f12e3c3dec36667e68d821be0/stat.c#L2457 which gets to https://github.com/axboe/fio/blob/3a294b8704a4125f12e3c3dec36667e68d821be0/stat.c#L2339 .

t in __add_log_sample() comes from elapsed in __add_stat_to_log(), which is set from the elapsed passed in by add_log_sample(). In add_log_sample() elapsed is calculated - see https://github.com/axboe/fio/blob/3a294b8704a4125f12e3c3dec36667e68d821be0/stat.c#L2451 . Given that the elapsed time is calculated in milliseconds, I can well believe a machine would be fast enough to produce multiple entries within the same millisecond.
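
As a toy illustration of that point (not fio's actual code, just an analogous calculation in Python): elapsed times truncated to whole milliseconds repeat whenever samples land within the same millisecond.

import time

start = time.monotonic()

def elapsed_ms() -> int:
    # Elapsed time since "job start", truncated to whole milliseconds,
    # analogous in spirit to the elapsed value used when a sample is logged.
    return int((time.monotonic() - start) * 1000)

# Several back-to-back "completions" only microseconds apart all truncate
# to the same integer millisecond, so they get identical log timestamps.
samples = [elapsed_ms() for _ in range(5)]
print(samples)  # typically something like [0, 0, 0, 0, 0]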

@nasirkamal if this answers your question could you close this issue? Thanks!

sitsofe commented 5 years ago

Closing due to lack of reply from reporter. If this issue is still happening with the latest fio (see https://github.com/axboe/fio/releases to find out which version that is) please reopen. Thanks!

ndokos commented 4 years ago

@sitsofe I would like to reopen this issue.

I do a simple run with:

 /usr/local/bin/fio  --output-format=json ./fio.job

The version (/usr/local/bin/fio --version) is fio-3.18 and the job file looks like this:

[global]
bs=4k
runtime=60
ioengine=libaio
iodepth=32
direct=0
sync=0
time_based=1
clocksource=gettimeofday
ramp_time=5
write_bw_log=fio
write_iops_log=fio
write_lat_log=fio
log_avg_msec=1000
write_hist_log=fio
log_hist_msec=10000

[job-/home/pbench/tmp/foo]
filename=/home/pbench/tmp/foo
rw=rw
size=4096M
numjobs=4

IIUC, the log_avg_msec setting will output the average latency over a 1-sec window, with a timestamp just after the end of the window. And since I'm averaging over one-second windows, I expect the timestamps to increase by roughly 1 sec between entries.

But that's not what I see: not only do I get entries that are much closer together than 1 sec (see timestamps 29000 and 29040 below), but I also see multiple entries of the same type with identical timestamps. Here's, e.g., a portion of fio_lat.1.log:

...
29000, 443121, 0, 0, 0
29000, 472570, 1, 0, 0
29040, 76267, 0, 0, 0
29040, 59780, 1, 0, 0
30033, 506591, 0, 0, 0
30033, 622015, 1, 0, 0
30033, 240080000, 0, 0, 0
31035, 512569, 0, 0, 0
31035, 466661, 1, 0, 0
31035, 77109000, 0, 0, 0
32093, 502814, 0, 0, 0
32093, 440407, 1, 0, 0
32093, 228134000, 1, 0, 0
33111, 585750, 0, 0, 0
33111, 557852, 1, 0, 0
33111, 116083000, 0, 0, 0
33111, 116084000, 1, 0, 0
34039, 490727, 0, 0, 0
34039, 533947, 1, 0, 0
34039, 44131000, 0, 0, 0
34039, 44133000, 1, 0, 0
35000, 333857, 0, 0, 0
35000, 338863, 1, 0, 0
35061, 92974, 0, 0, 0
35061, 74590, 1, 0, 0
...

In some cases (e.g. for the 29040 timestamp) I get two entries, one for read and one for write: that's what I expect. But in other cases (e.g. for timestamp 30033) I get three entries: there are two entries for type 0 (with the avg latency off by 3 orders of magnitude between them). In others (e.g. for timestamp 33111), I get four entries, two for type 0 and two for type 1, also with very different avg latency values.
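
One way to enumerate these cases is to group the averaged samples by (timestamp, direction) and flag any timestamp where a direction appears more than once. This is a minimal Python sketch; the file name and the column order (timestamp, average latency, direction, block size, offset) are assumptions based on the excerpt above:

from collections import defaultdict

# Group averaged samples by (timestamp, direction). With log_avg_msec=1000,
# each direction should normally appear at most once per ~1 s window.
by_ts_dir = defaultdict(list)
with open("fio_lat.1.log") as f:
    for line in f:
        fields = [x.strip() for x in line.split(",")]
        if len(fields) >= 3:
            ts, value, ddir = int(fields[0]), int(fields[1]), int(fields[2])
            by_ts_dir[(ts, ddir)].append(value)

for (ts, ddir), values in sorted(by_ts_dir.items()):
    if len(values) > 1:
        print(f"{ts} ms, direction {ddir}: {len(values)} entries with values {values}")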

Two questions: 1) Is this OK or is it a bug? 2) If it is OK, how do I interpret these results?

I should mention that this happens with all latency logs (lat, clat, slat). This particular run was done on a RHEL8 VM, but I have seen similar results on RHEL7 VMs and a Fedora31 laptop (with an SSD).

sitsofe commented 4 years ago

@ndokos:

Could you open a new issue for this? While it might look related, your issue has differences (e.g. you're averaging)...

1) Is this OK or is it a bug?

30033, 506591, 0, 0, 0
30033, 622015, 1, 0, 0
30033, 240080000, 0, 0, 0

That doesn't look normal. However, can you guarantee that time is not moving backwards (e.g. you're not running NTP or anything else that fiddles with the clock)? I notice you're using clocksource=gettimeofday, which opens you up to that, so it would be good to know if you can use one of the other timers, which are at least monotonic.
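
As a quick illustration of the difference (a minimal Python sketch, not part of fio): the wall clock that gettimeofday reads can be stepped backwards by NTP or manual adjustment, whereas a monotonic clock is guaranteed never to go backwards.

import time

prev_wall = time.time()        # wall clock (what gettimeofday reads); NTP can step it
prev_mono = time.monotonic()   # monotonic clock; never goes backwards

for _ in range(1_000_000):
    wall, mono = time.time(), time.monotonic()
    if wall < prev_wall:
        print(f"wall clock moved backwards by {prev_wall - wall:.6f} s")
    assert mono >= prev_mono   # a monotonic clock never does this
    prev_wall, prev_mono = wall, mono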

2) If it is OK, how do I interpret these results?

I don't think you can at the moment but it's key that time is only moving forwards. If it's "jumping" then all bets are off...

ndokos commented 4 years ago

Will do - thanks! I will do a run with clock_gettime first and if I get the same problem, I'll open a new issue.