distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0

Pbench-fio reporting incorrect seq read IOPS in ver 0.44 #647

Open shekharberry opened 7 years ago

shekharberry commented 7 years ago

Hi,

I am running pbench-fio to capture perf numbers in my setup, but the seq read numbers reported by pbench-fio are way higher than expected. I have a 1Gb link and I don't expect a read throughput of more than 120 MB/sec, but I am getting ~760 MB/sec. The write number reported by pbench-fio is 38 MB/sec, which is ok and within range.

I also did a plain fio run (fio job-file) on the same environment and got a read throughput of 79 MB/sec and a write throughput of 38 MB/sec. These numbers are within range and explainable.

The read throughput is inflated even further if I do a multi-client pbench-fio test.

Thanks and Regards Shekhar

bengland2 commented 7 years ago

@ekuric asked me to look at this. Looks like a cached-read result. Look at your iostat data from the Gluster servers to see if your block device I/O rates are consistent with the result you are getting. Similarly, look at your sar network l2 data to see if that is consistent with it as well. With Gluster, dropping cache may not be enough if there is caching inside the Gluster processes. Try unmounting & remounting if you are not sure, and drop cache on the Gluster servers.
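For example, plain iostat and sar invocations like these on each Gluster server (independent of pbench's own collectors) will show whether the disks and NICs are really moving data at the reported rate:

    # extended per-device stats in MB/s, 5-second intervals
    iostat -xm 5
    # per-interface network throughput, 5-second intervals
    sar -n DEV 5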
HTH -ben

shekharberry commented 7 years ago

I dropped cache on both the server and client side before starting the pbench-fio test. To verify the anomaly I checked the numbers from iostat, and iostat does not report such high numbers. I can try unmounting and remounting for 1 application POD, but I will be using pbench-fio to run concurrent IOs on about 1000 volumes, where it would be difficult to unmount and remount.

bengland2 commented 7 years ago

@shekharberry, use the sar network-l2 graph to figure out whether data is actually coming from the servers at that rate. Did you check that? Are you sure the data is travelling over the 1-GbE network and not 10-GbE? The combination of iostat and sar data is effective for detecting cached-read behavior in these situations. Finally, you can post the fio --client JSON result file - what does it say? Is fio to blame, or is pbench-fio not extracting data from the fio result JSON correctly?
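For example (a rough sketch assuming fio's standard JSON layout; the result file name is just a placeholder), jq can pull the read numbers straight out of that file:

    # per-job read bandwidth (KB/s) and IOPS; with fio --client the per-client
    # results may sit under .client_stats instead of .jobs, depending on fio version
    jq '.jobs[] | {jobname, read_bw_KBps: .read.bw, read_iops: .read.iops}' fio-result.json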

I got a parameter named --pre-iteration-script added to pbench-fio so that it can drop cache before every sample. pbench-fio will try to run multiple samples, so even if you drop cache before you run pbench-fio, this won't help for samples 2 through N. But if you use the --pre-iteration-script parameter to run a cache-dropping script, then you can get the right results.
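A minimal cache-dropping script for --pre-iteration-script could look something like this (the server list and paths are hypothetical; adjust for your environment):

    #!/bin/bash
    # drop-cache.sh - run before every pbench-fio sample: flush dirty pages and
    # drop the page cache on the local client and on each Gluster server
    SERVERS="gluster1 gluster2 gluster3"   # hypothetical host names
    sync
    echo 3 > /proc/sys/vm/drop_caches
    for h in $SERVERS; do
        ssh "$h" 'sync; echo 3 > /proc/sys/vm/drop_caches'
    done

It would then be passed as, e.g., pbench-fio --pre-iteration-script=/root/drop-cache.sh ...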

atheurer commented 7 years ago

I think you should be congratulated on what's probably the biggest pbench job ever 🥇 I didn't even expect a 1000-client fio job to complete.

There are a couple of issues here:

1) it's very likely the containers are not actually running the test at the same time, due to fio having to tell 1000 different clients to start. You can see that a handful of containers started with really high IOPS (thousands/sec), then ramped down to 2-3/sec. Containers that started later began at 2-3 IOPS and stayed there.

2) In order to generate iops-over-time graphs for the result, pbench looks at fio_iops.1.log for each client. This is a log file with timestamps and an IOPS value (see the example excerpt after this list). There are a couple of problems with this file, the primary one being that the timestamps always start at 0 instead of the actual time. So we don't know when a particular client started/stopped in relation to the other clients. This is critically important because pbench will find the window of time where -all- clients were running and will reject any data before or after, where only some were running. This is vital for a multi-system test where the underlying resource (in this case disks) is shared. So we need fio to generate these files with real timestamps. The secondary problem with this file is that the IOPS value is always an integer. When dealing with 1000s of clients, a value of "2" or "3" is not accurate enough. We need something like at least 3-4 significant digits, so we need floating-point values.

3) with a test being only 30 seconds, the actual time period where all the clients are active is probably very small. If it takes 15 seconds to get all 1000 clients running, then only 15 seconds remain where they all run together, after which clients complete for the next 15 seconds. The run time should also be much longer to help with consistency in general. I would run these tests at least 5 minutes each, and I would run 5 samples so you have std-dev with the result.
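For reference, this is roughly what a per-client fio_iops.1.log looks like (illustrative values; the columns are time in msec since that client started, the integer IOPS value, the data direction, and the block size, and the exact layout varies slightly across fio versions):

    1000, 3, 0, 65536
    2000, 2, 0, 65536
    3000, 2, 0, 65536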

So, based on the data I see with this result, the clients are probably doing 2-3 IOPS each, so 2000-3000 IOPS total, and the gluster servers are supporting that with total-disk IOPS in the low-1000's range.

In summary, to make this test useful, we first need to fix fio to use real timestamps and floating-point IOPS values, and then run again with 5-minute runtimes and 5 samples.

bengland2 commented 7 years ago

On Tue, Sep 5, 2017 at 9:03 PM, Andrew Theurer notifications@github.com wrote:

> it's very likely the containers are not actually running the test at the same time, due to fio having to tell 1000 different clients to start. [...]

This can be worked around with the ramp_time and startdelay parameters. The startdelay parameter gives all fio processes a chance to start running before the workload is generated, and the ramp_time parameter gives all of them a chance to participate in the test equally. And if you use a runtime parameter to limit the duration of the test for all threads, then you can get all threads running in basically the same time interval, which is the only way you can aggregate the per-thread results into a cluster-wide result.
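As a rough sketch (illustrative values, not a recommendation), the relevant settings would sit in the job file's global section:

    [global]
    startdelay=60    # give all fio processes time to be launched before any I/O starts
    ramp_time=30     # discard the first 30 seconds of statistics while clients catch up
    runtime=300      # bound the measured window so all threads run over the same interval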

Another technique is to introduce a delay between I/O requests so that one thread can't consume all system resources, which you can do in fio with the thinktime= parameter. Let's say system throughput is X and there are T threads, where T >> 1. Then you can use a thinktime of ~T/4X to throttle threads without decreasing system throughput. Since most applications do something besides I/O, there is some rationale for doing this.
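To make that concrete with made-up numbers: if the cluster can sustain roughly X = 2000 IOPS and you run T = 1000 threads, then thinktime ≈ T/(4X) = 1000/8000 s = 125 ms per I/O, i.e. thinktime=125000 in the job file (the parameter defaults to microseconds). Each thread is then capped near 8 IOPS, so the 1000 threads together still have about 4x headroom over the 2000 IOPS target.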

> In order to generate iops-over-time graphs for the result, pbench looks at fio_iops.1.log for each client. [...] we need fio to generate these files with real timestamps. [...] we need floating point values.

I also ran into this problem with large numbers of guests. I did this upstream http://www.spinics.net/lists/fio/msg03862.html 2 years ago - does this work for you?

> with a test being only 30 seconds, the actual time period where all the clients are active is probably very small. [...]

+1


atheurer commented 7 years ago

@bengland2 the patch will definitely help, but we still need to fix the timestamps in the iops log file.

shekharberry commented 7 years ago

I ran into other setup-related issues and did not get a chance to try all the above suggestions. I will run the IO test once I resolve the current issues and update my findings here. Thanks for all your suggestions @bengland2 and @atheurer

shekharberry commented 7 years ago

Hi, thanks for all your suggestions @bengland2 and @atheurer. I did the following in my 500-client fio test:

-- Added a pre-iteration script to drop cache on all servers

--Initially time_based was set to 1, which resulted in fio looping over the same workload; this gave a very skewed result, with write being very low and read being very high, hence I changed to time_based=0. With time_based=0, pbench-fio could collect IOPS from only 3-4 clients out of 500, as they were the only ones that were active together.

--Now, since my dataset size is very small (only 6GB on each volume), my test finishes very fast. To overcome the problem above, I reduced ramp_time to 0 and introduced a startdelay of 30 secs to give all processes a chance to start before data collection. I left runtime=300 secs since time_based=0 for me. But this test is not even completing for 500 clients; it hangs with the following messages:

    fio: inflate error -3
    fio: failed decompressing log
    fio: failed converting IO log
    fio: inflate error -3
    fio: failed decompressing log
    fio: failed converting IO log

-- My sample fio job file:

    [global]
    bs=64k
    runtime=300
    ioengine=sync
    iodepth=32
    direct=0
    sync=0
    time_based=0
    clocksource=gettimeofday
    ramp_time=0
    startdelay=30

    [job-/mnt/glusterfs/x4]
    filename=/mnt/glusterfs/x4
    rw=write
    size=6G
    write_bw_log=fio
    write_iops_log=fio
    write_lat_log=fio
    log_avg_msec=1000
    write_hist_log=fio
    log_hist_msec=10000
    log_hist_coarseness=4   # 76 bins

Any suggestions on what else I should try to overcome the above issues?

--Shekhar

shekharberry commented 7 years ago

Here's an update on my tests from today:

--Tried running the test on 1 client with ramp_time=0 and startdelay=120 secs... it works fine.
--Then repeated the test on 10 clients with ramp_time=0 and startdelay=120 secs... it works fine.
--Then repeated the test on 100 clients with ramp_time=0 and startdelay=120 secs... it again gets stuck after the first operation (the write) completes, with:

    fio: inflate error -5
    fio: failed decompressing log
    fio: failed converting IO log

Here is the pbench-fio command I used:

    pbench-fio --test-types="write,read,rw" --config="gfile_100pv_ramp0_startdelay120" \
        --clients=$(cat 100cl | awk -vORS=, '{ print $1 }' | sed 's/,$/\n/') \
        --targets=/mnt/glusterfs/x6 --ioengine=sync --direct=0 --numjobs=1 -b 64 -s 6G
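(As an aside, assuming 100cl is simply one hostname per line, that --clients list can also be built with $(paste -sd, 100cl).)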

And here is the job file for write:

    [global]
    bs=64k
    runtime=300
    ioengine=sync
    iodepth=32
    direct=0
    sync=0
    time_based=0
    clocksource=gettimeofday
    ramp_time=0
    startdelay=120

    [job-/mnt/glusterfs/x6]
    filename=/mnt/glusterfs/x6
    rw=write
    size=6G
    write_bw_log=fio
    write_iops_log=fio
    write_lat_log=fio
    log_avg_msec=1000
    write_hist_log=fio
    log_hist_msec=10000
    log_hist_coarseness=4   # 76 bins

The /opt/pbench-agent/bench-scripts/postprocess/process-iteration-samples script, which processes the iteration samples, doesn't seem to work correctly as the number of clients increases.

--Shekhar

bengland2 commented 7 years ago

So this fio error above is exactly what I warned people about over a year ago. I could not figure out why fio was generating these errors, but I did figure out a workaround: stop having fio fetch the log files and use rpdcp to do it instead; it does this without error. Tim Wilkinson and I used this workaround for our 1000-guest tests in the scale lab. I can dig up the pbench-fio patch if you are interested. I have not tried to get it upstream so far.
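The workaround looks roughly like this (hypothetical host file and log paths; rpdcp is the reverse-copy tool from the pdsh package and appends the source hostname to each file it collects):

    # after the fio run, pull each client's log files back to the controller
    rpdcp -w ^clients.txt /var/tmp/fio/fio_iops.1.log /var/tmp/fio-logs/
    # files land as /var/tmp/fio-logs/fio_iops.1.log.<hostname>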

bengland2 commented 7 years ago

@shekharberry, @atheurer, the fio patch I used is applied to pbench-fio-2.14, fio/iolog.c around line 1098:
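(In effect, the change below makes fio always flush the log to a local file instead of sending it over the fio client/server connection via fio_send_iolog, sidestepping the path that was producing the "inflate error" / "failed decompressing log" messages; the log files are then collected out-of-band with rpdcp.)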

#if 0
        if (td->client_type == FIO_CLIENT_TYPE_GUI || is_backend)
                fio_send_iolog(td, log, log->filename);
        else
                flush_log(log, !td->o.per_job_logs);
#else
        flush_log(log, !td->o.per_job_logs);
#endif

Also see this patch to pbench-fio for more details on how I got pbench-fio to use rpdcp to retrieve the histogram logs.

HTH

-ben

shekharberry commented 7 years ago

@bengland2 Thanks for sharing your workaround. Will try it in my setup and update the findings here.

--Shekhar

ekuric commented 7 years ago

@bengland2 is this in some branch / PR already, ... thanks

ashishkamra commented 7 years ago

Ben and I just talked. It's still a patch, not merged upstream. It needs some amount of work to get it upstream. I will call a short meeting on how to make progress on that aspect.

Ashish


bengland2 commented 7 years ago

@ekuric no PR exists yet. The issue has been posted to the fio list here, but no solution has been proposed, and it is not clear to me that it would be accepted upstream. I suspect Jens wanted an fio implementation that could work on Windows as well as UNIX/Linux. My workaround is based on the pdsh package, which probably does not exist on Windows.

shekharberry commented 7 years ago

Thanks Ben and Ashish for looking into it and setting up a call to discuss it.

Best Regards Shekhar


ashishkamra commented 6 years ago

Meeting notes:

  1. Verify turning off histogram logging from the fio template - without Ben's patch - Shekhar
  2. Remove histogram logging from the pbench fio template by default - pbench team
  3. Document in pbench that you need to change the template to enable histogram logging if you really want it - pbench team
  4. If you really need histogram logging for better data analysis, use Ben's patch in the short term
  5. Another option to get histogram logging is to disable client mode when running fio at large scale. We may lose some fio aggregate functionality, but the pbench data analysis tools may compensate for that (a rough sketch follows this list).
  6. Need to make sure that the timestamps are correct in the log file - separate issue. Someone should own it. Does an issue exist in pbench for this?
  7. No owner identified yet to fix the send log issue in fio for sending histogram logs in client mode.
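A rough sketch of option 5 (hypothetical host file, job file, and paths): run fio standalone on every client via pdsh instead of using fio --client mode, then collect the per-host JSON results with rpdcp:

    # run the same job file on every client, writing JSON output locally on each host
    pdsh -w ^clients.txt 'fio /tmp/fio-write.job --output-format=json --output=/tmp/fio-result.json'
    # pull the per-host results back to the controller
    rpdcp -w ^clients.txt /tmp/fio-result.json /var/tmp/fio-results/
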
ashishkamra commented 6 years ago

@shekharberry can you update this card with your analysis?

ashishkamra commented 6 years ago

Refer to PRs #746, #747, #748.