axboe / fio

Flexible I/O Tester

Passing more than one profile definition file on command line results in incomplete reporting #91

Open szaydel opened 9 years ago

szaydel commented 9 years ago

I did not see any other issues similar to this one, so I am opening it, in case this is in fact a bug and not just me doing something stupid. I have not dug into the source enough to confirm either way.

What I observed is that when I run fio with more than one job definition file, and those files specify numjobs=X with X > 1, all reporting appears correct for the jobs in the first file, but only the first job in the second file appears to be reported on. What is strange is that the job count itself looks right: if I take, say, two files (let's assume it is actually one file whose name I repeat twice on the command line, like this: ENV_NUM_JOBS=4 ./fio --output-format=json ./reportingtest.fio ./reportingtest.fio > /tmp/log), the resulting JSON structure shows 8 jobs, 4 from each instance of the file passed to fio. This file is fairly basic:

[global]
include global-include.fio

[warmup_noop]
stonewall
bs=128k
rw=write
loops=1
time_based=1
runtime=5
numjobs=${ENV_NUM_JOBS}

This is a stupid test, that's all. What I get is that all jobs in the first file are reported correctly, and ONLY the first job in the second file is reported correctly; the other three are all zeroes across the board. Maybe I am indeed doing something stupid without realizing it, or perhaps it is a bug.

axboe commented 9 years ago

I'll try and reproduce this. What's in global-include.fio?

szaydel commented 9 years ago

This is it, thanks!

direct=0
directory=/storage/p01/test/01
filename_format=benchmark.$jobnum.$filenum
ioengine=solarisaio
fallocate=none
size=${ENV_SIZE}
iodepth=16
randseed=1277

szaydel commented 9 years ago

Upon doing a bit more testing, it is starting to look like there is some number of jobs that are reported correctly, and beyond that number the remaining jobs may be reported as all zeroes. This could well be a platform issue if it is not a problem that you are able to reproduce.

axboe commented 9 years ago

What ENV_SIZE are you running it with? You have a stonewall between the jobs and a short runtime, so it's likely that some jobs just never got to run. Hence they would report zero for the various metrics.
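
If that is the cause, one quick check (a sketch, not a verbatim suggestion from this thread) is a variant of the job above with the stonewall removed and a longer runtime, so every clone is guaranteed a chance to run; any clone that still reports all zeroes would then point at a reporting problem rather than a job that never started:

[global]
include global-include.fio

[warmup_noop]
; no stonewall here, so the clones are not serialized behind other jobs
bs=128k
rw=write
loops=1
time_based=1
runtime=30
numjobs=${ENV_NUM_JOBS}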

szaydel commented 9 years ago

This may be true. I have it set to 10GB, which with the drives that I have should be nothing at all. This is running over PCIe SSD-like devices, which chew through 10GB in a couple of seconds, but you could well be right.

szaydel commented 9 years ago

I am going to test with various sizes and various time periods; we'll see if that makes a difference. If the time is set to 30 seconds and, say, 4 parallel jobs are defined for each instance, the assumption is that all four will run in parallel for 30 seconds, correct?

Thank you!

szaydel commented 9 years ago

I tried with a much smaller size and a longer window, and it seems that either there is in fact a problem, or some combination of stonewall and runtime is preventing some jobs from running. It is difficult to tell at the moment. I shall investigate this further.

szaydel commented 9 years ago

Tested without setting time limits and the result appears the same: the first five jobs appear to be reported on correctly, but not the remaining jobs. It may be something platform-specific, which I shall test as time permits.

szaydel commented 9 years ago

Maybe I should ask about the best way to accomplish what I am trying to do, because what I am running into could be a direct result of an ill-conceived method.

What is the best approach to making sure that each job defined in a file is repeated some number of times, while also running some number of threads in parallel?

I want each job to have, say, 4 threads executing in parallel, and I want each definition file to be processed multiple times. The goal is for each clone to report independently, so the final report has one entry per clone per run. So, say I have 4 jobs, each with 4 clones run in parallel, repeated 4 times; this should result in 4 * 4 * 4 == 64 results in the final log file. I suspect what I am doing is not by any means ideal.
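
For what it's worth, that arithmetic maps directly onto the invocation shown earlier in the thread: numjobs multiplies the clones within each section, and repeating the job file on the command line multiplies the sections. A sketch, assuming a hypothetical jobs.fio that contains 4 job sections (the file name and section count are illustrative):

# 4 sections per file x 4 clones each (numjobs=4) x 4 repetitions of the
# file on the command line should yield 4 * 4 * 4 == 64 entries in the
# JSON "jobs" array, one per clone per section per repetition.
ENV_NUM_JOBS=4 ./fio --output-format=json \
    ./jobs.fio ./jobs.fio ./jobs.fio ./jobs.fio > /tmp/log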

axboe commented 9 years ago

loops=X? Though that would not do separate reporting of the different loop iterations, just one report covering all of them. I think you should keep looking into what the issue is with what you are currently running; there's no reason that shouldn't work.
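
To illustrate the distinction (a sketch, not from the thread): loops repeats the workload inside a single job, so the output still contains one entry for that job, whereas repeating the file on the command line produces one entry per repetition.

[writer]
; this hypothetical job writes `size` bytes 4 times back to back, but
; the JSON output still contains a single entry covering all 4 loops
bs=128k
rw=write
size=${ENV_SIZE}
loops=4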

szaydel commented 9 years ago

Yeah, like you said, loops=X does not delineate reporting, and delineation is what I want so I can treat each line effectively as a datapoint. I shall keep looking. Thank you for confirming that what I am trying to do "should" work; this helps.

szaydel commented 9 years ago

What I believe I am seeing is that the first job definition is reported separately for each thread, as set by numjobs, but the others do not appear to follow suit: they appear to be reported as a group. So, when I have, say, three separate jobs defined in one file, the first one yields 4 individual reports, while the other two each report as a single group. This results in 1 report instead of 4 for each of the subsequent jobs.
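
For reference, that collapsed-into-one pattern is what fio's group_reporting option produces when it applies to a job: the numjobs clones are aggregated into a single entry. Whether that option is actually in effect here is speculation, but a job like the following hypothetical one would show exactly one report despite numjobs=4:

[collapsed]
; with group_reporting set, the 4 clones created by numjobs are
; reported as one aggregated entry instead of 4 individual ones
group_reporting
bs=128k
rw=write
size=${ENV_SIZE}
numjobs=4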