axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

fio wait_for with multiple jobs can introduce variable and significant delay between jobs #1649

Open aggieNick02 opened 11 months ago

aggieNick02 commented 11 months ago


Description of the bug: I was using fio jobs with wait_for to switch between reads/writes and various block sizes over the course of a test. I was running the test with the default of logging every I/O.

I noticed a small (less than 1 second) delay between the I/O for each job. I wondered if it might be due to the specific device, so I changed the ioengine to null.

This made the delay huge - over 10 seconds between jobs, after only 4 seconds of each job's IO to null. This is pretty much a showstopper for using consecutive jobs in a fio script with the default logging, because the active-vs-idle time varies based on how the device performs in each job.

I realize this may not be trivial to fix, but I'm creating an issue as I look into it more.

Environment: Ubuntu 20.04.4 LTS

fio version: 3.32

Reproduction steps:
fio --filename=/dev/nvme1n1 --per_job_logs=0 --write_bw_log=fio_0 --write_lat_log=fio_0 --write_iops_log=fio_0 --ioengine=null --direct=1 --randrepeat=0 --norandommap=1 --slat_percentiles=1 --lat_percentiles=1 --clat_percentiles=1 --log_offset=1 --log_compression=10M --log_store_compressed=1 --rw=rw --iodepth=8 --size=32G --log_alternate_epoch=1 --name=job_bs_512_write_True --rwmixwrite=100 --bs=512 --runtime=4 --time_based=1 --name=job_bs_512_write_False --wait_for=job_bs_512_write_True --rwmixwrite=0 --bs=512 --runtime=4 --time_based=1

aggieNick02 commented 11 months ago

I tried switching to the default per_job_logs=1, but that made no difference.

axboe commented 11 months ago

Quick guess - it's compressing and writing out the latency logs. This is also why it's slower when you use the null engine: it's much faster than doing actual IO to the device, so you'll have a lot more log entries to process.

aggieNick02 commented 11 months ago

Thanks, I imagine that is what is going on. Turning off logs makes the delay go away, and I'm used to lots of I/O meaning a long wait when fio finishes running.

I'd really like all of the jobs to wait until every job is done before writing the logs to disk, especially in this case where I'm using jobs as a mechanism to create a workload with different characteristics over time. Should I be thinking about a different way to run such a workload? I'll admit it does feel odd to be spawning new processes to vary the behavior of the workload.

I was going to look into what it would take to delay the writing until all of the jobs are done, but I don't know the code well enough to know if such an option is trivial or complex to implement.

axboe commented 11 months ago

My only worry on delaying the compression and write back of the logs is that it'll potentially tie up increasingly more memory as you go. But outside of that, it should work. I don't think it'd be too hard to implement. td_writeout_logs() is called when a job exits, which is where that happens. The best way to handle this would likely be to have this be configurable, and if the option is set, the task would wait on a condition there before writing it out. When all jobs are done, iterate them and trigger that condition and wait for them to exit.
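Here's a standalone sketch of that gating pattern (untested, and none of these names are real fio symbols except td_writeout_logs() mentioned above; in fio the wait would sit in or around td_writeout_logs(), behind a new, hypothetical option - call it defer_log_writeout for the sake of argument):

```c
/* Sketch only: each "job" finishes its IO, signals the main thread, and
 * then waits for a gate to open before writing out its logs. The main
 * thread opens the gate once every job has finished. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_JOBS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int jobs_finished;       /* jobs that completed their IO phase */
static int writeout_allowed;    /* set once every job has finished */

static void writeout_logs(int job)
{
	/* stand-in for compressing + flushing this job's logs to disk */
	printf("job %d: writing out logs\n", job);
}

static void *job_fn(void *arg)
{
	int job = (int)(long)arg;

	sleep(1);			/* stand-in for the job's IO phase */

	pthread_mutex_lock(&lock);
	jobs_finished++;
	pthread_cond_broadcast(&cond);	/* tell the main thread we're done */
	/* with the (hypothetical) option set, wait here instead of writing
	 * the logs out immediately at job exit */
	while (!writeout_allowed)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);

	writeout_logs(job);
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_JOBS];

	for (long i = 0; i < NR_JOBS; i++)
		pthread_create(&threads[i], NULL, job_fn, (void *)i);

	/* once every job has finished its IO, open the gate so they can
	 * all write out their logs and exit */
	pthread_mutex_lock(&lock);
	while (jobs_finished < NR_JOBS)
		pthread_cond_wait(&cond, &lock);
	writeout_allowed = 1;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);

	for (int i = 0; i < NR_JOBS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}
```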

aggieNick02 commented 11 months ago

That makes sense on the memory usage. Just making sure I understand, this memory usage is just due to each job holding its log entries in memory until the end, right? I'm already used to keeping that in mind for longer-running jobs where the memory used by the log entries can grow quite large.
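To put very rough numbers on it for this repro (back-of-envelope, assuming something on the order of a few tens of bytes per log sample with log_offset enabled): if the null engine is pushing a couple million IOPS, a 4-second job logging every I/O is generating several million samples per log file, i.e. a few hundred MB of raw sample data per log file before compression, and that only grows with longer runtimes.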

Having it as an option that needs to be enabled makes sense. Thank you for the pointers on implementing it. I think it will provide a nice capability for users that want a single conceptual job with varying characteristics over time. Thanks again!

axboe commented 11 months ago

Right, mostly just thinking in terms of memory usage from the logs held by that job. There are other things too, as we'd be preventing the job from exiting (maybe? we might be able to let it exit anyway, in which case it'd just be the log memory), but that's the gist of it. As long as that is adequately documented for this option, it should be fine.