axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

FIO is opening too many files for verify workloads #520

Open chamarthy opened 6 years ago

chamarthy commented 6 years ago

I have observed that for workloads that exercise a large number of files (e.g. nrfiles=10000) with a limited number of files open at any point in time (openfiles=16), FIO fails with the error "Too many open files". With the following job file, I would expect FIO to keep at most 16 files open at any point in time. But when I watch lsof for fio, I see that after some time the open-file count increases drastically and the run fails.

# while true; do lsof | grep fio | wc -l; sleep 1; done
145
146
145
146
145
145
145
145
146
145
145
145
133
132
132
389
1110

With an increased ulimit for open files, the count reaches >5000 at times (and sometimes even >20000):

146
146
131
709
1407
2084
2731
3425
4100
4791
5459
146
146

Job File:

[global]
ioengine=libaio
direct=0
size=14G
fsync=8
fsync_on_close=1
end_fsync=1
create_serialize=1
refill_buffers=1
verify=md5
numjobs=1
readwrite=rw
iodepth=16
bs_unaligned=1
time_based=1
create_on_open=1
filesize=8k-512k
do_verify=1
openfiles=16
bsrange=1k-8k
runtime=1800
nrfiles=10000
[job1]
directory=/root/fileio

I just wanted to check whether there is an issue with the job profile, or whether it is an issue with how FIO opens/closes files.

FIO version: fio-3.3-31-gca65 on CentOS 7.4.

sitsofe commented 6 years ago

@chamarthy at first glance this sounds like a bug. Does fio fail immediately, or does the open-file count build up over time?

chamarthy commented 6 years ago

@sitsofe It depends on the ulimit setting for open files. On configurations where the open-file limit is set low, FIO fails with the error "too many open files". Otherwise, on runs that operate on a lot of files, FIO ends up holding too many files open (sometimes >20000).

sitsofe commented 6 years ago

OK, the problem is reproducible with the following:

$ mkdir /tmp/lotsoffiles
$ cat <<EOF | fio - 
[global]
thread
ioengine=libaio
size=20M
create_serialize=1
verify=md5
readwrite=rw
bs_unaligned=1
time_based=1
create_on_open=1
filesize=8k-512k
openfiles=2
bsrange=1k-8k
runtime=60
nrfiles=30
verify_state_save=0
[job1]
directory=/tmp/lotsoffiles
EOF

job1: (g=0): rw=rw, bs=(R) 1024B-8192B, (W) 1024B-8192B, (T) 1024B-8192B, ioengine=libaio, iodepth=1
fio-3.5-13-g8128
Starting 1 thread
fio: try reducing/setting openfiles (failed at 17 of 30)
fio: pid=7420, err=24/file:filesetup.c:703, func=open(/tmp/lotsoffiles/job1.0.28), error=Too many open files

The problem seems linked to the fact that get_next_verify() can open additional files but doesn't check whether td->nr_open_files >= td->o.open_files before doing so (and return that we're too busy, so the attempt is retried later). It's worth noting that td_io_open_file(), which is called from get_next_verify(), will increment td->nr_open_files...

Additionally, it looks like when things go wrong in td_io_open_file() we don't decrement nr_open_files on the goto err path.