NVIDIA / proxyfs


Stuck on writing and reading files larger than 1 Mb #667

Closed SadeghTb closed 2 years ago

SadeghTb commented 2 years ago

After running the fio benchmark with a 1 MiB file size it works fine, but when I change the global size to e.g. 10Mi it gets stuck. benchmark.fio:

[global]
directory=/mnt
end_fsync=1
filename_format=fio.$jobnum.$filenum
group_reporting
iodepth=1
ioengine=psync
size=100Mi

[4KiB_write]
blocksize=4Ki
readwrite=write

[4KiB_read]
blocksize=4Ki
readwrite=read

This is the trace log from when it got stuck. iclient log:

[2022-04-10T04:23:03.734579155Z][TRACE] <== rpcAdjustInodeTableEntryOpenCount(adjustInodeTableEntryOpenCountResponse: &{}, err: <nil>)
[2022-04-10T04:23:03.734680208Z][TRACE] <== DoOpen(openOut: &{FH:5407 OpenFlags:0 Padding:0}, errno: errno 0)
[2022-04-10T04:23:03.734747176Z][TRACE] <== DoGetXAttr(getXAttrOut: <nil>, errno: no data available)
[2022-04-10T04:23:03.734947844Z][TRACE] ==> DoGetAttr(inHeader: &{Len:56 OpCode:3 Unique:24 NodeID:5131 UID:0 GID:0 PID:2750 Padding:0}, getAttrIn: &{Flags:1 Dummy:0 FH:5407})
[2022-04-10T04:23:03.7350326Z][TRACE] <== DoGetAttr(getAttrOut: &{AttrValidSec:0 AttrValidNSec:250000000 Dummy:0 Attr:{Ino:5131 Size:104857600 Blocks:204800 ATimeSec:1649564583 MTimeSec:1649564583 CTimeSec:1649524097 ATimeNSec:722828167 MTimeNSec:722828167 CTimeNSec:561918465 Mode:33188 NLink:1 UID:0 GID:0 RDev:0 BlkSize:512 Padding:0}}, errno: errno 0)
[2022-04-10T04:23:03.735083518Z][TRACE] ==> DoWrite(inHeader: &{Len:4176 OpCode:16 Unique:26 NodeID:5131 UID:0 GID:0 PID:2749 Padding:0}, writeIn: &{FH:5404 Offset:4096 Size:4096: WriteFlags:0 LockOwner:0 Flags:32770 Padding:0 len(Data):4096})
[2022-04-10T04:23:03.735395006Z][TRACE] <== DoWrite(writeOut: &{Size:4096 Padding:0}, errno: errno 0)
[2022-04-10T04:23:03.735363175Z][TRACE] ==> DoRead(inHeader: &{Len:80 OpCode:15 Unique:28 NodeID:5131 UID:0 GID:0 PID:2750 Padding:0}, readIn: &{FH:5407 Offset:0 Size:16384 ReadFlags:0 LockOwner:0 Flags:32768 Padding:0})
[2022-04-10T04:23:03.735540696Z][TRACE] ==> DoGetXAttr(inHeader: &{Len:68 OpCode:22 Unique:30 NodeID:5131 UID:0 GID:0 PID:2749 Padding:0}, getXAttrIn: &{Size:0 Padding:0 Position:0 Padding2:0 Name:[115 101 99 117 114 105 116 121 46 99 97 112 97 98 105 108 105 116 121]})
SadeghTb commented 2 years ago

1 MiB sample result:

4KiB_write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
4KiB_read: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 2 processes
4KiB_write: Laying out IO file (1 file / 1MiB)
4KiB_read: Laying out IO file (1 file / 1MiB)

4KiB_write: (groupid=0, jobs=2): err= 0: pid=1621: Sat Apr  9 17:06:41 2022
  read: IOPS=1753, BW=7014KiB/s (7182kB/s)(1024KiB/146msec)
    clat (nsec): min=1800, max=33275k, avg=552421.08, stdev=2065224.15
     lat (nsec): min=1869, max=33276k, avg=552658.18, stdev=2065271.34
    clat percentiles (usec):
     |  1.00th=[   46],  5.00th=[  227], 10.00th=[  273], 20.00th=[  306],
     | 30.00th=[  343], 40.00th=[  371], 50.00th=[  392], 60.00th=[  420],
     | 70.00th=[  453], 80.00th=[  490], 90.00th=[  570], 95.00th=[  693],
     | 99.00th=[ 1647], 99.50th=[ 2606], 99.90th=[33162], 99.95th=[33162],
     | 99.99th=[33162]
  write: IOPS=522, BW=2090KiB/s (2140kB/s)(1024KiB/490msec); 0 zone resets
    clat (usec): min=215, max=105894, avg=1387.33, stdev=9403.43
     lat (usec): min=215, max=105895, avg=1387.69, stdev=9403.47
    clat percentiles (usec):
     |  1.00th=[   231],  5.00th=[   281], 10.00th=[   289], 20.00th=[   322],
     | 30.00th=[   338], 40.00th=[   367], 50.00th=[   388], 60.00th=[   416],
     | 70.00th=[   437], 80.00th=[   482], 90.00th=[   578], 95.00th=[   840],
     | 99.00th=[ 33817], 99.50th=[103285], 99.90th=[105382], 99.95th=[105382],
     | 99.99th=[105382]
  lat (usec)   : 2=0.20%, 10=0.20%, 50=0.20%, 250=3.91%, 500=77.73%
  lat (usec)   : 750=12.89%, 1000=1.56%
  lat (msec)   : 2=1.76%, 4=0.78%, 50=0.39%, 250=0.39%
  cpu          : usr=0.00%, sys=4.42%, ctx=1112, majf=0, minf=24
  IO depths    : 1=100.2%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=256,256,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=7014KiB/s (7182kB/s), 7014KiB/s-7014KiB/s (7182kB/s-7182kB/s), io=1024KiB (1049kB), run=146-146msec
  WRITE: bw=2090KiB/s (2140kB/s), 2090KiB/s-2090KiB/s (2140kB/s-2140kB/s), io=1024KiB (1049kB), run=490-490msec

10MiB sample result:

4KiB_write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
4KiB_read: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 2 processes
4KiB_write: Laying out IO file (1 file / 10MiB)
4KiB_read: Laying out IO file (1 file / 10MiB)
^C^C
fio: terminating on signal 2

Run status group 0 (all jobs):
edmc-ss commented 2 years ago

Nice detail in the issue, SadeghTb. I believe that in the fio config file you published you specify size=100Mi, rather than 1Mi or 10Mi, correct?

I ran with each of the size= settings of 1Mi, 10Mi, and 100Mi... and it worked just fine in my container setup. I will say that, running in the containerized environment (currently), the lowly 10MiB/s means the test pauses for a bit while it lays out the 100MiB file... and then takes roughly 10 seconds to complete:

=====

4KiB_write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
4KiB_read: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.28
Starting 2 processes
4KiB_write: Laying out IO file (1 file / 100MiB)
4KiB_read: Laying out IO file (1 file / 100MiB)
Jobs: 2 (f=2): [W(1),R(1)][15.0%][r=11.7MiB/s,w=6956KiB/s][r=2996,w=1739 IOPS][e
Jobs: 2 (f=2): [W(1),R(1)][23.5%][r=9722KiB/s,w=8247KiB/s][r=2430,w=2061 IOPS][e
Jobs: 2 (f=2): [W(1),R(1)][31.2%][r=13.0MiB/s,w=7492KiB/s][r=3323,w=1873 IOPS][e
Jobs: 2 (f=2): [W(1),R(1)][40.0%][r=10.3MiB/s,w=7808KiB/s][r=2641,w=1952 IOPS][e
Jobs: 2 (f=2): [W(1),R(1)][46.7%][r=9100KiB/s,w=7480KiB/s][r=2275,w=1870 IOPS][e
Jobs: 2 (f=2): [W(1),R(1)][53.3%][r=9502KiB/s,w=7272KiB/s][r=2375,w=1818 IOPS][e
Jobs: 2 (f=2): [W(1),R(1)][60.0%][r=12.9MiB/s,w=7480KiB/s][r=3291,w=1870 IOPS][e
Jobs: 1 (f=1): [W(1),_(1)][66.7%][r=12.4MiB/s,w=7444KiB/s][r=3173,w=1861 IOPS][e
Jobs: 1 (f=1): [W(1),_(1)][78.6%][w=9285KiB/s][w=2321 IOPS][eta 00m:03s]
Jobs: 1 (f=1): [W(1),_(1)][100.0%][w=8960KiB/s][w=2240 IOPS][eta 00m:00s]
4KiB_write: (groupid=0, jobs=2): err= 0: pid=3191: Sun Apr 10 06:03:07 2022
  read: IOPS=2874, BW=11.2MiB/s (11.8MB/s)(100MiB/8906msec)
    clat (nsec): min=474, max=83337k, avg=346269.03, stdev=1326771.80
     lat (nsec): min=497, max=83338k, avg=346427.62, stdev=1326782.20
    clat percentiles (nsec):
     |  1.00th=[     498],  5.00th=[     516], 10.00th=[     540],
     | 20.00th=[     612], 30.00th=[   31616], 40.00th=[  220160],
     | 50.00th=[  309248], 60.00th=[  378880], 70.00th=[  436224],
     | 80.00th=[  493568], 90.00th=[  561152], 95.00th=[  643072],
     | 99.00th=[ 1011712], 99.50th=[ 1843200], 99.90th=[11599872],
     | 99.95th=[12255232], 99.99th=[71827456]
   bw (  KiB/s): min= 8495, max=15032, per=98.49%, avg=11324.24, stdev=1948.10, samples=17
   iops        : min= 2123, max= 3758, avg=2831.06, stdev=487.02, samples=17
  write: IOPS=1989, BW=7957KiB/s (8148kB/s)(100MiB/12869msec); 0 zone resets
    clat (usec): min=64, max=83442, avg=496.97, stdev=1562.39
     lat (usec): min=64, max=83442, avg=497.19, stdev=1562.39
    clat percentiles (usec):
     |  1.00th=[  137],  5.00th=[  223], 10.00th=[  269], 20.00th=[  326],
     | 30.00th=[  367], 40.00th=[  396], 50.00th=[  416], 60.00th=[  445],
     | 70.00th=[  474], 80.00th=[  515], 90.00th=[  578], 95.00th=[  652],
     | 99.00th=[ 1090], 99.50th=[ 2073], 99.90th=[11863], 99.95th=[12911],
     | 99.99th=[79168]
   bw (  KiB/s): min= 6432, max=10416, per=100.00%, avg=7987.68, stdev=1022.27, samples=25
   iops        : min= 1608, max= 2604, avg=1996.84, stdev=255.54, samples=25
  lat (nsec)   : 500=0.83%, 750=11.23%, 1000=0.15%
  lat (usec)   : 2=1.09%, 4=0.05%, 10=0.75%, 20=0.34%, 50=0.80%
  lat (usec)   : 100=0.60%, 250=9.54%, 500=53.92%, 750=18.02%, 1000=1.57%
  lat (msec)   : 2=0.60%, 4=0.07%, 10=0.02%, 20=0.38%, 100=0.03%
  cpu          : usr=0.94%, sys=9.84%, ctx=84140, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=25600,25600,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=11.2MiB/s (11.8MB/s), 11.2MiB/s-11.2MiB/s (11.8MB/s-11.8MB/s), io=100MiB (105MB), run=8906-8906msec
  WRITE: bw=7957KiB/s (8148kB/s), 7957KiB/s-7957KiB/s (8148kB/s-8148kB/s), io=100MiB (105MB), run=12869-12869msec

=====

When I've seen hangs like this in the past, it's always boiled down to one of two things:

1) I've got some "lock" I've neglected to release whereupon a subsequent attempt to re-acquire the lock hangs...

2) I've run out of disk space in my Swift instance...

To get to the bottom of #1, dumping the stack traces of all goroutines is usually sufficient to nail the culprit (a minimal sketch of how to do that is below). #2 is exposed with df commands on the Swift Object Server(s), of course.
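For #1, here is a minimal sketch (not ProxyFS code; the standalone main and the 6060 port are illustrative assumptions) of two standard ways a Go daemon can produce a full goroutine stack dump:

// Minimal sketch: produce the goroutine stack dump mentioned above,
// which is what you'd read to spot a lock that was never released.
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"os"
	"runtime/pprof"
)

func main() {
	// Option 1: write every goroutine's full stack to stdout on demand
	// (debug=2 prints the unabridged stack, one block per goroutine).
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)

	// Option 2: serve the same dump over HTTP so it can be fetched while
	// the process is hung:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
	// A SIGQUIT (kill -QUIT <pid>) also makes any Go process print all
	// goroutine stacks to stderr before exiting.
	http.ListenAndServe("localhost:6060", nil)
}

Either dump shows, per goroutine, exactly which function is holding or waiting on a lock, which is usually enough to spot a missing unlock.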

As you've probably seen, ProxyFS is composed of two or three components:

ickpt - this component works around the "eventual consistency" of Swift Object PUT behavior: since the CheckPoint Record is PUT to Object "0000000000000000" each time, a subsequent GET (upon restarting) is not guaranteed to return the last CheckPoint Record that was PUT. You can certainly run without ickpt - I do this all the time in test environments where the eventual consistency of Swift is very unlikely to bite us (i.e. there are no failures of Swift nodes during the test sequence).

imgr - this is the "centralized" metadata server. Really, it is just the entity recording updates to the Inode Table... a simple mapping from Inode Number to an Object Number in the Swift Container (see the sketch after these component descriptions). It is also responsible for tracking "leases" (distributed lock manager) and open/close references. The only Objects it PUTs are the CheckPoint Record Object and the B+Tree "pages" that get modified when the Inode Table is updated.

iclient - this is the FUSE file system driver that really does all the heavy lifting of modifying the Objects making up the file system elements (i.e. directories, files, and symlinks).
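To make the imgr description a bit more concrete, here is a toy sketch. The type and field names are hypothetical (not the actual ProxyFS structures); it only illustrates the shape of the mapping that imgr maintains and persists as B+Tree pages:

// Illustrative sketch only - the names below are invented, not the real
// imgr types. It shows the essence of the Inode Table: a mapping from
// Inode Number to the Swift Object currently holding that inode's state.
package main

import "fmt"

// InodeTableEntry (hypothetical) records where an inode currently lives.
type InodeTableEntry struct {
	ObjectNumber uint64 // Swift Object (in the volume's Container) holding the inode
	ObjectLength uint64 // bytes of that Object occupied by the inode's metadata
}

// InodeTable (hypothetical) stands in for the B+Tree-backed table that
// imgr updates and checkpoints; it is keyed by Inode Number.
type InodeTable map[uint64]InodeTableEntry

func main() {
	table := InodeTable{}
	// Inode 5131 is the NodeID seen in the iclient trace earlier in this issue.
	table[5131] = InodeTableEntry{ObjectNumber: 42, ObjectLength: 512}
	fmt.Printf("inode 5131 -> object %d\n", table[5131].ObjectNumber)
}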

BTW - cool that you enabled tracing. You can do that on the imgr side as well... it kind of paints both ends of the RPC exchanges between iclient & imgr that way.

I'll keep trying to reproduce your issue over the next couple of days... but any further info you might provide would be extremely helpful. One such critical piece of info would be the commit tag you are running. Here's mine (this is currently development/HEAD):

$ git describe
2.01.0-10-gb00f4d7f

I suppose you might be using the HEAD of stable:

$ git describe
2.01.0

I haven't tested that one but will endeavor to do so at my end. There have certainly been improvements in the 10 commits on the development branch. Frankly, we are nearly done passing the portion of the standard PJD test suite (https://github.com/pjd/pjdfstest) that is pertinent to the feature set of ProxyFS (i.e. only directories, files, and symlinks... no atime, no sticky/setuid/setgid, etc.). The intention is to do a 2.02.00 tagged release once we get to that point... probably this upcoming week.

edmc-ss commented 2 years ago

I did just verify that, in my containerized environment at least, all three size= values succeed on stable/HEAD (2.01.00).

SadeghTb commented 2 years ago

Thank you so much for your complete answer, edmc-ss. Yes, it turned out there was no issue with ProxyFS: the problem was my Swift backend storage, which had filled up and had only a few megabytes of capacity left.