onnozweers opened this issue 1 year ago
We're running 8.2.24.
Looking at the graphs, the queueing seems to have started when we upgraded from 8.2.16 to 8.2.24, at the end of June. Java remained at version 11.0.18.
This looks like it could be related to a8a6912fdb @DmitryLitvintsev . The commit was added with 8.2.23.
Hi Onno,
Does it cause problems for the users, or do you just think it is causing problems?
Here is the description of the patch
pnfsmanager: process listings of the same path sequentially, implement list
Motivation:
Multiple batch jobs executing listings of the same directories, especially
if the directories contain many entries, quickly exhaust the active threads, each thread
executing the same listing, denying listing to all other customers and
creating significant load on the namespace server (and DB). Clients time out
and retry, creating even more load on dCache.
Modification:
The problem is solved by implementing PnfsListDirectoryMessage folding
and dispatching list requests to multiple queues (the number of queues is
configurable) based on directory path hashCode. Each queue is served by a single
thread. This guarantees that a listing of the same path will be queued to
the same queue and only one listing of a unique path can be active.
Once listing is complete, the queue is scanned for all other requests for
the same path and these messages are folded (that is their payload is
populated from just completed listing) and routed back to sender without
ever hitting the database.
Ancillary changes include: instead of a hardcoded time in the future when
the quota and FS stat scans start, use the already available settings for the durations
between runs. This is to avoid triggering these scans when PnfsManager is
restarted often (when testing, for instance). And a few "final" qualifiers added
where suggested by the IDE.
Result:
Lists of large directories no longer hog PnfsManager, denying service to all
other list calls. Additionally, we observed a significant reduction of CPU and
IO load on the chimera DB host. The patch has been running in production at Fermilab
and has been shown to solve the issue of listings timing out.
patch: https://rb.dcache.org/r/13992/
The way it works is: when you have an active listing and another listing for the same directory queued, the queued ones get the result of the active one once it has completed. So there is still parallelism (across different directories).
It works very well at Fermilab and in fact fixed a lot of our listing hangs (when multiple large directories are listed at the same time)
I use:
pnfsmanager.limits.list-threads = 33
Try that :).
This is how this typically looks on our system:
QUEUED REQUESTS
SOURCE USER PATH
GFTP3-fndca4b-AAYBUPdNoBg@gridftp3-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP9-fndca4b-AAYBUPddRAA@gridftp9-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP10-fndca4b-AAYBUPdf58g@gridftp10-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP4-fndca4b-AAYBUPdxlyA@gridftp4-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox/0/1/1
Xrootd01-fndca4b@xrootd01-fndca4bDomain <unknown> /pnfs/fnal.gov/usr/uboone/persistent/stash/uboonebeam/beam_data/bnb_data/00/02/44/83
GFTP1-fndca4b-AAYBUPeJhCA@gridftp1-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP4-fndca4b-AAYBUPeOv_g@gridftp4-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP9-fndca4b-AAYBUPeY9UA@gridftp9-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP8-fndca4b-AAYBUPe9HxA@gridftp8-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP4-fndca4b-AAYBUPgBhsg@gridftp4-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP8-fndca4b-AAYBUPgmmvg@gridftp8-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP0-fndca4b-AAYBUPgwP7g@gridftp0-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP3-fndca4b-AAYBUPhFMwA@gridftp3-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP6-fndca4b-AAYBUPhFrBg@gridftp6-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP5-fndca4b-AAYBUPhK4CA@gridftp5-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP6-fndca4b-AAYBUPhPHhA@gridftp6-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP8-fndca4b-AAYBUPhkFUA@gridftp8-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
Xrootd01-fndca4b@xrootd01-fndca4bDomain <unknown> /pnfs/fnal.gov/usr/minerva/persistent/stash/pidDB/data/00016730
WebDAV00-fndcadoor02@webdav-fndcadoor02Domain uboonepro /pnfs/fnal.gov/usr/uboone/scratch/uboonepro/dropbox/merge/data/uboone/root-tuple/38
WebDAV00-fndcadoor03@webdav-fndcadoor03Domain uboonepro /pnfs/fnal.gov/usr/uboone/scratch/uboonepro/dropbox/merge/overlay/uboone/root-tuple/2
ACTIVE REQUESTS
SOURCE USER PATH
WebDAV-fndca4b-2@webdavDomain1 131.225.240.62 /pnfs/fnal.gov/usr/mu2e/resilient/jobsub_stage
WebDAV00-fndcadoor02@webdav-fndcadoor02Domain mnieslon /pnfs/fnal.gov/usr/annie/persistent/users/moflaher/g4dirt_vincentsgenie/BNB_Water_10k_22-05-17
WebDAV00-fndcadoor02@webdav-fndcadoor02Domain uboonepro /pnfs/fnal.gov/usr/uboone/scratch/uboonepro/dropbox/merge/data/uboone/root-tuple/68
In the above, the listing of /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox is queued, but once the first of them completes, all of them complete immediately.
Another comment. The patch did not make it so that only one listing is running. The algorithm is as follows. There is a list of threads and the listing of path is dispatched to a thread number defined by:
N = hashcode(path) % pnfsmanager.limits.list-threads
So it is not that there is always just one listing thread running. Yes, there could be situations where a large directory lands in the same hash bucket as a small directory; in that case the small directory would have to wait for the large one to be completed. But as I said, we have seen a very tangible improvement at Fermilab.
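For illustration only, here is a minimal Java sketch of that dispatch rule (not the actual dCache code; all class and method names are invented):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch only: route each list request to one of N single-threaded queues
// based on the directory path's hashCode, so requests for the same path
// always land on the same queue.
public class ListDispatchSketch {

    private final List<BlockingQueue<String>> queues = new ArrayList<>();

    public ListDispatchSketch(int listThreads) {
        for (int i = 0; i < listThreads; i++) {
            queues.add(new LinkedBlockingQueue<>());
        }
    }

    // N = hashCode(path) % number of list threads (kept non-negative here).
    int queueIndexFor(String path) {
        return Math.floorMod(path.hashCode(), queues.size());
    }

    void submit(String path) throws InterruptedException {
        queues.get(queueIndexFor(path)).put(path);
    }

    public static void main(String[] args) throws InterruptedException {
        ListDispatchSketch dispatcher = new ListDispatchSketch(33); // e.g. pnfsmanager.limits.list-threads = 33
        dispatcher.submit("/pnfs/example.org/data/some/dir");
        System.out.println(dispatcher.queueIndexFor("/pnfs/example.org/data/some/dir"));
    }
}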
If this continues to give you trouble, we may make this functionality optional (yet another variable).
Also... one thing to note. I have been doing the same kind of watching of "show list activity" and I must say that this is a little deceptive. Particularly that ACTIVE listing. I have noticed that you only "catch" listings in ACTIVE if they last a few seconds.
E.g. if you think there is only one listing running, try timing a listing while you see one ACTIVE. In my experience you get the result right away and never see it being active. You need to use a protocol other than NFS, since NFS directly queries the database, bypassing PnfsManager (I casually use uberftp ls for this sort of thing).
P.S.: we generally see more queueing than before, but we do not see one user causing a DoS by running multiple ls commands (via batch jobs) on large directories (those would starve all available allowed threads). The behavior now: if such a user hits the system, his ls requests will queue up until the first completes, meanwhile giving other users' ls requests on other directories a chance to run simultaneously (on paths whose hashcode != hashcode of that user's directory). Once his request completes, all his queued ls requests complete with the result of the just-completed request, without hitting the database. And so it goes. Anyhow, I would like to hear what your users experience. The ever-present ls queue should not be a problem. At least it is not a problem for us. We do not have millions of entries per directory; max about 200K. So I am eager to hear if there are side effects.
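As a rough illustration of the folding step described above (again not dCache code; names are invented): when the active listing of a path finishes, every queued request for that same path is answered from the completed result instead of going to the database.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Sketch only: "fold" queued requests for a just-completed path.
public class ListFoldingSketch {

    static void onListingCompleted(String completedPath, List<String> entries, Queue<String> queuedPaths) {
        // Scan the queue; every request for the same path is removed and
        // answered with the payload of the listing that just finished.
        queuedPaths.removeIf(path -> {
            if (path.equals(completedPath)) {
                System.out.println(path + " folded: " + entries.size() + " entries, no DB access");
                return true;
            }
            return false;
        });
    }

    public static void main(String[] args) {
        Queue<String> queued = new ArrayDeque<>();
        queued.add("/pnfs/example.org/data/big-dir");
        queued.add("/pnfs/example.org/data/other-dir");
        queued.add("/pnfs/example.org/data/big-dir");
        onListingCompleted("/pnfs/example.org/data/big-dir", List.of("a", "b", "c"), queued);
        System.out.println("still queued: " + queued.size()); // 1, the other directory
    }
}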
[I removed this comment because of the next comment]
I missed this part:
Another comment. The patch did not make it so that only one listing is running. The algorithm is as follows. There is a list of threads and the listing of path is dispatched to a thread number defined by:
N = hashcode(path) % pnfsmanager.limits.list-threads
So, it is possible that when there is a listing that lasts an hour, all listings assigned to the same thread have to wait, while other threads are doing nothing? Actually this does not make sense to me.
This is much better than all threads busy processing N listings of 1 hour, not letting anybody in at all (this is what we had to deal with at Fermilab, and hence the patch). With this patch others have a good chance, and only 1 out of N long list queries will actually be run on the DB. "Things are better than they look."
Besides the queueing you observe in your monitoring, do you have customers complaining? That is the important bit.
Other things we use:
pnfsmanager.limits.list-chunk-size = 1000
(I think default is 100)
And raise the number of list threads.
P.S.: a listing query lasting longer than the PnfsManager timeout would trigger a timeout error in the directory list handler (so a 1-hour listing is not really possible).
Users haven't complained yet, but I'm sure they will if this continues. I added this metric to our monitoring because users complained before (when we had 2 threads, the default).
I would be very happy if this were configurable, so that we can return to the old algorithm.
Well.... give it a try with the increased list thread count and chunk size. My bet - they won't complain. Based on our experience.
If they don't complain - we won't make it optional :)
Sorry, whatever tuning may bring, I still think it's ugly. You have turned directory listing performance into a game of Russian roulette.
Hi Onno,
I respect your opinion, but it works much better for us. We used to have a PnfsManager DoS weekly; since we deployed the patch, we have never had any issues. So give it a try first.
BTW, just to add to it: GetFileAttributes messages have been processed exactly like that for many years. I just extended the same approach to listing.
Dmitry
We increased the threads from 10 to 100 and the chunk size from 100 to 10,000. We've also tried chunk size 1,000, and 10,000 does not seem to perform differently from 1,000, so we'll probably revert to 1,000 during the next maintenance.
We see that the maximum waiting time during peak hour is reduced from ~50 minutes to ~15 minutes. The srmls timeout is 2 minutes though, so we're not out of trouble yet.
Now waiting to hear from users.
I think a 10K chunk size is a little excessive. How is the memory footprint of the PnfsManager? The number you quoted, ~50 minutes, seems staggering to me.
Could you post:
"\sn show list activity -times"
and
"\sn info -a"
output. And, make sure this variable is not set to false:
pnfsmanager.enable.folding
(default is true). Folding is absolutely crucial here.
Also, the 50-minute wait: do I get it correctly that you have 1M entries in a directory, and lists for that directory (and all the unfortunate ones that have the same hashcode % number_of_threads value) queue up?
And in the past, having just 2 list threads, you had no issues? Of course, if your experience is so drastically worse than before, I can make the functionality optional, or consider improvements to the algorithm.
The "info -a" output @ Fermilab public dCache system:
Statistics:
PnfsManagerV3 average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
PnfsListDirectoryMessage 275.90± 0.32 1 127,943 1,314.32 17,254,002 12 days
...
As you can see max about 2 minutes. And ..... I think anything longer would just fail.
and folding stats
PnfsManagerV3.Folded requests failed
PnfsGetFileAttributes 6392185 0
PnfsListDirectoryMessage 2779888 0
Total 9172073 0
Whereas CMS T1 instance :
Statistics:
PnfsManagerV3 average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
PnfsListDirectoryMessage 4.71± 0.01 1 10,395 24.92 15,701,235 35 days
PnfsManagerV3.Folded requests failed
PnfsGetFileAttributes 140852 0
PnfsListDirectoryMessage 3 0
Total 140855 0
Typically I see no activity :
ssh -p 22224 admin@cmsdcadiskhead "\sn show list activity -times"
Currently no list activity
The 3 million object dir listing usually happens only a few times per day. It's a backup from an object store. That explains why there are so many objects in a single dir: object stores usually don't care about a directory structure.
Indeed, all dir listings that were so unlucky as to be assigned to the same thread will have to wait for that listing to complete. Even with 100 threads, I can see around ~10 listings being queued.
It's been a long time since we had the default of 2 threads. And yes, we had problems back then; that's why we increased it to 10. With 10, we had no problems, until the algorithm change in 8.2.23.
Srmls has a timeout of 2 minutes, but WebDAV operations don't. The 3 million objects listing is done through WebDAV. We've had to set up a special WebDAV door with a lot of memory, because otherwise it would OOM while providing the dir listing.
dcache.java.memory.heap=250G
dcache.java.memory.direct=100G
I'm aware we're pushing the boundaries here. 😉
The PnfsManager up to now had 8GB of heap, we've just increased it to 50 GB. (No OOM here though; just trying to avoid garbage collection)
The show list activity -times output is currently empty; I'll try again during "rush hour".
[root@dcmain ~]# dcache-admin-command PnfsManager 'info -a'
--- ha-service-leadership-manager (Coordinates which pnfsmanager updates the filesystem stat cache) ---
HA role: LEADER
--- name-space-provider (Name space provider) ---
Acl Enabled: true
DB : HikariDataSource (HikariPool-1)
DB Engine : PostgreSQL 11.20
rootID : 000000000000000000000000000000000000
FsId : 0
Statistics:
ChimeraNameSpaceProvider average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
addInodeLocation(FsInode|int|Strin 0.10± 0.00 0 60 0.31 2,648,476 28 hours
clearInodeLocation(FsInode|int|Str 0.07± 0.00 0 108 0.29 1,986,657 28 hours
createFile(FsInode|String|int|int| 0.59± 0.00 0 616 0.88 4,394,414 28 hours
find(FsInode) 0.03± 0.00 0 148 0.19 85,238,126 28 hours
getACL(FsInode) 0.03± 0.00 0 136 0.18 221,302,823 28 hours
getAllTags(FsInode) 0.05± 0.00 0 339 0.23 72,274,940 28 hours
getInfo() 0.01± 0.00 0 10 0.15 11,946 28 hours
getInodeChecksums(FsInode) 0.04± 0.00 0 185 0.19 72,292,628 28 hours
getInodeLocations(FsInode) 0.04± 0.00 0 171 0.21 69,568,667 28 hours
getInodeLocations(FsInode|int) 0.29± 0.13 0 1 0.47 14 26 hours
getLabels(FsInode) 0.16± 0.01 0 1 0.37 1,638 28 hours
getStorageInfo(FsInode) 0.04± 0.00 0 15 0.19 1,959,977 28 hours
getXattr(FsInode|String) 0.05± 0.00 0 219 0.23 51,344,696 28 hours
id2inode(String|StatCacheOption) 0.04± 0.00 0 123 0.21 57,420,721 28 hours
inode2id(FsInode) 0.00± 0.00 0 13 0.04 18,723,880 28 hours
inode2path(FsInode) 0.13± 0.00 0 15 0.34 5,047,577 28 hours
inodeOf(FsInode|String|StatCacheOp 0.07± 0.00 0 185 0.28 2,240,736 28 hours
listXattrs(FsInode) 0.05± 0.00 0 186 0.23 81,679,273 28 hours
mkdir(FsInode|String|int|int|int) 0.49± 0.00 0 9 0.60 18,719 28 hours
mkdir(FsInode|String|int|int|int|L 2.53± 0.09 0 30 2.69 1,000 28 hours
newDirectoryStream(FsInode) 14.06± 0.54 0 7,718 97.65 32,246 28 hours
path2inode(String) 0.21± 0.00 0 63 0.42 4,918,729 28 hours
path2inodes(String) 0.30± 0.00 0 188 0.50 30,173,679 28 hours
read(FsInode|int|long|byte[]|int|i 0.05± 0.00 0 1 0.21 5,711 28 hours
remove(FsInode|String|FsInode) 0.69± 0.00 0 409 1.61 1,982,024 28 hours
rename(FsInode|FsInode|String|FsIn 0.21± 0.00 0 7 0.42 38,622 28 hours
resolvePath(String) 0.29± 0.00 0 190 0.51 12,032,066 28 hours
setInodeAttributes(FsInode|int|Sta 0.12± 0.00 0 54 0.35 2,635,645 28 hours
setInodeChecksum(FsInode|int|Strin 0.06± 0.00 0 38 0.25 2,636,635 28 hours
setStorageInfo(FsInode|InodeStorag 0.36± 0.02 0 1 0.48 480 28 hours
setXattr(FsInode|String|byte[]|Set 0.13± 0.00 0 93 0.35 2,553,683 28 hours
stat(FsInode|int) 0.05± 0.00 0 75 0.23 8,276,882 28 hours
updateFsStat() 37,629.11± 718.01 35,251 51,974 3,799.34 28 27 hours
ChimeraNameSpaceProvider requests failed
addInodeLocation(FsInode|int|Strin 2648476 34
clearInodeLocation(FsInode|int|Str 1986657 0
createFile(FsInode|String|int|int| 4394414 1758952
find(FsInode) 85238126 0
getACL(FsInode) 221302823 0
getAllTags(FsInode) 72274940 0
getInfo() 11946 0
getInodeChecksums(FsInode) 72292628 0
getInodeLocations(FsInode) 69568667 0
getInodeLocations(FsInode|int) 14 0
getLabels(FsInode) 1638 0
getStorageInfo(FsInode) 1959977 0
getXattr(FsInode|String) 51344696 0
id2inode(String|StatCacheOption) 57420721 1923593
inode2id(FsInode) 18723880 0
inode2path(FsInode) 5047577 0
inodeOf(FsInode|String|StatCacheOp 2240736 257555
listXattrs(FsInode) 81679273 0
mkdir(FsInode|String|int|int|int) 18719 10792
mkdir(FsInode|String|int|int|int|L 1000 0
newDirectoryStream(FsInode) 32246 0
path2inode(String) 4918729 217914
path2inodes(String) 30173679 2144632
read(FsInode|int|long|byte[]|int|i 5711 0
remove(FsInode|String|FsInode) 1982024 1
rename(FsInode|FsInode|String|FsIn 38622 0
resolvePath(String) 12032066 0
setInodeAttributes(FsInode|int|Sta 2635645 116
setInodeChecksum(FsInode|int|Strin 2636635 0
setStorageInfo(FsInode|InodeStorag 480 0
setXattr(FsInode|String|byte[]|Set 2553683 0
stat(FsInode|int) 8276882 5618667
updateFsStat() 28 0
Total 813443338 11932256
--- pnfs-manager-ionotify ---
atime precision: Disabled
List queue: 0
Threads (24) Queue
[0] 0
[1] 0
[2] 0
[3] 0
[4] 0
[5] 0
[6] 0
[7] 0
[8] 0
[9] 0
[10] 0
[11] 0
[12] 0
[13] 0
[14] 0
[15] 0
[16] 0
[17] 0
[18] 0
[19] 0
[20] 0
[21] 0
[22] 0
[23] 0
Threads: 0
Statistics:
PnfsManagerV3 average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
PnfsAddCacheLocationMessage 0.38± 0.02 0 4 0.52 915 28 hours
PnfsCancelUpload 3.05± 0.50 1 28 3.73 55 28 hours
PnfsClearCacheLocationMessage 0.21± 0.00 0 109 0.50 1,986,987 28 hours
PnfsCommitUpload 2.87± 0.05 2 23 1.46 945 28 hours
PnfsCreateEntryMessage 1.65± 0.00 0 641 1.41 4,421,978 28 hours
PnfsCreateSymLinkMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsCreateUploadPath 3.86± 0.09 1 32 3.00 1,000 28 hours
PnfsDeleteEntryMessage 1.71± 0.00 0 432 2.07 2,200,809 28 hours
PnfsFlagMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsGetCacheLocationsMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsGetFileAttributes 0.71± 0.00 0 189 0.81 30,810,075 28 hours
PnfsGetParentMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsListDirectoryMessage 591.89± 64.05 1 967,950 11,537.54 32,448 28 hours
PnfsListExtendedAttributesMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsMapPathMessage 0.25± 0.00 0 18 0.43 2,432,662 28 hours
PnfsReadExtendedAttributesMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsRemoveChecksumMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsRemoveExtendedAttributesMessag NaN± NaN 0 0 NaN 0 28 hours
PnfsRemoveLabelsMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsRenameMessage 1.72± 0.00 0 60 0.97 37,677 28 hours
PnfsSetFileAttributes 0.58± 0.00 0 363 0.92 2,647,360 28 hours
PnfsWriteExtendedAttributesMessage NaN± NaN 0 0 NaN 0 28 hours
PoolFileFlushedMessage 0.83± 0.04 0 16 0.86 480 28 hours
PnfsManagerV3.Folded requests failed
PnfsGetFileAttributes 176923 0
PnfsListDirectoryMessage 38 0
Total 176961 0
Max 15 minutes for directory listing from the above. Very little folding, meaning the same directory is not being hit often enough to trigger folding. OK, let's wait for rush hour data.
P.S.: the srmls timeout of 2 minutes. What are those? Are they "gfal-ls" calls executed by some SAM tests? And is the 2 minutes "theirs", that is, a client timeout? Or is it our internal timeout:
srmmanager.service.pnfsmanager.timeout = 120
I believe that when this one is hit, srmls fails the request, but of course PnfsManager still keeps processing it.
So I suggest raising the above timeout to 15 minutes (based on your "info -a" output); similarly for webdav:
webdav.service.pnfsmanager.timeout = 120000
(mind the units). Same 2-minute timeout; it probably needs to be raised to 15 minutes as well.
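For example, taking the values quoted above at face value (seconds for the SRM setting, milliseconds for the WebDAV one), 15 minutes would look roughly like this; please double-check the units against the defaults of your dCache version:
srmmanager.service.pnfsmanager.timeout = 900
webdav.service.pnfsmanager.timeout = 900000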
I missed the rush hour, will try again tomorrow.
But I saw a large spike in our listing queue graphs. It appears we had a listing that lasted 4028 seconds. I didn't expect that. In the end, 18 other listings were waiting for this one to finish.
[root@install-fes ~]# dcache-admin-command PnfsManager 'info -a' | grep PnfsListDirectoryMessage
PnfsListDirectoryMessage 708.14± 81.45 1 4,028,875 20,249.28 61,801 2 days
PnfsListDirectoryMessage 74 0
Question: can you identify the large directory that takes 4K seconds to process, see what protocol issued the listing (srm, webdav?), and check that this listing indeed succeeded and did not fail with a timeout? Also, has the queue ever contained the same large directory (which could be an indication of a retry by the door)?
By now I realize that your usage pattern is different: occasionally or periodically your system runs a long listing, and with the new algorithm it backs a few unfortunate others up while it is running. This is opposed to our use case where many users run listings on the same, sometimes largish, directories at the same time. We typically ask users to keep entries under 10K. There are a few cases of up to ~200K, but never millions.
I think this calls for making the algorithm optional for now.
At the same time I can see if I can make it more sophisticated.
NB: the previous algorithm would fail if you had 10 such queries running: all 10 threads would be locked for 4K seconds, not allowing any other listings to go through. Increasing the number of listing threads would kill the DB. This is basically what we had run into.
I think we have a very good informative listing here. I ran the command in a loop every minute for some hours. This is the last listing with a very long listing operation, right before it completed and the queue was drained. When finally the very large dir had finished, all 22 waiting list operations completed within one minute (because the next listing a minute later was completely empty).
Thu Aug 3 11:40:19 CEST 2023
QUEUED REQUESTS
SOURCE USER ARRIVED STARTED PATH
SrmManager@srmmanagerDomain lofops 2023-08-03 10:35:57.209 (64 min ago) /pnfs/grid.sara.nl/data/lofar/ops/projects/lc0_003/151812
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 10:57:58.960 (42 min ago) /pnfs/grid.sara.nl/data/swiftbackup/rivm/owncloudc02
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 10:59:59.006 (40 min ago) /pnfs/grid.sara.nl/data/swiftbackup/rivm/owncloudc02
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:00:00.537 (40 min ago) /pnfs/grid.sara.nl/data/swiftbackup/uva/owncloudf70
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:02:00.568 (38 min ago) /pnfs/grid.sara.nl/data/swiftbackup/uva/owncloudf70
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:03:34.951 (36 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud18c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:05:34.977 (34 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud18c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:07:08.401 (33 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloud9ad
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:08:04.911 (32 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud3f3
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:09:08.467 (31 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloud9ad
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:10:05.194 (30 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud3f3
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:11:17.583 (29 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudacb
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:13:17.676 (27 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudacb
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:18:59.372 (21 min ago) /pnfs/grid.sara.nl/data/swiftbackup/buas/owncloud21c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:20:55.467 (19 min ago) /pnfs/grid.sara.nl/data/swiftbackup/zuyd/owncloude13
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:20:59.534 (19 min ago) /pnfs/grid.sara.nl/data/swiftbackup/buas/owncloud21c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:22:40.392 (17 min ago) /pnfs/grid.sara.nl/data/swiftbackup/che/owncloud894
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:22:55.520 (17 min ago) /pnfs/grid.sara.nl/data/swiftbackup/zuyd/owncloude13
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:24:40.459 (15 min ago) /pnfs/grid.sara.nl/data/swiftbackup/che/owncloud894
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:26:54.862 (13 min ago) /pnfs/grid.sara.nl/data/swiftbackup/hvhl/owncloud593
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:28:54.906 (11 min ago) /pnfs/grid.sara.nl/data/swiftbackup/hvhl/owncloud593
SrmManager@srmmanagerDomain lofops 2023-08-03 11:30:59.602 (9 min ago) /pnfs/grid.sara.nl/data/lofar/ops/projects/lc0_003/183367
ACTIVE REQUESTS
SOURCE USER ARRIVED STARTED PATH
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 10:33:17.862 (67 min ago) 2023-08-03 10:33:17.862 (67 min ago) /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751
Here's the associated graph:
Does /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751 have millions of entries? And do you have a way of finding out whether the listing actually succeeded (see my previous message)? Before I move on to touch the code, I want to make sure that we are not seeing PnfsManager just spinning its wheels. I think the WebDAV door access log would have an answer (whether the listing completed or timed out).
(The more I think about a listing that lasts > 1 hour, the less I am convinced that it is actually needed; it could be an artifact of some other activity, like an inadvertent "ls" whose only purpose is establishing that the destination directory exists. We have seen something like that at Fermilab.)
I can see that the dir listing gets an OOM error. It's already running with these settings (note the heap & direct memory, we've had to increase that several times in the past):
root 3152466 1 28 05:43 ? 01:23:22 /usr/bin/java -server -Xmx250G -XX:MaxDirectMemorySize=100G -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33000:34500 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/webdav2880-badger14Domain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.2.jar -Djava.net.preferIPv6Addresses=system -Djava.awt.headless=true -DwantLog4jSetup=n -Dorg.bouncycastle.dh.allow_unsafe_p_value=true -Djdk.tls.ephemeralDHKeySize=2048 -Djava.net.preferIPv6Addresses=false -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start webdav2880-badger14Domain
Here's the error:
07 Aug 2023 10:32:52 (webdav2880-badger14) [] /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751
java.lang.OutOfMemoryError: null
at java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125)
at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:119)
at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
at java.base/java.io.OutputStream.write(OutputStream.java:122)
at io.milton.http.XmlWriter.append(XmlWriter.java:73)
at io.milton.http.XmlWriter.writeElement(XmlWriter.java:421)
at io.milton.http.XmlWriter.writeElement(XmlWriter.java:127)
at io.milton.http.XmlWriter.writeProperty(XmlWriter.java:114)
at io.milton.http.values.ToStringValueWriter.writeValue(ToStringValueWriter.java:52)
at io.milton.http.values.ValueWriters.writeValue(ValueWriters.java:98)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.sendProperties(PropFindXmlGeneratorHelper.java:124)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.sendKnownProperties(PropFindXmlGeneratorHelper.java:114)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.appendResponse(PropFindXmlGeneratorHelper.java:102)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.appendResponses(PropFindXmlGeneratorHelper.java:91)
at io.milton.http.webdav.PropFindXmlGenerator.generate(PropFindXmlGenerator.java:60)
at io.milton.http.webdav.PropFindXmlGenerator.generate(PropFindXmlGenerator.java:69)
at io.milton.http.webdav.DefaultWebDavResponseHandler.respondPropFind(DefaultWebDavResponseHandler.java:202)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at org.dcache.webdav.AcceptAwareResponseHandler.respondPropFind(AcceptAwareResponseHandler.java:116)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at org.dcache.webdav.WorkaroundsResponseHandler.respondPropFind(WorkaroundsResponseHandler.java:138)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at io.milton.http.webdav.PropFindHandler.processExistingResource(PropFindHandler.java:156)
at io.milton.http.ResourceHandlerHelper.processResource(ResourceHandlerHelper.java:196)
at io.milton.http.webdav.PropFindHandler.processResource(PropFindHandler.java:112)
at io.milton.http.ResourceHandlerHelper.process(ResourceHandlerHelper.java:127)
at org.dcache.webdav.DcacheResourceHandlerHelper.process(DcacheResourceHandlerHelper.java:42)
at io.milton.http.webdav.PropFindHandler.process(PropFindHandler.java:106)
at org.dcache.webdav.DcacheStandardFilter.process(DcacheStandardFilter.java:50)
at io.milton.http.FilterChain.process(FilterChain.java:46)
at org.dcache.webdav.transfer.CopyFilter.process(CopyFilter.java:276)
at io.milton.http.FilterChain.process(FilterChain.java:46)
at io.milton.http.HttpManager.process(HttpManager.java:158)
at org.dcache.webdav.MiltonHandler.handle(MiltonHandler.java:77)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.dcache.http.AuthenticationHandler.access$001(AuthenticationHandler.java:55)
at org.dcache.http.AuthenticationHandler.lambda$handle$0(AuthenticationHandler.java:156)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
at org.dcache.http.AuthenticationHandler.handle(AuthenticationHandler.java:153)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.dcache.http.AbstractLoggingHandler.handle(AbstractLoggingHandler.java:110)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:400)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:645)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:392)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:555)
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:410)
at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:164)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.base/java.lang.Thread.run(Thread.java:829)
When I did an ls in an NFS-mounted /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751, it was like a DoS of our PnfsManager. The transfer success rate (for everyone, not only this user) dropped to 20%. After I restarted the PnfsManager things got back to normal. I don't know how NFS interacts with the PnfsManager, but I bet it's not through your algorithm 😉
I have asked the user to tell me how many files are supposed to be in this dir. Perhaps on the client side he might be able to see this more easily.
The client is Rclone. I think Rclone wants to get a dir listing because it only uploads files that don't exist yet in the destination dir, or have changed since last upload.
NFS bypasses PnfsManager altogether when running ls.
You can check the number of files by doing:
select inlink from t_inodes where ipnfsid = '<pnfsid of that directory>';
So, I think, it is fair to conclude that you have a problem with a large directory that is:
P.S.: I was surprised that the listing is not just timing out on the door end. Maybe it hits OOM sooner!?
[root@db1 ~]# psql -U postgres chimera -c "select inlink from t_inodes where ipnfsid = '00005D8D033B902A42249F3A7510D1146E28'"
inlink
----------
13961036
(1 row)
Wow, almost 14 million... Even more than I thought.
Thanks for looking into this. In the short term, it would be wonderful if the algorithm were configurable. For the future, I foresee more problems with such a huge and growing directory. We might need to set up a separate instance for them so they don't DoS other users. But that's a big operation that we can't do on short notice.
Returning from holiday, I became aware of this issue and ticket. To reinforce Onno's point, I'd like to share that our case (LOFAR Long-Term Archive) is certainly affected by the new behaviour. LOFAR is significantly different from the FERMI case as I understand it.
For LOFAR
As this looks like spurious behaviour, it was not identified until now, although we have seen a slight rise in unexplained operations at SURF that take much longer than we are used to, or fail. Looking into the logs for one of our services, it is nevertheless clear that starting July 1st the following error has been reported multiple times per day for listings at SURF: "List failed [rc=10006,msg=Timeout during directory listing.]"
From the thread I conclude that the issue is planned to be addressed by making the algorithm for directory listings configurable. For LOFAR this fix will be very important and much appreciated.
The problem will hit you when you have used up all available threads with the "old algorithm"; increasing the number of list threads will lead to database thrashing, leading to timeouts. But yes, we will make the listing behavior optional ASAP.
A pull request has been made.
@onnozweers are there subdirectories under /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751?
Not directly related to the issue, but we seem to be seeing long-running listings by WebDAV clients. We suspect that PROPFIND calls without a Depth header result in a recursive listing.
This is because even a 15M-entry directory containing just files should have completed faster than 70 minutes.
P.S.: NVM, not a recursive listing; the door receives your 15M entries and starts to loop over them, sending a PROPFIND request for each of them. This is where all the time is spent. I just got to the bottom of it.
We saw the patch in 8.2.28. Thanks a lot! I really appreciate the effort you put into this. We'll update soon.
About the PROPFIND, I'm not sure what you mean.
Do you mean that the door, while getting entries from the PnfsManager, also starts sending PROPFIND results? And that these slow down the listing operation in the PnfsManager?
Anyway, let me know if you need my help. I'll be out of office for a few weeks though.
PROPFIND calls done by WebDAV collect too much info (including file locality). As a result, "ls -l" slows down very significantly. Compare:
time gfal-ls -l https://foo.fnal.gov:2880/pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20 | wc -l
11937
real 3m0.176s
user 0m2.904s
sys 0m0.728s
vs
time gfal-ls gsiftp://foo.fnal.gov/pnfs/fnal.gov/usr//minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20 |wc
11937 11937 186268
real 0m1.607s
user 0m0.860s
sys 0m0.334s
I think it would be worth clarifying exactly what HTTP/WebDAV requests gfal-ls is making. For example, is the client making a single request for the directory?
Another important factor is how such a request is handled within the Milton WebDAV library. Does Milton trigger a single request to PnfsManager, or are there multiple requests?
Just to mention it: file locality is a dCache-specific WebDAV property. It shouldn't be returned unless the client explicitly asks for it. It could be that the WebDAV door is asking for the necessary information (pool and tape locality) from the PnfsManager even if the data isn't going to be used. If so, then that is a bug and should be fixed.
Hi @paulmillar . My first pass over the code suggests that these are always returned. I don't see options controlling whether to include locality. But maybe I'm not looking in the right place.
@paulmillar To me it looks like that when a listing is requested for a directory the WebDAV door loops over initial result and calls PROPFIND on constituents. All file attributes gathered.
private static final Set<FileAttribute> PROPFIND_ATTRIBUTES = Sets.union(
      EnumSet.of(CHECKSUM, ACCESS_LATENCY, RETENTION_POLICY),
      PoolMonitorV5.getRequiredAttributesForFileLocality());

....

private Set<FileAttribute> buildRequestedAttributes() {
    Set<FileAttribute> attributes = EnumSet.copyOf(REQUIRED_ATTRIBUTES);
    if (isDigestRequested()) {
        attributes.add(CHECKSUM);
    }
    if (isPropfindRequest()) {
        // FIXME: Unfortunately, Milton parses the request body after
        // requesting the Resource, so we cannot know which additional
        // attributes are being requested; therefore, we must request all
        // of them.
        attributes.addAll(PROPFIND_ATTRIBUTES);
    }
    return attributes;
}
(TBH, I already tried just hacking the above and dropping PROPFIND_ATTRIBUTES completely. I re-ran ls -l and got about the same result.)
But now it looks like Al is on it.
NB: gfal-ls https:// and gfal-ls -l https:// take about the same time.
@paulmillar
grep 'webdavDomain:AAYCqqTulTg' /root/one_listing.log | grep -c "class org.dcache.vehicles.PnfsGetFileAttributes processed"
10002
The log above contains debug level output of this listing:
gfal-ls -l https://foo.fnal.gov:2880/pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20 | wc -l
11937
real 3m0.176s
user 0m2.904s
sys 0m0.728s
@paulmillar just to add a bit, the pinboard of PnfsManager shows
....
28 Aug 2023 13:02:09 [PnfsManager-0] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] Using thread [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20/9885] 56
28 Aug 2023 13:02:09 [proc-56] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] class org.dcache.vehicles.PnfsGetFileAttributes processed in 3 ms
28 Aug 2023 13:02:09 [PnfsManager-0] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] Using thread [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20/9886] 55
28 Aug 2023 13:02:09 [proc-55] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] class org.dcache.vehicles.PnfsGetFileAttributes processed in 4 ms
28 Aug 2023 13:02:09 [PnfsManager-0] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] Using thread [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20/9887] 54
28 Aug 2023 13:02:09 [proc-54] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] class org.dcache.vehicles.PnfsGetFileAttributes processed in 3 ms
28 Aug 2023 13:02:09 [PnfsManager-0] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] Using thread [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20/9888] 53
28 Aug 2023 13:02:09 [proc-53] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] class org.dcache.vehicles.PnfsGetFileAttributes processed in 4 ms
28 Aug 2023 13:02:09 [PnfsManager-0] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] Using thread [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20/9889] 52
28 Aug 2023 13:02:09 [proc-52] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] class org.dcache.vehicles.PnfsGetFileAttributes processed in 3 ms
28 Aug 2023 13:02:09 [PnfsManager-0] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] Using thread [/pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20/989] 50
28 Aug 2023 13:02:09 [proc-50] [door:WebDAV-fndcaitb4-1@webdavDomain:AAYD/3pVHLA WebDAV-fndcaitb4-1 PnfsGetFileAttributes] class org.dcache.vehicles.PnfsGetFileAttributes processed in 3 ms
It looks like a PnfsGetFileAttributes message is sent by the door to PnfsManager, one message per entry in the directory.
(the pinboard snippet above corresponds to the
gfal-ls -l https://foo.fnal.gov:2880/pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20 | wc -l
11937
real 3m0.176s
user 0m2.904s
sys 0m0.728s
invocation)
So, DcacheDirectoryResource implements MakeCollectionableResource, which extends CollectionResource. The CollectionResource interface contains the method:
List<? extends Resource> getChildren()
For DcacheDirectoryResource, this does "the right thing": a single directory listing that creates Resource objects (either DcacheFileResource or DcacheDirectoryResource). However, something is triggering the subsequent PnfsManager requests.
My guess is that the problem is due to the attribute cache not being populated when the DcacheResource (DcacheFileResource or DcacheDirectoryResource) is created during the directory listing (see DcacheResource#getAllPropertyNames, for example).
After placing a dumpTrace() for the thread that calls the PnfsHandler, I saw that it is in order to determine whether the directories have a WriteToken or not (a fetch of the StorageInfo attribute). Yes, this is because the relevant cache does not contain those entries (so the cache loader fetches them).
What I have done is fork DcacheDirectoryResource into the standard one and one used specifically in list(), and for the latter not check those space properties. That sped things up greatly (as well as paring down the attributes for list to simply the POSIX ones). The listing times (these were posted by Dmitry) for webdav and xrdfs are now comparable:
[litvinse@host ~]$ time gfal-ls root://example.org/big-directory | wc -l
11937
real 0m2.994s
user 0m0.953s
sys 0m0.382s
[litvinse@host ~]$ time gfal-ls https://example.org:2880/big-directory | wc -l
11937
real 0m3.916s
user 0m1.651s
sys 0m0.356s
I'm not sure I follow.
If the problem (an extra PnfsManager round-trip per directory item) is triggered by a lack of information, why not update the list code to fetch that additional information as part of the directory listing?
I don't (yet) see why one would need to create an additional class (a fork of DcacheDirectoryResource) with different semantics.
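For what it's worth, a hypothetical sketch of that suggestion (not the actual dCache/Milton API; all names here are invented): fetch the attributes needed for PROPFIND in the one listing call, cache them per child, and let the child resources read from that cache instead of issuing a PnfsGetFileAttributes per entry.

import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch only: one listing populates an attribute cache; children read from it.
public class ListingAttributeCacheSketch {

    enum Attr { SIZE, MTIME, CHECKSUM, ACCESS_LATENCY, RETENTION_POLICY, LOCALITY }

    // Union of attributes the door would request in the single listing call.
    static final Set<Attr> LISTING_ATTRIBUTES =
            EnumSet.of(Attr.SIZE, Attr.MTIME, Attr.CHECKSUM, Attr.LOCALITY);

    private final Map<String, Map<Attr, Object>> attributeCache = new HashMap<>();

    // Stand-in for the single directory listing (one PnfsListDirectoryMessage in dCache terms).
    void listDirectory(Map<String, Map<Attr, Object>> entriesWithAttributes) {
        attributeCache.putAll(entriesWithAttributes);
    }

    // Child resources answer property queries from the cache; no extra round-trip per entry.
    Object getProperty(String childPath, Attr attribute) {
        Map<Attr, Object> attrs = attributeCache.get(childPath);
        return attrs == null ? null : attrs.get(attribute);
    }
}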
Why do we need that information for listing?
There are two problems that we are dealing with. Even after eliminating the extra fetch, with the listing of files and directories now being roughly the same, it still takes nearly two minutes to list only 1000 entries. This is because of all the extra properties (locality, storageinfo, etc.), which don't really show up in the POSIX listing output.
Dmitry and I have been trying to get the timing down to something reasonable (like what GridFTP does). I fixed it (basically) for xroot and am trying to do something similar for webdav.
But if you have another idea, perhaps you could describe it? Otherwise, we are still going to be burdened here at FNAL with listings that end up being a DoS.
I'm guessing what's happening here is that the PropFind request is triggering querying of the DAV:quota-available-bytes and/or DAV:quota-used-bytes properties. Either the client is specifying (in the PropFind request) that it wants these quota properties, or Milton/dCache is deciding to include either (or both) properties by default.
For directories, these properties trigger a PnfsManager lookup to see if the directory is in a token (i.e., whether the WriteToken tag is set) and use the space-manager information to return quota information.
Two possible solutions:
Yes, that is what is happening (I just stated it differently above).
Let me quote the code which prompted me to subclass the resource:
public Object getProperty(QName name) {
    Object value = super.getProperty(name);
    if (value != null) {
        return value;
    }
    try {
        if (name.equals(QUOTA_AVAILABLE)) {
            return _factory.spaceForPath(_path).getAvailableSpaceInBytes();
        }
        if (name.equals(QUOTA_USED)) {
            Space space = _factory.spaceForPath(_path);
            return space.getUsedSizeInBytes() + space.getAllocatedSpaceInBytes();
        }
    } catch (SpaceException e) {
        // this path has no space, treat as if property does not exist.
    }
    return null;
}

@Override
public PropertyMetaData getPropertyMetaData(QName name) {
    PropertyMetaData metadata = super.getPropertyMetaData(name);
    if (!_factory.isSpaceManaged(_path)) {
        return metadata;
    }
    // Milton accepts null and PropertyMetaData.UNKNOWN to mean the
    // property is unknown.
    if ((metadata == null || metadata.isUnknown()) && QUOTA_PROPERTIES.contains(name)) {
        metadata = READONLY_LONG;
    }
    return metadata;
}

@Override
public List<QName> getAllPropertyNames() {
    List<QName> genericNames = super.getAllPropertyNames();
    if (!_factory.isSpaceManaged(_path)) {
        return genericNames;
    }
    List<QName> names = new ArrayList<>(QUOTA_PROPERTIES.size() + genericNames.size());
    names.addAll(QUOTA_PROPERTIES);
    names.addAll(genericNames);
    return names;
}
Even if we eliminate those properties so that getProperty() is not called with them, the other two methods have a forced check of the cache (isSpaceManaged).
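For comparison, a rough, hypothetical sketch of the forking idea (the names echo the snippet above, but the wiring is invented): a resource created purely for a directory listing never advertises the quota properties, so neither isSpaceManaged() nor the WriteToken/space-manager lookup is triggered per child.

import java.util.ArrayList;
import java.util.List;

// Sketch only: skip quota properties for listing-created resources.
public class QuotaPropertySketch {

    static final List<String> QUOTA_PROPERTIES =
            List.of("quota-available-bytes", "quota-used-bytes");

    private final String path;
    private final boolean createdForListing;

    QuotaPropertySketch(String path, boolean createdForListing) {
        this.path = path;
        this.createdForListing = createdForListing;
    }

    List<String> getAllPropertyNames() {
        List<String> names = new ArrayList<>(
                List.of("displayname", "getlastmodified", "getcontentlength"));
        if (!createdForListing && isSpaceManaged(path)) {
            names.addAll(QUOTA_PROPERTIES); // only advertised outside of listings
        }
        return names;
    }

    private boolean isSpaceManaged(String path) {
        return false; // stand-in; the real check consults the space manager / WriteToken tag
    }
}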
Hi Dmitry,
I upgraded to 8.2.32 last Friday to revert to the old algorithm but it does not appear to work. Or did I do something wrong? 🤔
[namespaceDomain]
dcache.java.memory.direct=50g
dcache.java.memory.heap=50g
[namespaceDomain/pnfsmanager]
# pnfsmanager.limits.threads default is 12
pnfsmanager.limits.threads=24
#
# list-threads default 2
pnfsmanager.limits.list-threads = 30
#
# Fix dir listing queueing: https://github.com/dCache/dcache/issues/7252
pnfsmanager.enable.parallel-listing = true
#
# Chunk size default is 100, but that's way too small
# for dirs with 3 million files
pnfsmanager.limits.list-chunk-size = 10000
[root@db1 /etc/dcache]# dcache version
8.2.32
I understand this has become a long ticket and it's difficult to keep track of it. But could one of the dCache devs please have a look at my previous comment?
Cheers, Onno
Hi Onno,
sorry, I have not seen your activity on this ticket. I am frankly stumped. The switch :
pnfsmanager.enable.parallel-listing = true
reverts to the previous code path without any changes.
During the queuing, can you provide output of
\sn show list activity -times
and
\sn info -a
It will be easy to see what the situation is: is it still one ls being executed and a lot waiting, or are all 30 threads executing? (In my testing I observed the "old behavior": I ran many simultaneous listings of the same large directory and observed N threads processing the same directory.)
Or can you find a sufficiently large directory (say 5K entries) and do
for i in {1..10}
do
gfal-ls https://..... > $i.log 2>&1&
done
and run
ssh -p .... "\sn show list activity -times "
(may run several times)
dCache as you configured it should show 10 running listings.
Inversely, the
pnfsmanager.enable.parallel-listing = false
would show 1 running, 9 queued.
(so no need to wait for "rush hour").
I have installed 8.2.32, and set
pnfsmanager.enable.parallel-listing = true
pnfsmanager.limits.list-threads = 24
I run script:
$ cat ls_test.sh
#!/bin/bash
for i in {1..100};
do
gfal-ls https://fndcaitb4:2880/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20 > webdav_${i}.log 2>&1&
done
This is what see:
$ ssh -p 24223 enstore@fndcaitb3 "\sn show list activity -times"
QUEUED REQUESTS
SOURCE USER ARRIVED STARTED PATH
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.115 (3 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
...
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:12.240 (842 ms ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
ACTIVE REQUESTS
SOURCE USER ARRIVED STARTED PATH
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:07.771 (5 s ago) 2023-10-10 15:03:07.785 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:07.829 (5 s ago) 2023-10-10 15:03:07.831 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:08.030 (5 s ago) 2023-10-10 15:03:08.031 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:08.030 (5 s ago) 2023-10-10 15:03:08.032 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:08.032 (5 s ago) 2023-10-10 15:03:08.034 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:08.033 (5 s ago) 2023-10-10 15:03:08.034 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:08.034 (5 s ago) 2023-10-10 15:03:08.036 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:08.036 (5 s ago) 2023-10-10 15:03:08.037 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:08.037 (5 s ago) 2023-10-10 15:03:08.038 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.057 (4 s ago) 2023-10-10 15:03:09.062 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.058 (4 s ago) 2023-10-10 15:03:09.062 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.058 (4 s ago) 2023-10-10 15:03:09.063 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.059 (4 s ago) 2023-10-10 15:03:09.063 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.060 (4 s ago) 2023-10-10 15:03:09.063 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.078 (4 s ago) 2023-10-10 15:03:09.080 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.079 (4 s ago) 2023-10-10 15:03:09.120 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.109 (4 s ago) 2023-10-10 15:03:09.124 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.113 (4 s ago) 2023-10-10 15:03:09.135 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.109 (4 s ago) 2023-10-10 15:03:09.125 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.113 (4 s ago) 2023-10-10 15:03:09.135 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.114 (4 s ago) 2023-10-10 15:03:09.136 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.114 (4 s ago) 2023-10-10 15:03:09.136 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.115 (4 s ago) 2023-10-10 15:03:09.136 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:03:09.115 (4 s ago) 2023-10-10 15:03:09.136 (4 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
24 threads busy listing the same directory, while anything in excess of 24 is QUEUED.
I have:
[litvinse@fnisd1 ~]$ wc -l webdav_70.log
11937 webdav_70.log
11937 entries in that directory.
Now, I set:
pnfsmanager.enable.parallel-listing = false
pnfsmanager.limits.list-threads = 24
And re-run the same script. I see:
[litvinse@orbis ~]$ ssh -p 24223 enstore@fndcaitb3 "\sn show list activity -times"
QUEUED REQUESTS
SOURCE USER ARRIVED STARTED PATH
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:10:34.047 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:10:34.097 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:10:34.360 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:10:34.360 (5 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
...
ACTIVE REQUESTS
SOURCE USER ARRIVED STARTED PATH
WebDAV-fndcaitb4-1@webdavDomain litvinse 2023-10-10 15:10:33.907 (6 s ago) 2023-10-10 15:10:33.918 (6 s ago) /pnfs/fnal.gov/usr/minerva/mc_generation/mc-cal-hepevent/mc/v10r8p9/00/01/12/20
So PnfsManager behaves like I would expect ("old style" vs "new style")
(The difference with the "new style" is that all these QUEUED requests completed immediately after the first one was done.)
So can you do something like that so we can see what is going on? No need to go to the extreme of 100 listings; maybe twice the number of listing threads.
Also: if you can add to that graph the number of queued and the number of ACTIVE requests, that would also answer the question of whether all threads are busy or not. If all threads are busy, you will get queuing (this is what we used to run into at Fermilab all the time, and that's why this patch was made: parallel -> sequential for a given listing).
"new style" (pnfsmanager.enable.parallel-listing = false
) all 100 listings completed under 1.5 minutes (and I could listings of other directories meanwhile).
"old style" (pnfsmanager.enable.parallel-listing = true
) all 100 listings completed in 2 minutes, but no other listings could be made while all these were being processed using all available threads.
Hi Dmitry,
I'm as stumped as you are.
I created a dir in my dCache home dir with more than 10,000 small files, then ran gfal-ls as you described:
onno$ for session in {1..10} ; do { gfal-ls https://webdav.grid.surfsara.nl:2884/pnfs/grid.sara.nl/data/users/onno/disk/largedir | wc -l ; } & done
This takes ~6 seconds.
Meanwhile in the PnfsManager:
[root@install-fes ~]# dcache-admin-command PnfsManager 'show list activity -times'
QUEUED REQUESTS
SOURCE USER ARRIVED STARTED PATH
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.153 (621 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.162 (612 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.176 (598 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.176 (598 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.176 (598 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.179 (595 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.181 (593 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.185 (589 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.191 (583 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
ACTIVE REQUESTS
SOURCE USER ARRIVED STARTED PATH
webdav2884-penguin14@webdav2884-penguin14Domain onno 2023-10-11 17:00:25.151 (623 ms ago) 2023-10-11 17:00:25.151 (623 ms ago) /pnfs/grid.sara.nl/data/users/onno/disk/largedir
webdav2880-seacow14@webdav2880-seacow14Domain return-mmilenkovic 2023-10-11 17:00:25.724 (50 ms ago) 2023-10-11 17:00:25.724 (50 ms ago) /pnfs/grid.sara.nl/data/return/disk/S1_SA_UPSCALE/E069N075T3
webdav2880-seacow12@webdav2880-seacow12Domain return-mmilenkovic 2023-10-11 17:00:25.696 (78 ms ago) 2023-10-11 17:00:25.696 (78 ms ago) /pnfs/grid.sara.nl/data/return/disk/S1_SA_UPSCALE/E069N072T3
This looks like the new algorithm to me... No idea why.
Hi Onno,
Yes, it does look exactly like the new algo. I hate it when unexplained phenomena happen. Do you have a test system where we could at least turn on some debugging, maybe?
(Meanwhile I will investigate whether logging can be set up so that we can see what is passed to the Spring cell.)
By all means it looks like pnfsmanager.enable.parallel-listing = true is somehow ignored in your setup.
Does running dcache check-config on your pnfsmanager host say anything?
Do you have a single PnfsManager or multiple?
How flexible are you about deploying on a test system a slightly modified 8.2 where I could add an INFO message reporting the value of this variable after initialization (if all other ways to infer what is being passed have failed)?
Dear dCache devs,
I have posted this also on the user forum but there was no reply. However, this issue is affecting our users.
We have configured pnfsmanager.limits.list-threads so that 10 directory listings can be processed simultaneously instead of the default 2.
And this used to work. But now when I look at the list activity, I see only one request active, and the others queued.
I'm afraid this causes problems for our users: see the attached graph. We have some extremely large directories (3 million objects), so we absolutely need some parallelism.
Cheers, Onno