dCache / nfs4j

Pure Java NFSv3 and NFSv4.2 implementation

Listing of large directories is not sequential #62

Closed. aasenov closed this issue 5 years ago.

aasenov commented 6 years ago

When you list large directories, the list() method is called roughly once every 100 elements. I keep track of the elements already retrieved from the set and the ones that are left to retrieve. But every time a subsequent list() is called, the cookie is not the expected one. For example, on the first list() call, 115 elements are retrieved from the NavigableSet, but the next list() call comes with cookie = 100, so I have to move my pointer backwards.

When I set org.dcache.nfs.v4.OperationREADDIR logging to debug, I saw the following output:

```
Sending 115 entries (32556 bytes from 32680, dircount = 8170) cookie = 0 EOF=false
Sending 115 entries (32552 bytes from 32768, dircount = 8192) cookie = 100 EOF=false
Sending 115 entries (32540 bytes from 32768, dircount = 8192) cookie = 202 EOF=false
Sending 115 entries (32554 bytes from 32768, dircount = 8192) cookie = 304 EOF=false
```

Every time, the cookie received is not equal to the initial cookie plus the number of entries sent. This is an issue on my end, as I use a database-like file system: when a directory contains thousands of entries, I collect the entries lazily, loading only 1k files into memory at a time. When those are exhausted, I drop them and load the next 1k. So I may hit a case where I have already moved on to the next batch of 1k files, but the client wants to continue listing from an element that was in the previous batch.
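To make the failure concrete, here is a minimal sketch of the forward-only approach described above (the list(cookie, max) shape and all names are illustrative assumptions, not the actual nfs4j VirtualFileSystem API): it assumes each request resumes exactly where the previous reply ended, which the debug log shows is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ForwardOnlyListing {
    private final Iterator<String> lazyEntries; // streamed from the database
    private long nextCookie = 0;                // where we expect the client to resume

    ForwardOnlyListing(Iterator<String> lazyEntries) {
        this.lazyEntries = lazyEntries;
    }

    List<String> list(long cookie, int max) {
        if (cookie != nextCookie) {
            // Fires on the second request in the log above: 115 entries
            // were sent, but the client resumes at cookie 100, which may
            // point into a batch this loader has already discarded.
            throw new IllegalStateException(
                    "expected cookie " + nextCookie + ", got " + cookie);
        }
        List<String> reply = new ArrayList<>();
        while (lazyEntries.hasNext() && reply.size() < max) {
            reply.add(lazyEntries.next());
            nextCookie++;
        }
        return reply;
    }
}
```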

kofemann commented 6 years ago

Looks like this behavior is related to the Linux kernel's internal memory alignment. Though each readdir request asks for at most a 32k reply, as soon as the pre-allocated 32K of memory is filled up, a new 32k portion is allocated, and the next readdir request starts with the cookie of the last entry in the previously used page.

kofemann commented 6 years ago

According to the Linux NFS developers, this is expected behavior, though it is not optimal:

https://www.spinics.net/lists/linux-nfs/msg70147.html

To handle such situations, the VfsCache should be used. However, I understand that you want to avoid an initial full directory listing to populate the cache. For your use case, a smarter, page-based caching scheme will make more sense.
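A rough illustration of such page-based caching (the class, the fixed page size, and the fetchPage() backend call are all assumptions for the sketch, not nfs4j API): pages are keyed by cookie / PAGE_SIZE, and a few recently used pages are retained, so a cookie that steps slightly backwards still lands in a cached page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PagedDirectoryCache {
    private static final int PAGE_SIZE = 1000; // matches the 1k batches above
    private static final int MAX_PAGES = 4;    // keep a few recent pages

    // Access-ordered map that evicts the least recently used page.
    private final Map<Long, List<String>> pages =
            new LinkedHashMap<Long, List<String>>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, List<String>> eldest) {
                    return size() > MAX_PAGES;
                }
            };

    String entryAt(long cookie) {
        long page = cookie / PAGE_SIZE;
        int offset = (int) (cookie % PAGE_SIZE);
        List<String> entries = pages.get(page);
        if (entries == null) {
            entries = fetchPage(page);
            pages.put(page, entries);
        }
        return entries.get(offset);
    }

    private List<String> fetchPage(long page) {
        // Placeholder: a real server would query the database for
        // entries [page * PAGE_SIZE, (page + 1) * PAGE_SIZE).
        List<String> entries = new ArrayList<>(PAGE_SIZE);
        for (int i = 0; i < PAGE_SIZE; i++) {
            entries.add("entry-" + (page * PAGE_SIZE + i));
        }
        return entries;
    }
}
```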

aasenov commented 6 years ago

Thanks for the fast response. I'll implement some batch caching, as loading an entire directory into memory is not feasible. My set will keep the last batch of elements sent, and if the next read requests any of them, I'll retrieve them from the cache and then continue with the rest.
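A minimal sketch of that plan (the list(cookie, max) shape and the fetchNext() backend cursor are assumptions for illustration, not nfs4j API): the last batch sent is kept in a NavigableMap keyed by cookie, rewinds are served from it, and entries at or below the acknowledged cookie are evicted, assuming request cookies never move backwards past an earlier request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class LastBatchCache {
    /** Entries already handed to the client, keyed by their cookie. */
    private final NavigableMap<Long, String> sent = new TreeMap<>();
    private long nextCookie = 1; // cookie assigned to the next fresh entry

    List<String> list(long cookie, int max) {
        // Entries at or below the requested cookie have been acknowledged
        // by the client and will not be requested again; evict them.
        sent.headMap(cookie, true).clear();

        List<String> reply = new ArrayList<>();

        // Rewind case: the client asks for entries we already sent
        // (e.g. cookie 100 after 115 entries went out); serve them
        // from the cached batch instead of the exhausted backend.
        for (String name : sent.values()) {
            if (reply.size() == max) return reply;
            reply.add(name);
        }

        // Continue with fresh entries from the lazy backend stream.
        while (reply.size() < max) {
            String name = fetchNext();   // hypothetical 1k-batch cursor
            if (name == null) break;     // end of directory
            sent.put(nextCookie++, name);
            reply.add(name);
        }
        return reply;
    }

    private String fetchNext() {
        // Placeholder for the database-backed cursor that loads
        // entries 1000 at a time, as described above.
        return null;
    }
}
```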