awslabs / mountpoint-s3

A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.
Apache License 2.0

Bazillion of ListBucket issued #938

Open fredDJSonos opened 4 months ago

fredDJSonos commented 4 months ago

Mountpoint for Amazon S3 version

mount-s3 1.7.0

AWS Region

us-east-1

Describe the running environment

Running inside an EKS cluster with mountpoint-s3-csi-driver.

Mountpoint options

mountOptions:
      - allow-other
      - region us-east-1
      - cache /tmp  # specify cache directory, relative to root host filesystem
      - metadata-ttl 600  # 10min time to live for metadata cached
      - max-cache-size 1024  # 1024MB maximum cache size

What happened?

Our business code essentially opens a file at a given path to read its content. It might stat a given path, but no directory listing whatsoever happens. If we were to use S3 directly, we would just call GetObject and nothing else.

We investigated using mountpoint-s3 and discovered that the dominant cost (from Cost Explorer) is the ListBucket action.
For historical reasons, we have a folder structure inherited from a real FS. It looks like this:

/stuff/a/aa/aaXXXX.dat
···
/stuff/a/ab/abXXXX.dat
···
/stuff/b/ba, bb, bc, bd, …, bz
/stuff/c, d, e, … z

We have 150 million files distributed in this folder structure.

I'm aware of this issue: https://github.com/awslabs/mountpoint-s3/issues/770. I wonder if you could propose an implementation that does no lookup for intermediate folders. You could pretend to FUSE that all possible directory paths exist, without checking that on S3. When there is a syscall to get a file or list the contents of a directory, then and only then, you would call S3.

Relevant log output

No response

fredDJSonos commented 4 months ago

I realise, writing this, that my situation should improve by raising the metadata-ttl (making it unlimited).

fredDJSonos commented 4 months ago

Well, in fact, setting metadata-ttl to unlimited does not appear to change anything. We still observe a giant number of ListBucket requests (45% of all requests), and they account for 90% of the bill.

I'm not an expert on how AWS charges for ListBucket, but is it expected that ListBucket is so much more expensive than GetObject and HeadObject (each about 5% of the total cost)?

Unless there is a bug in your usage of ListBucket? You might be transferring the entire prefix of /stuff/a/ab when you just want to test whether /stuff/a/ab is a directory. Is that a possibility?

dannycjones commented 4 months ago

Hey @fredDJSonos,

I realise, writing this, that my situation should improve by raising the metadata-ttl (making it unlimited).

Yes, if your workload can tolerate stale entries, or it's even expected that the bucket content won't change, we'd recommend picking the longest reasonable TTL. If you never expect the content to change during the workload, you can use --metadata-ttl indefinite. This caches results for FUSE lookup requests, which are used by the kernel to build its own tree of files and directories, but also to serve open and stat system calls.

I wonder if you could propose an implementation that does no lookup for intermediate folders. You could pretend to FUSE that all possible directory paths exist, without checking that on S3. When there is a syscall to get a file or list the contents of a directory, then and only then, you would call S3.

Thanks for sharing the suggestion. It's something we've considered. The method for learning about a directory entry in FUSE does not include the purpose of the request: unfortunately, the protocol does not indicate whether the application wants to learn about a file or a directory. This means that once we tell the kernel that some path component is a directory, it will treat it like a directory from that point on without consulting Mountpoint. It's also a challenge faced in #891, where we want to allow access to directories within a bucket without having access to the paths at the root.
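For illustration, here is roughly what that lookup entry point looks like in the fuser crate that Mountpoint builds on (a sketch, not Mountpoint's actual code): the request carries only the parent inode and the child name, with no hint about whether the caller wants a file or a directory.

```rust
use std::ffi::OsStr;
use fuser::{Filesystem, ReplyEntry, Request};

struct Fs;

impl Filesystem for Fs {
    // The kernel asks "what is `name` under `parent`?". Nothing here says
    // whether the caller wants a file or a directory, so an S3 filesystem
    // must find out for itself, e.g. by listing the prefix.
    fn lookup(&mut self, _req: &Request<'_>, parent: u64, name: &OsStr, reply: ReplyEntry) {
        let _ = (parent, name);
        // ... determine the entry's type, then reply.entry(...) with its
        // attributes, or reply.error(libc::ENOENT) if it doesn't exist.
        reply.error(libc::ENOENT);
    }
}
```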

Well, in fact, setting metadata-ttl to unlimited does not appear to change anything. We still observe a giant number of ListBucket requests (45% of all requests), and they account for 90% of the bill.

It does depend on the key space. If your workload can tolerate stale entries, or it's even expected that the bucket content won't change, we'd recommend picking the longest reasonable TTL. It will ensure that repeated lookups can be served from the cache and don't need to go to S3. That means that for opening files /stuff/a/ab and /stuff/a/ac, the FUSE lookup requests for /stuff/ and /stuff/a/ can be served from the cache the second time around.

Without metadata caching, the number of requests for opening paths could be expressed as O(depth * n), where n is the number of files and depth is the level at which each file is nested. By turning on metadata caching you can eliminate the depth factor, but you still have a lookup for each file.
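As a back-of-the-envelope illustration of that model, using the 150 million files and three-level layout from this issue (assumed numbers, not measurements):

```rust
fn main() {
    // Layout from the issue: /stuff/a/ab/abXXXX.dat, i.e. each file sits
    // three directories deep, so an uncached open triggers four lookups
    // (stuff, a, ab, abXXXX.dat), each of which may call S3.
    let files: u64 = 150_000_000;
    let depth: u64 = 3;

    let uncached = files * (depth + 1); // O(depth * n): 600M lookups
    let cached = files; // directory lookups hit the cache; one per file remains
    println!("without cache: {uncached} lookups, with cache: {cached}");
}
```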

If it's possible, performing a listing of the directory before opening the files can help here, as it does one listing through the prefix, which allows all the children to be cached.
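For example, a minimal sketch of warming the cache this way from the application side (the mount path is a hypothetical example):

```rust
use std::fs;
use std::io;

// One readdir on the mounted directory becomes one listing through the S3
// prefix, caching metadata for every child, instead of one lookup
// round-trip per file at open time.
fn warm(dir: &str) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let _ = entry?; // iterating is enough; we don't use the entries
    }
    Ok(())
}

fn main() -> io::Result<()> {
    warm("/mnt/bucket/stuff/a/ab") // hypothetical mount path
}
```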

I'm not an expert on how AWS charges for ListBucket, but is it expected that ListBucket is so much more expensive than GetObject and HeadObject (each about 5% of the total cost)?

ListObjectsV2 (referenced as ListBucket in billing) does cost more than object-level requests. The pricing for your region is available on the pricing page under "Requests & data retrievals": https://aws.amazon.com/s3/pricing/

Unless there is a bug in your usage of ListBucket? You might be transferring the entire prefix of /stuff/a/ab when you just want to test whether /stuff/a/ab is a directory. Is that a possibility?

It's not possible to avoid the traversal, although we certainly wish the protocol could support it. We do actually implement a small amount of caching (1 second) even when caching is turned off, to avoid immediately making calls for the same directory again. The best option, though, if you can, is to extend the metadata TTL to as long a duration as works for your workload.


Ultimately, I'd recommend extending the metadata TTL to the longest duration your workload can tolerate and, where possible, listing directories before opening the files inside them.

fredDJSonos commented 4 months ago

Thanks for your answer. Just to be clear, our last experiment was with --metadata-ttl indefinite, and my previous comment was about the fact that it did not change anything (same amount of ListBucket requests). New config:

    mountOptions:
      - allow-other
      - region us-east-1
      - cache /tmp  # specify cache directory, relative to root host filesystem
      - metadata-ttl indefinite  # https://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md#metadata-cache
      - max-cache-size 512  # 512MB maximum cache size
      - max-threads 64  # increasing max-threads

fredDJSonos commented 4 months ago

I wonder if you could propose an implementation that does no lookup for intermediate folders. You could pretend to FUSE that all possible directory paths exist, without checking that on S3. When there is a syscall to get a file or list the contents of a directory, then and only then, you would call S3.

Thanks for sharing the suggestion. It's something we've considered. The method for learning about a directory entry in FUSE does not include the purpose of the request: unfortunately, the protocol does not indicate whether the application wants to learn about a file or a directory. This means that once we tell the kernel that some path component is a directory, it will treat it like a directory from that point on without consulting Mountpoint. It's also a challenge faced in #891, where we want to allow access to directories within a bucket without having access to the paths at the root.

I guess you're talking about the lookup handler you have to provide to FUSE. When you reply with fuse_reply_entry, the struct fuse_entry_param has two fields, attr_timeout and entry_timeout, that can probably be used to tell the kernel to stop caching anything. The FUSE API has to be robust against any filesystem that mutates on its own (so it is legitimate to have a pseudo, imaginary directory that suddenly turns into a regular file).
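A minimal sketch of that idea using the fuser crate (which collapses libfuse's attr_timeout and entry_timeout into a single ttl on the reply); the attribute values are invented placeholders, not Mountpoint's behaviour:

```rust
use std::ffi::OsStr;
use std::time::{Duration, SystemTime};
use fuser::{FileAttr, FileType, Filesystem, ReplyEntry, Request};

struct PretendDirs;

impl Filesystem for PretendDirs {
    fn lookup(&mut self, _req: &Request<'_>, _parent: u64, _name: &OsStr, reply: ReplyEntry) {
        // Pretend every path component is a directory, without calling S3.
        let attr = FileAttr {
            ino: 2, // placeholder; a real implementation allocates inodes
            size: 0,
            blocks: 0,
            atime: SystemTime::UNIX_EPOCH,
            mtime: SystemTime::UNIX_EPOCH,
            ctime: SystemTime::UNIX_EPOCH,
            crtime: SystemTime::UNIX_EPOCH,
            kind: FileType::Directory,
            perm: 0o755,
            nlink: 2,
            uid: 0,
            gid: 0,
            rdev: 0,
            blksize: 4096,
            flags: 0,
        };
        // A zero TTL is the attr_timeout/entry_timeout trick from above:
        // the kernel must re-ask on every access, so this "directory" can
        // later turn out to be a regular file.
        reply.entry(&Duration::ZERO, &attr, 0);
    }
}
```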

That would also solve #891.

In the end this gives a weird filesystem where all possible directories appear to exist. But since directories don't really exist in S3, that's OK.

fredDJSonos commented 4 months ago

In case there is a problem when the kernel does a lookup on a real file and we pretend it is a directory (maybe that breaks the open syscall; I'm not familiar with the details of the Linux VFS), there could be an intermediate strategy: we only do a HeadObject in lookup (no more ListBucket).

That still works for #891.
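A rough sketch of that intermediate strategy, using the aws-sdk-s3 crate (the error handling is deliberately simplified; a real implementation would at least distinguish a 404 from other failures):

```rust
use aws_sdk_s3::Client;

enum Kind {
    File,
    Directory,
}

// One HeadObject per lookup, never ListObjectsV2: if the exact key exists,
// the path is a file; otherwise we pretend it is a directory (replied with
// a zero TTL, as above, so the kernel never caches the guess).
async fn classify(client: &Client, bucket: &str, key: &str) -> Kind {
    match client.head_object().bucket(bucket).key(key).send().await {
        Ok(_) => Kind::File,
        // Simplified: only a NotFound error should really mean "directory";
        // other errors should be surfaced to the caller instead.
        Err(_) => Kind::Directory,
    }
}
```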