GoogleCloudPlatform / gcsfuse

A user-space file system for interacting with Google Cloud Storage
https://cloud.google.com/storage/docs/gcs-fuse
Apache License 2.0

Support cache exclusion based on file name pattern #2704

Open siddharthab opened 1 day ago

siddharthab commented 1 day ago

Feature Request

The current file-cache logic assumes that if a read starts at offset 0, the entire file will be read, so it preemptively caches the whole file on disk. This assumption breaks down for files that begin with a header, or for clients that perform magic-number checks: both read the start of the file without intending to read the rest.

For example, in bioinformatics, BAM files can be 100+ GB in size. They are typically stored remotely and accessed only through random reads, usually enabled by a separate index file. However, these files also store metadata in the file header, which all clients will want to read first. Reading that header makes gcsfuse assume that the entire file will be read, so it begins caching the whole file in the on-disk cache. This can very quickly deplete the available cache capacity. A user might therefore want to exclude only these special files while still caching all other files.

Proposed solution

The configuration could include an option to exclude files from the on-disk cache when their names match certain patterns. An example implementation is provided at #2043.
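
To make the proposal concrete, here is a minimal sketch of name-based exclusion in Go. It is not the code from #2043, which may differ. The type `cacheExcluder` and the functions `newCacheExcluder` and `shouldExclude` are hypothetical names chosen for illustration, and matching against the full object name is an assumption. Go's standard `regexp` package uses RE2 syntax.

```go
package main

import (
	"fmt"
	"regexp"
)

// cacheExcluder decides whether an object should bypass the on-disk cache.
// Hypothetical type for illustration only; it is not part of gcsfuse.
type cacheExcluder struct {
	pattern *regexp.Regexp // nil means no object is excluded
}

// newCacheExcluder compiles the user-supplied expression once at startup.
// An empty expression disables exclusion; an invalid one is reported as an
// error so that a bad configuration fails early.
func newCacheExcluder(expr string) (*cacheExcluder, error) {
	if expr == "" {
		return &cacheExcluder{}, nil
	}
	re, err := regexp.Compile(expr)
	if err != nil {
		return nil, fmt.Errorf("invalid cache exclusion regex %q: %w", expr, err)
	}
	return &cacheExcluder{pattern: re}, nil
}

// shouldExclude reports whether objectName matches the exclusion pattern.
// Assumption: the pattern is matched against the full object name.
func (e *cacheExcluder) shouldExclude(objectName string) bool {
	return e.pattern != nil && e.pattern.MatchString(objectName)
}
```

With the pattern `\.bam$`, a call such as `shouldExclude("samples/reads.bam")` would skip the cache for the BAM file, while its index `samples/reads.bam.bai` would still be cached. The read path would consult this check before deciding to start a cache download.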

ashmeenkaur commented 1 day ago

Using a regex pattern to exclude certain files from the file cache seems like a good addition. I can see that this might be useful for other use cases as well where customers want to exclude certain files from the cache.

What are your thoughts on this, @marcoa6? Can we proceed with the implementation behind an experimental flag?

cc: @vadlakondaswetha @charith87

marcoa6 commented 15 hours ago

LGTM

ashmeenkaur commented 2 hours ago

Thanks @marcoa6! @siddharthab, we can proceed with the implementation behind the hidden flag experimental-file-cache-exclude-regex.

siddharthab commented 1 hour ago

Thanks. I will redo the PR following the testing instructions in the contribution guide.