Unidata / netcdf-java

The Unidata netcdf-java library
https://docs.unidata.ucar.edu/netcdf-java/current/userguide/index.html
BSD 3-Clause "New" or "Revised" License

Fix MFileS3::exists to work for "directories" #1170

Closed tdrwenski closed 1 year ago

tdrwenski commented 1 year ago

Description of Changes

Part of the fixes needed for S3 datasetScans.

The current MFileS3::exists only works for objects and uses a headObject request. This PR extends it to work for buckets (which need a headBucket request) and "directories" (by doing a listObjects request and checking that it returns objects).
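To make the three cases concrete, here is a minimal sketch of the dispatch described above. It is not the actual MFileS3 code: the `S3Probe` interface, the `ExistsSketch` class, and the in-memory fake are hypothetical stand-ins for the real headBucket, headObject, and listObjects requests, so that the logic can be run without AWS.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical abstraction over the three S3 requests; not the netcdf-java or AWS SDK API. */
interface S3Probe {
  boolean headBucket(String bucket);
  boolean headObject(String bucket, String key);
  /** Stands in for a listObjects request that reports whether anything came back. */
  boolean hasObjectsWithPrefix(String bucket, String prefix);
}

final class ExistsSketch {

  /** Picks the request type based on whether we are looking at a bucket, a "directory", or an object. */
  static boolean exists(S3Probe s3, String bucket, String key, String delimiter) {
    if (key == null || key.isEmpty()) {
      return s3.headBucket(bucket); // bucket only
    }
    boolean isDirectory = delimiter != null && !delimiter.isEmpty() && key.endsWith(delimiter);
    if (isDirectory) {
      return s3.hasObjectsWithPrefix(bucket, key); // "directory": exists if it contains objects
    }
    return s3.headObject(bucket, key); // plain object
  }

  /** In-memory fake (bucket name -> object keys), used only for illustration. */
  static S3Probe inMemory(Map<String, List<String>> buckets) {
    return new S3Probe() {
      @Override
      public boolean headBucket(String bucket) {
        return buckets.containsKey(bucket);
      }

      @Override
      public boolean headObject(String bucket, String key) {
        return buckets.getOrDefault(bucket, List.of()).contains(key);
      }

      @Override
      public boolean hasObjectsWithPrefix(String bucket, String prefix) {
        return buckets.getOrDefault(bucket, List.of()).stream().anyMatch(k -> k.startsWith(prefix));
      }
    };
  }

  public static void main(String[] args) {
    S3Probe s3 = inMemory(Map.of("myBucket", List.of("dir/a.nc", "top.nc")));
    System.out.println(exists(s3, "myBucket", null, "/"));     // bucket
    System.out.println(exists(s3, "myBucket", "dir/", "/"));   // "directory"
    System.out.println(exists(s3, "myBucket", "top.nc", "/")); // object
  }
}
```

Note that in this sketch a key such as dir/ is only treated as a "directory" when a delimiter is supplied; without one it falls through to the headObject path.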

Note that "directories" are defined in our code as URIs that have a delimiter fragment and, unless the URI is just a bucket, a key that ends with that delimiter (e.g. cdms3:myBucket#delimiter=/ or cdms3:myBucket?myKey/#delimiter=/).
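As an illustration of that definition, a rough sketch of the check might look like the following. The class and method names are hypothetical, and the parsing assumes only the simple cdms3:bucket?key#delimiter=X form used in the examples above (no authority part and no other fragment entries), so it is not a substitute for the library's own URI handling.

```java
final class Cdms3DirectoryCheck {
  private static final String DELIMITER_PREFIX = "delimiter=";

  /** Returns true if the URI has a delimiter fragment and is either a bare bucket or a key ending in that delimiter. */
  static boolean isDirectory(String uri) {
    int hash = uri.indexOf('#');
    if (hash < 0) {
      return false; // no fragment at all, so no delimiter
    }
    String fragment = uri.substring(hash + 1);
    if (!fragment.startsWith(DELIMITER_PREFIX)) {
      return false;
    }
    String delimiter = fragment.substring(DELIMITER_PREFIX.length());
    if (delimiter.isEmpty()) {
      return false;
    }
    String beforeFragment = uri.substring(0, hash);
    int query = beforeFragment.indexOf('?');
    if (query < 0) {
      return true; // bucket with a delimiter, e.g. cdms3:myBucket#delimiter=/
    }
    String key = beforeFragment.substring(query + 1);
    return key.isEmpty() || key.endsWith(delimiter);
  }

  public static void main(String[] args) {
    System.out.println(isDirectory("cdms3:myBucket#delimiter=/"));
    System.out.println(isDirectory("cdms3:myBucket?myKey/#delimiter=/"));
    System.out.println(isDirectory("cdms3:myBucket?myKey#delimiter=/"));
    System.out.println(isDirectory("cdms3:myBucket"));
  }
}
```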

I wonder if cdms3:myBucket should also be considered a directory?


lesserwhirls commented 1 year ago

I wonder if cdms3:myBucket should also be considered a directory?

That's a good question. I had thought of it like this in terms of a mapping to file systems.

cdms3:myBucket#delimiter=/ was essentially saying that the objects in the bucket were laid out such that they could be interpreted as filesystem paths, with the bucket considered to be the filesystem root "/". Performing a list-objects-v2 request against the bucket with a delimiter of "/" would then be like doing ls /. In the JSON returned from AWS, "Contents": [...] would then essentially contain a list of the "files" directly under "/", while "CommonPrefixes": [...] would contain a list of the directories directly under "/". So in that sense, everything in the bucket is accessible as if it were a set of filesystem paths, without too many surprises. We can then browse the bucket using a combination of delimiter="/" and a prefix (selected from the common prefixes at a given "level" of the bucket).
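A small in-memory sketch may help show how a delimiter and prefix split keys into "Contents" and "CommonPrefixes". This only simulates the grouping over a list of strings; it does not call AWS, and the class name and sample keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class ListObjectsSketch {

  /** Simplified stand-in for the two parts of a list-objects-v2 response discussed above. */
  static final class Listing {
    final List<String> contents = new ArrayList<>();            // "files" at this level
    final SortedSet<String> commonPrefixes = new TreeSet<>();   // "directories" at this level
  }

  /** Groups keys the way a delimiter-based listing does: roll up anything past the next delimiter. */
  static Listing list(Collection<String> keys, String prefix, String delimiter) {
    Listing out = new Listing();
    boolean hasDelimiter = delimiter != null && !delimiter.isEmpty();
    for (String key : new TreeSet<>(keys)) { // iterate in sorted order
      if (!key.startsWith(prefix)) {
        continue;
      }
      int cut = hasDelimiter ? key.indexOf(delimiter, prefix.length()) : -1;
      if (cut < 0) {
        out.contents.add(key); // no further delimiter: a "file" directly under the prefix
      } else {
        out.commonPrefixes.add(key.substring(0, cut + delimiter.length()));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Hypothetical keys, for illustration only.
    List<String> keys = List.of("readme.txt", "model/2023/a.nc", "model/2023/b.nc", "radar/2023/c.nc");
    Listing root = list(keys, "", "/");
    System.out.println("Contents: " + root.contents);
    System.out.println("CommonPrefixes: " + root.commonPrefixes);
  }
}
```

Calling list with a null delimiter puts every matching key into contents, which is the flat, potentially very large listing that results when no delimiter is given.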

Now let's say the objects in the bucket are not laid out such that they could be interpreted as filesystem paths, but you'd still like to treat the bucket as if it were a filesystem (so you could do things like a "scan" operation of some sort...maybe a datasetScan?). In this case there would be one top-level directory with, potentially, a ton of "files" and no subdirectories. The JSON returned from AWS would have "Contents": [...] and that's it. The root directory would exist and would map to cdms3:myBucket, so maybe that should be counted as a directory that exists?

Looking further down the road a bit, treating the bucket as a directory that exists would enable a user to trigger filesystem-like operations that would bring their system to a screeching halt, since listing objects in a bucket without a delimiter and/or prefix can become very expensive (time-wise). For example, using the aws cli, compare aws s3api list-objects-v2 --region us-east-1 --bucket noaa-goes18 --delimiter="/" with aws s3api list-objects-v2 --region us-east-1 --bucket noaa-goes18. Just something to think about.

tdrwenski commented 1 year ago

That makes sense: way too many objects would be listed if a bucket were considered a directory without a delimiter. Thanks @lesserwhirls :)