developmentseed / obstore

Simple, fast integration with Amazon S3, Google Cloud Storage, Azure Storage, and S3-compliant APIs like Cloudflare R2
https://developmentseed.org/obstore
MIT License

Listing objects of `S3` bucket #101

Open DenisaCG opened 6 days ago

DenisaCG commented 6 days ago

Although S3 buckets have a flat structure, they still support the concept of folders to make organization easier. However, when listing the objects in an S3 bucket with the obstore package, the returned list does not clearly differentiate between files and directories.

For example, the following code:

import obstore as obs
from obstore.store import S3Store

s3store = S3Store.from_url(....)  # include needed parameters
stream = obs.list(s3store, chunk_size=10)
for results in stream:
    print(results)

will return the following:

[{'path': 'folder',
  'last_modified': datetime.datetime(2024, 10, 22, 11, 15, 25, tzinfo=datetime.timezone.utc),
  'size': 0,
  'e_tag': '"123456789"',
  'version': None},
 {'path': 'file.txt',
  'last_modified': datetime.datetime(2024, 11, 18, 18, 47, 45, tzinfo=datetime.timezone.utc),
  'size': 0,
  'e_tag': '"987654321"',
  'version': None}]

Because directory names in S3 buckets are allowed to contain a ., the name alone cannot tell directories and files apart. Size does not help either: a folder in S3 is always an object of size 0, as it only mimics a directory, but an empty file has size 0 too.

Typically, directories have a trailing slash to differentiate them from other objects (note the red box marked important in their docs here), but this seems to have been removed in the obstore listing.
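
As a rough illustration of the trailing-slash convention, here is a minimal sketch in plain Python. It assumes access to the raw S3 keys (trailing slash intact) and their sizes. The is_directory_marker helper and the sample keys are hypothetical and are not part of obstore.

```python
def is_directory_marker(key: str, size: int) -> bool:
    """Return True for a zero-byte object whose raw key ends with '/'.

    Hypothetical helper for illustration only; it encodes the S3 console
    convention for folder markers, not anything obstore does.
    """
    return size == 0 and key.endswith("/")


# Sample (key, size) pairs, made up for this sketch.
raw_entries = [
    ("folder/", 0),    # folder marker
    ("file.txt", 0),   # empty file: same size, no trailing slash
    ("v1.2/", 0),      # folder with a dot in its name
    ("data.bin", 42),  # regular non-empty file
]

markers = [key for key, size in raw_entries if is_directory_marker(key, size)]
print(markers)
```

Once the trailing slash is stripped, "folder" and an empty file named "folder" become indistinguishable by path and size alone, which is the problem described above.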

Is there any other way that directories are supposed to be differentiated? Or is this something that can be fixed?

Thank you!

kylebarron commented 6 days ago

You can use list_with_delimiter to list files within a prefix. Does that answer your question?

DenisaCG commented 6 days ago

Thanks for your quick reply!

So, the only way to check if an object is a directory is to use the list_with_delimiter function and check if it exists within the common prefixes? I was using the list functionality as it has support for Arrow lists for larger results, but I see that is not possible with the list_with_delimiter function.

When using the list_with_delimiter function, the common prefixes returned only include directories that are not empty, so empty directories do not appear as common prefixes - I just tried it locally. Is this the expected behaviour?
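
For readers unfamiliar with the term, the sketch below is a rough pure-Python model of how delimiter-based listing groups a flat set of keys into common prefixes and direct objects. It is an assumption-laden illustration of the general S3-style semantics: it does not call obstore or object_store, the function name list_with_delimiter_model is hypothetical, and it makes no attempt to reproduce how either library handles zero-byte folder markers.

```python
def list_with_delimiter_model(keys, prefix="", delimiter="/"):
    """Group flat keys under `prefix` into (common_prefixes, objects).

    A key whose remainder after `prefix` still contains the delimiter is
    rolled up into a common prefix; otherwise it is a direct object.
    """
    common_prefixes = []
    objects = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Delimiter found: report only the first path segment.
            common = prefix + head + delimiter
            if common not in common_prefixes:
                common_prefixes.append(common)
        else:
            objects.append(key)
    return common_prefixes, objects


# Sample keys, made up for this sketch.
keys = ["file.txt", "folder/a.txt", "folder/sub/b.txt", "other/c.txt"]

top_level = list_with_delimiter_model(keys)
inside_folder = list_with_delimiter_model(keys, prefix="folder/")
print(top_level)
print(inside_folder)
```

In this model a common prefix exists only because at least one key sits beneath it, so a "directory" is inferred from its contents rather than stored as an entry of its own.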

kylebarron commented 6 days ago

obstore is a direct wrapper of the underlying object_store Rust crate. I'm not entirely sure about some of the specifics of how the underlying crate works.

kylebarron commented 6 days ago

You might want to open an issue or ask in the Discord of the arrow-rs project, which also manages the object_store crate.

kylebarron commented 6 days ago

I asked on discord here: https://discord.com/channels/885562378132000778/885562378132000781/1310704559727054918