fsspec / universal_pathlib

pathlib api extended to use fsspec backends
MIT License

S3 path resolution extra slash when joined to bucket with key #167

Closed theogaraj closed 4 months ago

theogaraj commented 6 months ago

Which operating system and Python version are you using? Windows 11, Python 3.9.6

Which version of this project are you using? 0.1.4

What did you do?

  1. Created a UPath from an S3 URI of a bucket with key suffix and trailing slash
  2. Used the / operator to create a new S3 path by joining a string to the original UPath
    >>> from upath import UPath
    >>> bucket_with_key = UPath('s3://mybucket/withkey/')   # bucket and key with trailing slash
    >>> subpath_new = bucket_with_key / 'subfolder/myfile.txt'
    >>> subpath_new
    S3Path('s3://mybucket/withkey//subfolder/myfile.txt')

What did you expect to see? I would expect to see a single slash between s3://mybucket/withkey and subfolder/myfile.txt

What did you see instead? Resultant S3Path has a double slash: S3Path('s3://mybucket/withkey//subfolder/myfile.txt')
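
For comparison, here is a minimal sketch of how the standard library's pure paths treat the same join. It uses only `pathlib` and a plain POSIX-style path standing in for the bucket and key (an assumption for illustration; no S3 access or `upath` import is involved), and it shows the single-slash behaviour the report expects:

```python
from pathlib import PurePosixPath

# A plain POSIX path stands in for "bucket + key" here; this is only an analogy
# to the S3 case, not a call into universal_pathlib.
base = PurePosixPath("/mybucket/withkey/")

# pathlib drops the trailing slash when it parses the path ...
print(base)

# ... so joining with "/" yields exactly one separator between the parts.
joined = base / "subfolder/myfile.txt"
print(joined)
```

Since the project describes itself as the pathlib API extended to fsspec backends, this is the behaviour a user would reasonably carry over to S3 paths.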

Additional info: This works as expected when I have just a bucket, or a bucket and key without a trailing slash.

Just a bucket:

>>> from upath import UPath
>>> bucketpath = UPath('s3://mybucket/')        # trailing slash with bucket only
>>> subpath = bucketpath / 'subfolder/myfile.txt'
>>> subpath
S3Path('s3://mybucket/subfolder/myfile.txt')

Bucket and key but no trailing slash:

>>> from upath import UPath
>>> bucket_with_key = UPath('s3://mybucket/withkey')  # bucket and key but no trailing slash
>>> subpath_new = bucket_with_key / 'subfolder/myfile.txt'
>>> subpath_new
S3Path('s3://mybucket/withkey/subfolder/myfile.txt')

theogaraj commented 6 months ago

While continuing to work with UPath I noticed another problem related to trailing slashes. I don't know if this has the same underlying problem as the issue I described above or if it should be its own separate thing, but here's what I'm seeing...

I'm attempting to glob over a directory (either local or on S3). I've defined a UPath called files_location, and I'm attempting to iterate over all the files with: for filepath in files_location.glob('*.json')

ap-- commented 5 months ago

Thank you for reporting. Handling double slashes in s3 is still an open issue.

While such keys are supported on the s3 side, the few cases I've seen so far were keys created unintentionally, due to bugs in the scripts that copied the files to s3.

Nevertheless, we need to add better support for handling those cases where you want to access existing s3 buckets which are not under your control.
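
Until that support lands, one possible user-side workaround is to normalise only the join boundary on the URI string before constructing a path. The sketch below is an assumption-laden illustration: `join_uri` is a hypothetical helper, not part of universal_pathlib or fsspec. It trims slashes only where the pieces meet, so a double slash that already exists inside a key is left untouched and stays addressable.

```python
def join_uri(base: str, *parts: str) -> str:
    """Join URI pieces with exactly one slash at each join boundary.

    Hypothetical helper for illustration only; not a universal_pathlib API.
    Slashes inside `base` or inside a part are preserved as-is.
    """
    prefix, sep, rest = base.partition("://")
    if not sep:
        # No scheme found: treat the whole string as the path portion.
        prefix, rest = "", base
    segments = [rest.rstrip("/")]
    segments += [part.strip("/") for part in parts]
    # Skip empty segments so an empty part does not add a stray slash.
    return prefix + sep + "/".join(s for s in segments if s)


# Trailing slash on the base no longer produces a double slash.
fixed = join_uri("s3://mybucket/withkey/", "subfolder/myfile.txt")
print(fixed)

# An existing double slash inside the key is deliberately kept.
kept = join_uri("s3://mybucket/a//b/", "c.txt")
print(kept)
```

The resulting string can then be passed to UPath as usual. The helper deliberately does not collapse interior slashes, because `a//b` and `a/b` are different keys on S3.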

Would you be interested in creating a PR with a testcase for your specific issue?

Cheers, Andreas