fsspec / universal_pathlib

pathlib api extended to use fsspec backends
MIT License
251 stars 44 forks source link

URI query component is ignored when opening a file #92

Closed joouha closed 9 months ago

joouha commented 1 year ago

When opening a resource using universal pathlib, the query component of the URI is dropped, resulting in the wrong resource being opened.

For example, running the following:

>>> UPath("https://duckduckgo.com/?q=upath").read_text()

makes a GET request for https://duckduckgo.com/ without the query parameter, giving unexpected output.

ap-- commented 1 year ago

Hi @joouha

Currently we throw away query and fragments here: https://github.com/fsspec/universal_pathlib/blob/5d844b274d058819af6f900cd190688251a15a3b/upath/core.py#L179

But since only the first argument to UPath is parsed, there's one way to construct the path right now.

>>> upath.UPath("http://example.com/a")
HTTPPath('http://example.com/a')

>>> upath.UPath("http://example.com/", "a?q=123")
HTTPPath('http://example.com/a?q=123')

but joining on this doesn't work correctly, since it just treats it like a filename.

I think the correct way to fix this is adding query and fragment in https://github.com/fsspec/universal_pathlib/blob/5d844b274d058819af6f900cd190688251a15a3b/upath/core.py#L213 if they are present.

And then there needs to be some code that updates query and fragments whenever we join paths. (to do joins correctly this will also require fixing #88)

Would you like to work on a PR? I'll only have time to work on this in about 2-3 weeks from now.

Cheers, Andreas

drernie commented 1 year ago

@ap-- I have an urgent use case for this.

We have a commercial project that tracks different versions of S3 objects using a query parameter::

"s3://udp-spec/manual/force/README.md?versionId=B0zNSMELW__87yfYtZcfcdBw3qxHkFhm"

I know s3.open can handle it:

fo_old_version = s3.open('versioned_bucket/object', version_id='SOMEVERSIONID')

The naive solution is to pass any query parameters as kwargs to open, but I don't know if that might cause problems for others.

Should I just create a PR against this issue solve my use case, then work with you to figure out how to generalize it?

Or would it be wiser to define a general mechanism for passing "open" parameters, and manually extract the query on my side first?

ap-- commented 1 year ago

Hi @drernie,

The issue you are linking to seems to be in a private repository.

upath should already allow you to do this, if you are willing to handle the query parameter yourself:

from upath import UPath

pth = UPath("s3://udp-spec/manual/force/README.md", version_aware=True)
with pth.open(version_id="B0zNSMELW__87yfYtZcfcdBw3qxHkFhm") as f:
    data = f.read()

Let me know if that helps!

I'm currently working on a larger refactor to address a few other important issues, and would then address a common way for handling query parameters. If you'd like to contribute, I would ping you once that is ready.

Cheers, Andreas

drernie commented 1 year ago

Ouch. I just realized that UPath("s3://udp-spec/manual/force/README.md?versionId=B0zNSMELW__87yfYtZcfcdBw3qxHkFhm") drops the query string.

This seems to imply I have to manually track the version ID everywhere. Is that correct, or is there a way to attach attributes to the UPath?

drernie commented 1 year ago

Did this get somehow addressed in #125 ?