fsspec / adlfs

fsspec-compatible Azure Datalake and Azure Blob Storage access
BSD 3-Clause "New" or "Revised" License
175 stars, 104 forks

flexible url handling #259

Open umashankark opened 3 years ago

umashankark commented 3 years ago

AzureBlobFileSystem._strip_protocol('abfs://container/path-part/file') returns 'container/path-part/file'.

AzureBlobFileSystem._strip_protocol('abfs://container@account.dfs.core.windows.net/path-part/file') returns 'account.dfs.core.windows.net/path-part/file', where 'container/path-part/file' needs to be returned.

Supporting the return pattern above will let applications (for example, those that work with both Spark and fsspec) use the same URL for data access.

umashankark commented 3 years ago

This _strip_protocol() implementation handles such inputs:

        STORE_SUFFIX = '.dfs.core.windows.net'
        if not path.startswith('abfs://'):
            path = path.lstrip("/")
            path = 'abfs://' + path
        ops = infer_storage_options(path)
        if "username" in ops:
            if ops.get("username", None):
                ops["path"] = ops["username"] + ops["path"]
        elif ops.get("host", None):
            if ops["host"].count(STORE_SUFFIX) == 0: #no store-suffix, so this is container-name
                ops["path"] = ops["host"] + ops["path"]
        return ops["path"]

Please let me know if a PR can be created with the above change.

hayesgb commented 3 years ago

The above would be a welcome improvement. I would revise line #2 as follows to support the "az://" protocol as well.

if not path.startswith(("abfs://", "az://")):

It would be great if you included a unit test to validate the use case included in the proposed fix as well.
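For reference, the proposal with the "az://" revision and a test could be sketched roughly as below. This is a standalone approximation, not the adlfs code: `strip_abfs_protocol` and `test_strip_abfs_protocol` are hypothetical names, it uses the standard library's `urlsplit` in place of `infer_storage_options`, and it checks the suffix with `endswith` rather than `count`.

```python
from urllib.parse import urlsplit

# Suffix for Data Lake (dfs) endpoints, taken from the proposal above.
STORE_SUFFIX = ".dfs.core.windows.net"
PROTOCOLS = ("abfs://", "az://")


def strip_abfs_protocol(path: str) -> str:
    """Return 'container/path' for short and account-qualified URLs.

    Hypothetical standalone helper, not the adlfs implementation.
    """
    if not path.startswith(PROTOCOLS):
        # Bare paths get a scheme so urlsplit can locate the netloc.
        path = "abfs://" + path.lstrip("/")
    parts = urlsplit(path)
    # netloc is 'container@account.dfs.core.windows.net' or just 'container'.
    user, _, host = parts.netloc.rpartition("@")
    if user:
        # Account-qualified form: the part before '@' is the container.
        return user + parts.path
    if host and not host.endswith(STORE_SUFFIX):
        # No store suffix, so the host is the container name.
        return host + parts.path
    return parts.path


def test_strip_abfs_protocol():
    expected = "container/path-part/file"
    for url in (
        "abfs://container/path-part/file",
        "az://container/path-part/file",
        "abfs://container@account.dfs.core.windows.net/path-part/file",
        "container/path-part/file",
        "/container/path-part/file",
    ):
        assert strip_abfs_protocol(url) == expected, url
```

The test is a plain function with asserts, so it can be picked up by pytest or called directly.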

umashankark commented 3 years ago

Sure, @hayesgb. Will raise a PR.

lostmygithubaccount commented 3 years ago

to address a related issue - translating between Spark and Pandas/Dask - any objection to adding an alias abfss in addition to az?

hayesgb commented 3 years ago

I was thinking about this a little more. fsspec implements a _get_kwargs_from_url which might be ideal here https://github.com/intake/filesystem_spec/blob/ee22435bc57bd9158103415c5fc58c3cbdddebf2/fsspec/spec.py#L199

hayesgb commented 3 years ago

I’m open to adding it. Just curious as to why we would need “abfss://” in addition to the existing protocols?

On Jul 26, 2021, at 10:20 PM, Cody @.***> wrote:

 to address a related issue - translating between Spark and Pandas/Dask - any objection to adding an alias abfss in addition to az?

umashankark commented 3 years ago

I was thinking about this a little more. fsspec implements a _get_kwargs_from_url which might be ideal here https://github.com/intake/filesystem_spec/blob/ee22435bc57bd9158103415c5fc58c3cbdddebf2/fsspec/spec.py#L199

Can you please elaborate on this idea?
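One possible reading of the suggestion quoted above: besides stripping the protocol, the account part of the URL could be turned into constructor arguments, so that a single Spark-style URL carries everything needed to build the filesystem. The sketch below only illustrates that idea. `get_kwargs_from_url` is a hypothetical standalone function, and the `account_name` key is an assumption about what the filesystem constructor accepts. The hook in the linked spec.py appears to be spelled `_get_kwargs_from_urls`.

```python
from urllib.parse import urlsplit


def get_kwargs_from_url(url: str) -> dict:
    """Derive constructor kwargs from an account-qualified URL.

    Hypothetical helper; 'account_name' is an assumed keyword.
    """
    netloc = urlsplit(url).netloc
    _, sep, host = netloc.rpartition("@")
    if not sep:
        # 'abfs://container/...' carries no account information.
        return {}
    # 'account.dfs.core.windows.net' -> 'account'
    return {"account_name": host.split(".", 1)[0]}


kwargs = get_kwargs_from_url(
    "abfs://container@account.dfs.core.windows.net/path-part/file"
)
print(kwargs)
```

A URL without an `@` part yields an empty dict here, so credentials and account would still have to come from explicit arguments in that case.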

hayesgb commented 3 years ago

I added this into master with #271. Also, @lostmygithubaccount, I added the ability to register the abfss protocol by importing the package into the local namespace, but it will take a PR to fsspec to have the abfss protocol registered there.

Out of curiosity, is there an interest in using adlfs for Spark with Azure, or is this more about improving cross-code compatibility between Dask and Spark?

umashankark commented 3 years ago

Thanks for the update, @hayesgb.