Open aucampia opened 1 year ago
It would be great to have support for abfss too!
What's the actual work to be done here? IIUC, we just need to parse that URI and extract the account URL and container name used by the existing implementation? Can you use regular azure-storage-blob to work with these kinds of containers? Or do we need to use another API to talk to Azure?
If we can use azure.storage.blob, then I think we would need to change AzureBlobFileSystem.__init__ to also accept being called with this URI type. That's probably the most convenient for users but might be a bit tricky to implement (it kinda clashes with the current implementation). Maybe it'd be best to have a separate FileSystem class that handles this URI, which internally uses AzureBlobFileSystem?
abfs[s] as prefixes with fsspec
Faced the same issue the other day, so I amended the AzureBlobFileSystem._strip_protocol method to be able to handle the Azure Blob Storage host name. Here's a suggestion:
    def _strip_protocol(cls, path: str):
        """
        Remove the protocol from the input path

        Parameters
        ----------
        path: str
            Path to remove the protocol from

        Returns
        -------
        str
            Returns a path without the protocol
        """
        if isinstance(path, list):
            return [cls._strip_protocol(p) for p in path]

        STORE_SUFFIXES = [".blob.core.windows.net", ".dfs.core.windows.net"]
        logger.debug(f"_strip_protocol for {path}")
        if not path.startswith(("abfs://", "az://", "abfss://")):
            path = path.lstrip("/")
            path = "abfs://" + path
        ops = infer_storage_options(path)
        if "username" in ops:
            if ops.get("username", None):
                ops["path"] = ops["username"] + ops["path"]
        # we need to make sure that the path retains
        # the format {host}/{path}
        # here host is the container_name
        elif ops.get("host", None):
            # no store-suffix, so this is container-name
            if not any(ops["host"].endswith(s) for s in STORE_SUFFIXES):
                ops["path"] = ops["host"] + ops["path"]
        url_query = ops.get("url_query")
        if url_query is not None:
            ops["path"] = f"{ops['path']}?{url_query}"

        logger.debug(f"_strip_protocol({path}) = {ops}")
        stripped_path = ops["path"].lstrip("/")
        return stripped_path
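To make the intended behaviour of the suggestion above easier to check, here is a rough standalone sketch of the same branching. It is an approximation, not the adlfs code: it uses the standard library's `urlsplit` as a stand-in for fsspec's `infer_storage_options`, and the function name `strip_protocol_sketch` is made up for illustration.

```python
from urllib.parse import urlsplit

# Same suffixes and prefixes as in the suggestion above.
STORE_SUFFIXES = (".blob.core.windows.net", ".dfs.core.windows.net")
PROTOCOLS = ("abfs://", "az://", "abfss://")


def strip_protocol_sketch(path: str) -> str:
    """Hypothetical stand-in mirroring the suggested _strip_protocol logic."""
    if not path.startswith(PROTOCOLS):
        path = "abfs://" + path.lstrip("/")

    parts = urlsplit(path)
    # Host without any "user@" prefix or ":port" suffix.
    host = parts.netloc.rsplit("@", 1)[-1].rsplit(":", 1)[0]
    stripped = parts.path

    if parts.username:
        # container@account.dfs.core.windows.net: the user part is the container.
        stripped = parts.username + stripped
    elif host and not host.endswith(STORE_SUFFIXES):
        # No store suffix, so the host itself is the container name.
        stripped = host + stripped

    if parts.query:
        stripped = f"{stripped}?{parts.query}"
    return stripped.lstrip("/")


if __name__ == "__main__":
    for example in (
        "abfss://container@account.dfs.core.windows.net/dir/file.csv",
        "abfs://container/dir/file.csv",
        "abfss://account.blob.core.windows.net/container/file.csv",
        "container/dir/file.csv",
    ):
        print(example, "->", strip_protocol_sketch(example))
```

With both suffixes in the list, a host such as `account.blob.core.windows.net` is treated as the storage account rather than as a container name, which is the case the amended method is meant to cover.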
Azure Data Lake Storage Gen2 URIs are described as follows [ref]:
This supports the storage account name in the URI, which makes it much more versatile than having to provide it out of band.
Would you be open to support for this?
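As a rough illustration of what carrying the account name in the URI buys, here is a minimal parsing sketch. It assumes the `abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>` form from the Azure documentation. The names `AbfsLocation` and `parse_abfs_uri` are hypothetical and not part of adlfs or fsspec.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class AbfsLocation:
    """Hypothetical container for the pieces of an abfs[s] URI."""

    container: str      # the <file_system> part before the "@"
    account_name: str   # first label of the host
    host: str           # full storage host as written in the URI
    path: str           # path inside the container, without a leading "/"


def parse_abfs_uri(uri: str) -> AbfsLocation:
    """Split an abfs[s] URI into container, account and path (sketch only)."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an abfs[s] URI: {uri!r}")
    if not parts.username or not parts.hostname:
        raise ValueError(f"expected <file_system>@<account host> in {uri!r}")

    host = parts.hostname
    return AbfsLocation(
        container=parts.username,
        account_name=host.split(".", 1)[0],
        host=host,
        path=parts.path.lstrip("/"),
    )


if __name__ == "__main__":
    print(parse_abfs_uri("abfss://data@myaccount.dfs.core.windows.net/raw/file.parquet"))
```

Everything a filesystem needs to locate the data (container, account, path) comes out of the one string, so nothing has to be passed separately.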