fsspec / adlfs

fsspec-compatible Azure Data Lake and Azure Blob Storage access
BSD 3-Clause "New" or "Revised" License

Existing file marked as non-existing #265

Open · lmeyerov opened this issue 3 years ago

lmeyerov commented 3 years ago

What happened:

fs.isfile(existing_file_path) incorrectly returns False and gives a warning

EDIT: Output is

False
RuntimeWarning: coroutine 'AzureBlobFileSystem._details' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

What you expected to happen:

Return True without a warning

Minimal Complete Verifiable Example:

import adlfs  # imported so the 'abfs' protocol is registered with fsspec
import fsspec
import os

storage_options = {
    'account_name': os.environ['AZ_STORAGE_ACCOUNT_NAME'],
    'account_key': os.environ['AZ_STORAGE_ACCOUNT_KEY'],
}
az_storage_container_name = os.environ['AZ_STORAGE_CONTAINER_NAME']
fs = fsspec.filesystem('abfs', **storage_options)

dataset_id = 'myfile'  # placeholder: name of a blob known to exist
base_path = f'abfs://{az_storage_container_name}/data/datasets'
existing_file_path = f'{base_path}/{dataset_id}'

fs.isfile(existing_file_path)  # incorrectly returns False, with the RuntimeWarning above

Anything else we need to know?:

Environment:

- fsspec 2021.07.0 (conda)
- adlfs 2021.08.1 (pip; no conda package yet)
- Docker / Ubuntu 18.04 / Python 3.7

lmeyerov commented 3 years ago

@hayesgb Digging a bit more: switching to asynchronous=True and calling await fs._isfile(existing_file_path) does not work around the issue; the warning still triggers and the wrong result is still returned.
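
For completeness, a minimal sketch of the async variant I tried, assuming the same storage_options and existing_file_path as in the MCVE above (the asyncio scaffolding here is mine):

import asyncio

import fsspec

async def check_isfile(path):
    # asynchronous=True returns fsspec's async API; _isfile is the
    # coroutine counterpart of isfile
    fs = fsspec.filesystem('abfs', asynchronous=True, **storage_options)
    return await fs._isfile(path)

print(asyncio.run(check_isfile(existing_file_path)))  # still False, still warns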

lmeyerov commented 3 years ago

@hayesgb (Continuing from https://github.com/dask/adlfs/issues/261)

Just tried from HEAAD:

lmeyerov commented 3 years ago

Also if it helps, my paths look like:

abfs://somecontainer/mydata/mydata2/myfile

hayesgb commented 3 years ago

Would you mind posting the result of fs.details("somecontainer/mydata/mydata2/abc")?


lmeyerov commented 3 years ago

AttributeError: 'AzureBlobFileSystem' object has no attribute 'details'

lmeyerov commented 3 years ago

FYI, having more luck with variants of:

from azure.storage.blob.aio import BlobServiceClient

async def aexists_dir(path):
    # conn_str and az_storage_container_name are defined elsewhere in my setup
    blob_service_client = BlobServiceClient.from_connection_string(conn_str)
    async with blob_service_client:
        container_client = blob_service_client.get_container_client(az_storage_container_name)
        async for myblob in container_client.list_blobs(name_starts_with=path):
            # a listing hit whose name differs from the path itself means
            # the path is a directory-like prefix rather than a blob
            return myblob['name'] != path
    return False
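
A hypothetical driver for the sketch above (asyncio.run and the example prefix are mine):

import asyncio

print(asyncio.run(aexists_dir('mydata/mydata2')))  # True when blobs exist under the prefix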

hayesgb commented 3 years ago

Thanks. I may end up updating to this. I asked about details earlier, but could you post the result of fs.info(path)? I'm trying to create a test case for this.

lmeyerov commented 3 years ago

{
  "metadata": None,
  "creation_time": datetime.datetime(2020, 9, 29, 0, 16, 6, tzinfo=datetime.timezone.utc),
  "deleted": None,
  "deleted_time": None,
  "last_modified": datetime.datetime(2021, 8, 13, 15, 35, 35, tzinfo=datetime.timezone.utc),
  "content_settings": {
    "content_type": "application/x-gzip",
    "content_encoding": None,
    "content_language": None,
    "content_md5": bytearray(b"*****"),
    "content_disposition": None,
    "cache_control": None
  },
  "remaining_retention_days": None,
  "archive_status": None,
  "last_accessed_on": None,
  "etag": "*****",
  "tags": None,
  "tag_count": None,
  "name": "mycontainer/myfolder/myfile",
  "size": 4332,
  "type": "file"
}

hayesgb commented 3 years ago

Thanks for the help here @lmeyerov Release 2021.08.2 should fix the errors with isfile.
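
Once you upgrade, rerunning the check from your MCVE should come back clean; this is the expected behavior, not a verified log (same storage_options and existing_file_path as above):

import fsspec

fs = fsspec.filesystem('abfs', **storage_options)
print(fs.isfile(existing_file_path))  # expect True, with no RuntimeWarning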

hayesgb commented 3 years ago

Can you share an example of the slow isdir behavior? isdir does call the container client's list_blobs under the hood. Are there a very large number of blobs in the location you're scanning?

lmeyerov commented 3 years ago

Yes, it's a potentially big folder (named parquet dumps); in this case I wouldn't be surprised if it's 1K-10K files. I think the async list_blobs paginates, though I'm unsure how to ensure the pages stay reasonably small (see the sketch below). That's part of the reason we're trying to do only asyncio with adlfs: ensuring even occasional blips won't starve out other tasks.
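
On the page-size question, a hedged sketch of what I mean, driving the Azure SDK's async pager directly (results_per_page and this helper are mine, not adlfs API; conn_str and az_storage_container_name as in my earlier snippet):

from azure.storage.blob.aio import BlobServiceClient

async def first_page(prefix, page_size=100):
    # results_per_page caps how many blobs the service returns per round trip;
    # by_page() exposes those pages so a single probe stays small
    async with BlobServiceClient.from_connection_string(conn_str) as service:
        container = service.get_container_client(az_storage_container_name)
        pager = container.list_blobs(name_starts_with=prefix,
                                     results_per_page=page_size).by_page()
        async for page in pager:
            return [blob.name async for blob in page]  # only the first page
    return []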

hayesgb commented 3 years ago

@lmeyerov -- I just refactored _isdir on the accel_isdir branch. It passes all the tests, and completely eliminates the list_blobs call. Would appreciate your feedback if you have a chance to check it out.

lmeyerov commented 3 years ago

Sure -- will check on Th/F (am traveling)

At the same time, if anything is in the works around async multi-connection downloads of individual and folder blobs, I'm happy to help check there too. We're currently investigating how to do it via Azure's SDK, but we'd rather have it unified under fsspec!

hayesgb commented 3 years ago

Cool. Just curious -- on the multi-connection downloads -- are you looking to use Dask or is the use case async multithreading?

lmeyerov commented 3 years ago
1. Currently single-node / multicore. Our Azure GPU VMs have something like 2-8 NICs with 8-32 Gbps, and I think AWS/GCP end up similar, so we're focusing on saturating abfs => SSD writes with that. Multi-node may be interesting early next year, but we're not there yet :)

RE: async multithreading, the Azure SDK has parallel-connection support with a configurable # of streams, which seems like a fine first step (see the sketch after this list).

2. Our other common use case is reading directly from dask_cudf.read_parquet, which may have some funny NUMA behavior to consider for remote reads, but I'm not sure yet. Local reads are via GPU Direct Storage, and I believe there may be network extensions for GPU Direct as well...
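
For concreteness, a hedged sketch of that parallel-stream knob: max_concurrency on the SDK's download_blob splits a transfer into concurrent ranged reads. The connection string, container, blob name, and local path here are placeholders, not from this thread:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client('somecontainer', 'mydata/mydata2/myfile')
with open('/mnt/ssd/myfile', 'wb') as f:
    # max_concurrency controls how many parallel streams fetch blob ranges
    blob.download_blob(max_concurrency=8).readinto(f)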