lmeyerov opened this issue 3 years ago
@hayesgb Digging a bit more, switching to `asynchronous=True` ... `await fs._isfile(existing_file_path)` does not work around the issue: the warning still triggers and the wrong result still gets returned.
@hayesgb (Continuing from https://github.com/dask/adlfs/issues/261)

Just tried from head:

- `isfile()` is quickly & incorrectly returning False; no async warning anymore
- `isdir()` is slowly but correctly returning True; I suspect it is downloading the folders
- `isfile()` is quickly & correctly returning False
- `isdir()` is quickly & correctly returning False
Also if it helps, my paths look like:
abfs://somecontainer/mydata/mydata2/myfile
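For context, the `abfs://` protocol prefix and the container name get split off before the blob key. A minimal sketch of that split (`split_abfs` is a hypothetical helper for illustration, not adlfs's actual parser):

```python
def split_abfs(url):
    # "abfs://container/key/parts" -> (container, "key/parts")
    # Hypothetical helper; adlfs does its own protocol stripping.
    assert url.startswith("abfs://")
    rest = url[len("abfs://"):]
    container, _, key = rest.partition("/")
    return container, key

assert split_abfs("abfs://somecontainer/mydata/mydata2/myfile") == (
    "somecontainer",
    "mydata/mydata2/myfile",
)
```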
Would you mind posting the result of: `fs.details("somecontainer/mydata/mydata2/abc")`
On Aug 15, 2021, at 6:07 PM, lmeyerov wrote:
`AttributeError: 'AzureBlobFileSystem' object has no attribute 'details'`
FYI, having more luck with variants of:

```python
# Assumes `conn_str` and `az_storage_container_name` are defined elsewhere.
from azure.storage.blob.aio import BlobServiceClient

async def aexists_dir(path):
    blob_service_client = BlobServiceClient.from_connection_string(conn_str)
    async with blob_service_client:
        container_client = blob_service_client.get_container_client(az_storage_container_name)
        async for myblob in container_client.list_blobs(name_starts_with=path):
            # First listed name under the prefix that is not exactly `path`
            # means `path` behaves like a directory.
            return myblob['name'] != path
    return False
```
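The prefix test in that loop can be sanity-checked locally without any Azure calls. Here is a pure-Python stand-in for the `list_blobs` scan (names and helper are invented for illustration):

```python
def exists_dir_local(blob_names, path):
    # Mirror of the async loop above: walk names under the prefix in
    # listing order; the first hit that is not exactly `path` means
    # `path` acts as a directory prefix.
    for name in sorted(blob_names):
        if name.startswith(path):
            return name != path
    return False

assert exists_dir_local(["c/mydata/f1", "c/mydata/f2"], "c/mydata") is True
assert exists_dir_local(["c/mydata"], "c/mydata") is False   # plain blob, not a dir
assert exists_dir_local([], "c/mydata") is False             # nothing under the prefix
```

Note the same subtlety as the Azure version: because listing is lexicographic, a blob named exactly `path` sorts before anything nested under it, so a plain file correctly returns False.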
Thanks. I may end up updating to this. I asked about details earlier, but could you post the result of `fs.info(path)`? Trying to create a test case for this.
```python
{
    "metadata": None,
    "creation_time": datetime.datetime(2020, 9, 29, 0, 16, 6, tzinfo=datetime.timezone.utc),
    "deleted": None,
    "deleted_time": None,
    "last_modified": datetime.datetime(2021, 8, 13, 15, 35, 35, tzinfo=datetime.timezone.utc),
    "content_settings": {
        "content_type": "application/x-gzip",
        "content_encoding": None,
        "content_language": None,
        "content_md5": bytearray(b"*****"),
        "content_disposition": None,
        "cache_control": None
    },
    "remaining_retention_days": None,
    "archive_status": None,
    "last_accessed_on": None,
    "etag": "*****",
    "tags": None,
    "tag_count": None,
    "name": "mycontainer/myfolder/myfile",
    "size": 4332,
    "type": "file"
}
```
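Given output like the above, an `isfile`-style check reduces to inspecting the `type` field of the info dict. A hypothetical helper (not adlfs internals) sketching that:

```python
def looks_like_file(info):
    # fsspec-style info dicts carry a "type" of "file" or "directory".
    return info.get("type") == "file"

info = {"name": "mycontainer/myfolder/myfile", "size": 4332, "type": "file"}
assert looks_like_file(info) is True
assert looks_like_file({"name": "mycontainer/myfolder", "type": "directory"}) is False
```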
Thanks for the help here @lmeyerov. Release 2021.08.2 should fix the errors with `isfile`.

Can you share an example of the slowly downloading `isdir`? This does call `cc.list_blobs`. Are there a very large number of blobs in the location you're scanning?
Yes - it's a potentially big folder (named parquet dumps), in this case I wouldn't be surprised if 1K-10K files. I think async list_files paginates, though I'm unsure of how to ensure that's reasonably small. That's part of the reason we're trying to only do asyncio w/ adlfs, ensuring even occasional blips will not starve out other tasks.
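Why listing hurts here: a directory check only needs the first matching entry, so an early return keeps the check cheap even over 1K-10K blobs, while exhaustive enumeration pages through everything. A local sketch with a fake async lister (no Azure; `fake_list_blobs` is invented as a stand-in for `container_client.list_blobs(name_starts_with=...)`):

```python
import asyncio

async def fake_list_blobs(prefix):
    # Stand-in for the paginated Azure listing: 10K blob names under prefix.
    for i in range(10_000):
        yield f"{prefix}/part-{i}.parquet"

async def isdir_fast(prefix):
    # Return on the first listed name instead of enumerating all 10K,
    # so other asyncio tasks are not starved by a long scan.
    async for name in fake_list_blobs(prefix):
        return name != prefix
    return False

print(asyncio.run(isdir_fast("mydata")))  # True
```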
@lmeyerov -- I just refactored _isdir on the accel_isdir branch. It passes all the tests, and completely eliminates the list_blobs call. Would appreciate your feedback if you have a chance to check it out.
Sure -- will check on Th/F (am traveling)
At the same time, if anything around async multi-connection downloads of indiv + folder blobs, happy to check there. Currently investigating how to do via az's SDK, but we rather have unified under fsspec!
Cool. Just curious -- on the multi-connection downloads -- are you looking to use Dask or is the use case async multithreading?
RE: async multithreading, the az SDK has parallel connection support with a configurable # of streams, which seems like a fine first step. `dask_cudf.read_parquet` may have some funny NUMA behavior to consider for remote reads, but not sure yet. Local reads are via GPU Direct Storage, and I believe there may be network extensions for GPU Direct as well...
What happened:
`fs.isfile(existing_file_path)` incorrectly returns False and gives a warning.
EDIT: Output is

What you expected to happen:
Return `True` without a warning.

Minimal Complete Verifiable Example:

Anything else we need to know?:

Environment:
- fsspec '2021.07.0' (conda)
- adlfs '2021.08.1' (pip, no conda yet)
- docker / ubuntu 18.04 / python 3.7