fsspec / adlfs

fsspec-compatible Azure Datalake and Azure Blob Storage access
BSD 3-Clause "New" or "Revised" License

Recursive remove do not work with newest versions #389

Closed PeterFogh closed 10 months ago

PeterFogh commented 1 year ago

Hi, my old conda environment:

adlfs                     2022.11.2          pyhd8ed1ab_0    conda-forge
azure-core                1.26.1             pyhd8ed1ab_0    conda-forge
azure-datalake-store      0.0.51             pyh9f0ad1d_0    conda-forge
azure-identity            1.12.0             pyhd8ed1ab_0    conda-forge
azure-storage-blob        12.14.1            pyhd8ed1ab_0    conda-forge
fsspec                    2022.10.0       py310haa95532_0

runs the following without any error

import adlfs
import azure.identity.aio
AZURE_FS = adlfs.AzureBlobFileSystem(
    account_name=STORAGE_ACCOUNT,
    credential=azure.identity.aio.DefaultAzureCredential())
AZURE_FS.rm(PATH, recursive=True)

But after updating to these package versions:

adlfs                     2023.1.0           pyhd8ed1ab_0    conda-forge
azure-core                1.26.2             pyhd8ed1ab_0    conda-forge
azure-datalake-store      0.0.51             pyh9f0ad1d_0    conda-forge
azure-identity            1.12.0             pyhd8ed1ab_0    conda-forge
azure-storage-blob        12.13.1            pyhd8ed1ab_0    conda-forge
fsspec                    2022.11.0       py310haa95532_0

it raises this error: RuntimeError: ('Failed to remove %s for %s', [PATHS], ResourceExistsError('This operation is not permitted on a non-empty directory.\nRequestId:ID\nTime:2023-01-20T09:33:26.1693752Z\nErrorCode:DirectoryIsNotEmpty'))

Basically, recursive=True does not work as intended.

efiop commented 1 year ago

CC @daavoo in case it is related to https://github.com/fsspec/adlfs/pull/383

daavoo commented 1 year ago

@PeterFogh , what is PATH in the snippet?

PeterFogh commented 1 year ago

@daavoo - I have tried both direct paths like "container/blob" and protocol paths like "az://container/blob", but both fail the same way.

daavoo commented 1 year ago

Hi @PeterFogh , I am unable to reproduce. Does the dir have any particular structure? Do you call rm with multiple args?

Apart from the existing recursive test (https://github.com/fsspec/adlfs/blob/main/adlfs/tests/test_spec.py#L724) which passes locally and on CI, I have tried a quick script locally (trying different ways of removing subdirs) on a fresh venv and actual bucket:

Output of `pip list`

```console
$ pip list
Package              Version
-------------------- ---------
adal                 1.2.7
adlfs                2023.1.0
aiohttp              3.8.3
aiosignal            1.3.1
async-timeout        4.0.2
attrs                22.2.0
azure-core           1.26.2
azure-datalake-store 0.0.52
azure-identity       1.12.0
azure-storage-blob   12.14.1
certifi              2022.12.7
cffi                 1.15.1
charset-normalizer   2.1.1
cryptography         39.0.0
frozenlist           1.3.3
fsspec               2023.1.0
gitdb                4.0.10
GitPython            3.1.29
idna                 3.4
isodate              0.6.1
msal                 1.20.0
msal-extensions      1.0.0
msrest               0.7.1
multidict            6.0.4
oauthlib             3.2.2
pip                  21.1.1
portalocker          2.7.0
pycparser            2.21
PyJWT                2.6.0
python-dateutil      2.8.2
requests             2.28.2
requests-oauthlib    1.3.1
setuptools           56.0.0
six                  1.16.0
smmap                5.0.0
typing-extensions    4.4.0
urllib3              1.26.14
yarl                 1.8.2
```

`test_rm.py`

```python
from pathlib import Path

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem()

Path("foofile.txt").write_text("foo")

print("INITIAL LS", fs.ls("test-rm", recursive=True))

fs.mkdir("test-rm/foodir")
fs.put_file("foofile.txt", "test-rm/foodir/foofile.txt")
print("AFTER PUT FILE", fs.ls("test-rm", recursive=True))
fs.rm("test-rm/foodir", recursive=True)
print("AFTER RM foodir", fs.ls("test-rm", recursive=True))

fs.mkdir("test-rm/foo/dir")
fs.put_file("foofile.txt", "test-rm/foo/dir/foofile.txt")
print("AFTER PUT FILE", fs.ls("test-rm", recursive=True))
fs.rm("test-rm/foo/dir", recursive=True)
print("AFTER RM FOO/DIR", fs.ls("test-rm", recursive=True))

fs.mkdir("test-rm/foo/dir")
fs.put_file("foofile.txt", "test-rm/foo/dir/foofile.txt")
print("AFTER PUT FILE", fs.ls("test-rm", recursive=True))
fs.rm("test-rm/foo", recursive=True)
print("AFTER RM FOO", fs.ls("test-rm", recursive=True))
```

Output of `python test_rm.py`

```console
$ python test_rm.py
INITIAL LS []
AFTER PUT FILE ['test-rm/foodir']
AFTER RM foodir []
AFTER PUT FILE ['test-rm/foo']
AFTER RM FOO/DIR []
AFTER PUT FILE ['test-rm/foo']
AFTER RM FOO []
```

PeterFogh commented 1 year ago

@daavoo - I still get the error, but I can see that we differ in fsspec version: mine is fsspec 2022.11.0 py310haa95532_0, which does not match the adlfs version 2023.1.0.

After I forced the fsspec version with conda install fsspec=2023.1.0, the code can delete the folder recursively without the error :)

Right now, I'm solving a new conda environment to see if the versions are compatible without any version specifications.

name: py310_readings_4
channels:
  - defaults
  - conda-forge
dependencies:
  - distributed
  - dask
  - adlfs
  - ipykernel
  - matplotlib
  - python=3.10
  - pyarrow
  - pandas

It still solves to mismatching adlfs and fsspec versions:

$ conda list | grep -E "adlfs|fsspec|dask"
adlfs                     2023.1.0           pyhd8ed1ab_0    conda-forge
dask                      2022.7.0        py310haa95532_0  
dask-core                 2022.7.0        py310haa95532_0
fsspec                    2022.11.0       py310haa95532_0
daavoo commented 1 year ago

> Right now, I'm solving a new conda environment to see if the versions are compatible without any version specifications.

I think the problem might be in the defaults channel of conda.

Changing your file to use only conda-forge channel works for me:

$ conda list | grep -E "adlfs|fsspec|dask"
adlfs                     2023.1.0           pyhd8ed1ab_0    conda-forge
dask                      2023.1.0           pyhd8ed1ab_0    conda-forge
dask-core                 2023.1.0           pyhd8ed1ab_0    conda-forge
fsspec                    2023.1.0           pyhd8ed1ab_0    conda-forge
igorng commented 1 year ago

Hello. I am having the exact same issue. What is the proper mitigation procedure please?

daavoo commented 1 year ago

> Hello. I am having the exact same issue. What is the proper mitigation procedure please?

Hi @igorng , how did you install adlfs? via conda?

igorng commented 1 year ago

@daavoo adlfs is installed as a transitive dependency of a package, which is installed with pip

daavoo commented 1 year ago

> @daavoo adlfs is installed as a transitive dependency of a package, which is installed with pip

Could you share pip list? Or just check the fsspec version?

As in my comment above, it works for me with the latest adlfs and fsspec versions. There might be some package in your dependency list that is causing an older version of fsspec to be installed.
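
For anyone who wants to check this without reading through a full `pip list`, here is a minimal sketch that only reports what is installed in the current environment. It uses the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for this example, and it does not decide whether a given combination is compatible.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


# Report the two packages discussed in this thread.
for name, version in installed_versions(["adlfs", "fsspec"]).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Run it with the same interpreter that runs the failing code, so it inspects the same environment.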

igorng commented 1 year ago

For some reason, when I do not put a constraint on the version to use, or when I set it to the latest 2023.1.0, fsspec seems to be taken from conda-forge, while adlfs comes from pypi, and it does not work

$ conda list | grep -E 'adlfs|fsspec'
adlfs                     2023.1.0                 pypi_0                 pypi
...
fsspec                  2023.1.0                 pyhd8ed1ab_0    conda-forge

When I set version to 2022.11.0, both are from pypi, and it works.

$ conda list | grep -E  "adlfs|fsspec"
adlfs                       2022.11.0                pypi_0    pypi
fsspec                    2022.11.0                 pypi_0    pypi

So I downgraded to 2022.11.0 .

Edit: I don't have time to investigate why fsspec 2023.1.0 is coming from conda-forge (complex env here), so since 2022.11.0 works, fine by me ;)

Thank you!

nosterlu commented 1 year ago

I have the same problem with latest adlfs and fsspec. In the Azure container it is maybe 4 folders deep, with parquet files in the last folder. But it does not always throw an error, it depends on the folder structure.

For example, the first time I run this code it only succeeds in deleting 2 out of 3 folders with similar depths.

The next time I run this code (to try and delete the remaining folder), I get this error.

fs = LakeHouse.Install_Base.fs
folders = fs.glob(CONTAINER_INSTALL_BASE_HERCULES + "/*")
print("folders before delete")
print(folders)
fs.rm(folders, recursive=True)
folders = fs.glob(CONTAINER_INSTALL_BASE_HERCULES + "/*")
print("folders left after delete")
print(folders)
folders before delete
['install-base-from-hercules-standard/vehicle_type=536']
Traceback (most recent call last):

  File "C:\temp\Apps\Anaconda64\lib\site-packages\adlfs\spec.py", line 1252, in _rm
    await self._rm_files(container_name, files)

  File "C:\temp\Apps\Anaconda64\lib\site-packages\adlfs\spec.py", line 1281, in _rm_files
    raise ex

  File "C:\temp\Apps\Anaconda64\lib\site-packages\azure\core\tracing\decorator_async.py", line 79, in wrapper_use_tracer
    return await func(*args, **kwargs)

  File "C:\temp\Apps\Anaconda64\lib\site-packages\azure\storage\blob\aio\_container_client_async.py", line 972, in delete_blob
    await blob.delete_blob( # type: ignore

  File "C:\temp\Apps\Anaconda64\lib\site-packages\azure\core\tracing\decorator_async.py", line 79, in wrapper_use_tracer
    return await func(*args, **kwargs)

  File "C:\temp\Apps\Anaconda64\lib\site-packages\azure\storage\blob\aio\_blob_client_async.py", line 600, in delete_blob
    process_storage_error(error)

  File "C:\temp\Apps\Anaconda64\lib\site-packages\azure\storage\blob\_shared\response_handlers.py", line 185, in process_storage_error
    exec("raise error from None")   # pylint: disable=exec-used # nosec

  File "<string>", line 1, in <module>

  File "C:\temp\Apps\Anaconda64\lib\site-packages\azure\storage\blob\aio\_blob_client_async.py", line 598, in delete_blob
    await self._client.blob.delete(**options)

  File "C:\temp\Apps\Anaconda64\lib\site-packages\azure\core\tracing\decorator_async.py", line 79, in wrapper_use_tracer
    return await func(*args, **kwargs)

  File "C:\temp\Apps\Anaconda64\lib\site-packages\azure\storage\blob\_generated\aio\operations\_blob_operations.py", line 685, in delete
    map_error(status_code=response.status_code, response=response, error_map=error_map)

  File "C:\temp\Apps\Anaconda64\lib\site-packages\azure\core\exceptions.py", line 110, in map_error
    raise error

ResourceExistsError: This operation is not permitted on a non-empty directory.
RequestId:ece2ab15-001e-0002-532b-3747c0000000
Time:2023-02-02T17:23:17.7579894Z
ErrorCode:DirectoryIsNotEmpty
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>DirectoryIsNotEmpty</Code><Message>This operation is not permitted on a non-empty directory.
RequestId:ece2ab15-001e-0002-532b-3747c0000000
Time:2023-02-02T17:23:17.7579894Z</Message></Error>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "C:\Users\nosterlu\AppData\Local\Temp\ipykernel_27744\3000828452.py", line 5, in <cell line: 5>
    fs.rm(folders, recursive=True)

  File "C:\temp\Apps\Anaconda64\lib\site-packages\fsspec\asyn.py", line 114, in wrapper
    return sync(self.loop, func, *args, **kwargs)

  File "C:\temp\Apps\Anaconda64\lib\site-packages\fsspec\asyn.py", line 99, in sync
    raise return_result

  File "C:\temp\Apps\Anaconda64\lib\site-packages\fsspec\asyn.py", line 54, in _runner
    result[0] = await coro

  File "C:\temp\Apps\Anaconda64\lib\site-packages\adlfs\spec.py", line 1258, in _rm
    raise RuntimeError("Failed to remove %s for %s", path, e)

RuntimeError: ('Failed to remove %s for %s', ['install-base-from-hercules-standard/vehicle_type=534', 'install-base-from-hercules-standard/vehicle_type=536', 'install-base-from-hercules-standard/vehicle_type=536/', 'install-base-from-hercules-standard/vehicle_type=536/model_year=2024', 'install-base-from-hercules-standard/vehicle_type=539', 'install-base-from-hercules-standard/vehicle_type=539/', 'install-base-from-hercules-standard/vehicle_type=539/vehicle_type=536', 'install-base-from-hercules-standard/vehicle_type=539/vehicle_type=539'], ResourceExistsError('This operation is not permitted on a non-empty directory.\nRequestId:ece2ab15-001e-0002-532b-3747c0000000\nTime:2023-02-02T17:23:17.7579894Z\nErrorCode:DirectoryIsNotEmpty'))
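
The path list in the error above contains a directory both with and without a trailing slash, alongside one of its children. The `DirectoryIsNotEmpty` message suggests the service refuses to delete a directory that still has entries, so the order of deletion may matter. The following is only an illustrative sketch of ordering such a list so that children come before their parents; `deepest_first` is a hypothetical helper, and this is not a description of how adlfs orders deletions internally.

```python
def deepest_first(paths):
    """Return unique paths ordered so that children come before their parents."""
    # Treat 'a/b' and 'a/b/' as the same entry.
    unique = {p.rstrip("/") for p in paths}
    # More '/' separators means deeper; the path itself breaks ties deterministically.
    return sorted(unique, key=lambda p: (p.count("/"), p), reverse=True)


# Three of the paths from the RuntimeError above.
paths = [
    "install-base-from-hercules-standard/vehicle_type=536",
    "install-base-from-hercules-standard/vehicle_type=536/",
    "install-base-from-hercules-standard/vehicle_type=536/model_year=2024",
]
for path in deepest_first(paths):
    print(path)
```

With this ordering, `model_year=2024` is listed before its parent `vehicle_type=536`, and the duplicate with the trailing slash is dropped.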

This is my current setup with pip list. (I have only installed via pip)

APScheduler                                      3.8.1
argon2-cffi                                      21.3.0
argon2-cffi-bindings                             21.2.0
args                                             0.1.0
arrow                                            1.2.3
art                                              5.3
arviz                                            0.11.2
asn1crypto                                       1.5.1
astroid                                          2.12.10
astropy                                          5.1
asttokens                                        2.0.5
async-timeout                                    4.0.1
atomicwrites                                     1.4.1
atpublic                                         3.1.1
attrs                                            22.1.0
Authlib                                          1.2.0
Automat                                          20.2.0
autopep8                                         1.6.0
azure-batch                                      13.0.0
azure-common                                     1.1.27
azure-core                                       1.26.2
azure-datalake-store                             0.0.52
azure-functions                                  1.12.0
azure-identity                                   1.12.0
azure-keyvault                                   4.2.0
azure-keyvault-certificates                      4.6.0
azure-keyvault-keys                              4.7.0
azure-keyvault-secrets                           4.6.0
azure-nspkg                                      3.0.2
azure-storage                                    0.36.0
azure-storage-blob                               12.14.1
azure-storage-file-datalake                      12.9.1
Babel                                            2.10.3
backcall                                         0.2.0
backports.functools-lru-cache                    1.6.4
backports.tempfile                               1.0
backports.weakref                                1.0.post1
bcrypt                                           4.0.0
beautifulsoup4                                   4.11.1
binaryornot                                      0.4.4
bitarray                                         2.5.1
bkcharts                                         0.2
black                                            22.8.0
bleach                                           5.0.1
bokeh                                            2.4.3
boto3                                            1.24.28
botocore                                         1.27.28
Bottleneck                                       1.3.5
brotlipy                                         0.7.0
cachetools                                       4.2.4
catboost                                         1.0.3
certifi                                          2022.9.24
cffi                                             1.15.1
cftime                                           1.5.1.1
chardet                                          5.0.0
charset-normalizer                               2.1.1
click                                            8.1.3
clint                                            0.5.1
cloudpickle                                      2.2.0
clyent                                           1.2.2
cmdstanpy                                        0.9.68
colorama                                         0.4.5
colorcet                                         3.0.0
commonmark                                       0.9.1
comtypes                                         1.1.10
conda                                            22.9.0
conda-build                                      3.22.0
conda-content-trust                              0.1.3
conda-pack                                       0.6.0
conda-package-handling                           1.8.1
conda-repo-cli                                   1.0.5
conda-token                                      0.3.0
conda-verify                                     3.4.2
constantly                                       15.1.0
convertdate                                      2.3.2
cookiecutter                                     2.1.1
coverage                                         6.4.4
cramjam                                          2.5.0
cryptography                                     36.0.2
cssselect                                        1.1.0
cycler                                           0.11.0
Cython                                           0.29.30
cytoolz                                          0.11.0
daal4py                                          2021.5.0
dask                                             2023.1.0
dataprep                                         0.4.5
datashader                                       0.14.1
datashape                                        0.5.4
dateutils                                        0.6.12
debugpy                                          1.6.3
decorator                                        5.1.1
defusedxml                                       0.7.1
Deprecated                                       1.2.13
diff-match-patch                                 20200713
dill                                             0.3.5.1
distributed                                      2023.1.0
docstring-parser                                 0.13
docutils                                         0.19
duckdb                                           0.6.1
duckdb-engine                                    0.6.4
entrypoints                                      0.4
ephem                                            4.1.2
et-xmlfile                                       1.1.0
executing                                        0.8.3
fastjsonschema                                   2.16.2
fastparquet                                      0.8.3
filelock                                         3.6.0
fire                                             0.4.0
flake8                                           4.0.1
Flask                                            2.2.2
Flask-Cors                                       3.0.10
Flask-Login                                      0.6.2
Flask-OAuth                                      0.12
Flask-SQLAlchemy                                 2.5.1
fonttools                                        4.25.0
frozenlist                                       1.2.0
fsspec                                           2023.1.0
future                                           0.18.2
gensim                                           4.1.2
geographiclib                                    1.52
geopy                                            2.2.0
glob2                                            0.7
google-api-core                                  2.2.2
google-api-python-client                         1.12.8
google-auth                                      2.6.0
google-auth-httplib2                             0.1.0
google-cloud-core                                2.2.2
google-cloud-storage                             1.43.0
google-crc32c                                    1.3.0
google-resumable-media                           2.1.0
googleapis-common-protos                         1.58.0
googlemaps                                       4.6.0
graphviz                                         0.19.1
greenlet                                         1.1.1
grpcio                                           1.42.0
h11                                              0.13.0
h5py                                             3.7.0
HeapDict                                         1.0.1
hijri-converter                                  2.2.2
holidays                                         0.13
holoviews                                        1.15.0
html5lib                                         1.1
httplib2                                         0.20.2
hvplot                                           0.8.0
hyperlink                                        21.0.0
ibis                                             3.2.0
ibis-framework                                   3.2.0
idna                                             3.4
imagecodecs                                      2021.8.26
imageio                                          2.19.3
imagesize                                        1.4.1
importlib-metadata                               4.12.0
incremental                                      21.3.0
inflection                                       0.5.1
iniconfig                                        1.1.1
intake                                           0.6.5
intervaltree                                     3.1.0
ipykernel                                        6.16.0
ipython                                          7.34.0
ipython-genutils                                 0.2.0
ipywidgets                                       7.6.5
isodate                                          0.6.0
isort                                            5.10.1
itemadapter                                      0.3.0
itemloaders                                      1.0.4
itsdangerous                                     2.1.2
jaraco.classes                                   3.2.3
jdcal                                            1.4.1
jedi                                             0.18.1
jellyfish                                        0.9.0
Jinja2                                           3.0.3
jinja2-time                                      0.2.0
jmespath                                         0.10.0
joblib                                           1.1.0
json5                                            0.9.6
jsonpath-ng                                      1.5.3
jsonschema                                       4.16.0
jupyter                                          1.0.0
jupyter_client                                   7.3.5
jupyter-console                                  6.4.3
jupyter-core                                     4.11.1
jupyter-server                                   1.18.1
jupyterlab                                       3.4.4
jupyterlab-pygments                              0.2.2
jupyterlab-server                                2.10.3
jupyterlab-widgets                               1.0.0
keyring                                          23.9.3
kfp                                              1.8.10
kfp-pipeline-spec                                0.1.13
kfp-server-api                                   1.7.1
kiwisolver                                       1.4.2
korean-lunar-calendar                            0.2.1
kubernetes                                       18.20.0
lazy-object-proxy                                1.7.1
libarchive-c                                     2.9
libmambapy                                       0.25.0
lightgbm                                         3.3.3
line-profiler                                    3.5.1
llvmlite                                         0.39.1
locket                                           1.0.0
LunarCalendar                                    0.0.9
lxml                                             4.9.1
lz4                                              3.1.3
mamba                                            0.25.0
Markdown                                         3.3.4
MarkupSafe                                       2.1.1
matplotlib                                       3.5.2
matplotlib-inline                                0.1.6
mccabe                                           0.6.1
menuinst                                         1.4.18
Metaphone                                        0.6
mistune                                          2.0.4
mkl-fft                                          1.3.1
mkl-random                                       1.2.2
mkl-service                                      2.4.0
mock                                             4.0.3
more-itertools                                   8.14.0
mpmath                                           1.2.1
msal                                             1.16.0
msal-extensions                                  0.3.0
msgpack                                          1.0.4
msrest                                           0.7.1
msrestazure                                      0.6.4
multidict                                        5.2.0
multipledispatch                                 0.6.0
munkres                                          1.1.4
mypy-extensions                                  0.4.3
navigator-updater                                0.2.1
nbclassic                                        0.3.5
nbclient                                         0.6.8
nbconvert                                        7.0.0
nbformat                                         5.6.1
nest-asyncio                                     1.5.5
netCDF4                                          1.5.7
networkx                                         2.8.4
nltk                                             3.7
nodejs                                           0.1.1
nose                                             1.3.7
notebook                                         6.4.12
npm                                              0.1.1
numba                                            0.56.2
numexpr                                          2.8.3
numpy                                            1.22.4
numpydoc                                         1.4.0
O365                                             2.0.16
oauth2                                           1.9.0.post1
oauthlib                                         3.1.1
olefile                                          0.46
opencensus                                       0.11.0
opencensus-context                               0.1.3
opencensus-ext-azure                             1.1.7
openpyxl                                         3.0.10
optional-django                                  0.1.0
oscrypto                                         1.2.1
outcome                                          1.1.0
p3270                                            0.1.3
packaging                                        21.3
pandas                                           1.4.2
pandocfilters                                    1.5.0
panel                                            0.13.1
param                                            1.12.0
paramiko                                         2.11.0
parsel                                           1.6.0
parso                                            0.8.3
parsy                                            2.0
partd                                            1.3.0
pathlib                                          1.0.1
pathspec                                         0.10.1
patsy                                            0.5.2
pep8                                             1.7.1
pexpect                                          4.8.0
pickleshare                                      0.7.5
Pillow                                           8.4.0
pip                                              22.3.1
pkginfo                                          1.8.2
platformdirs                                     2.5.2
plotly                                           5.9.0
pluggy                                           1.0.0
ply                                              3.11
polars                                           0.14.25
portalocker                                      1.7.1
poyo                                             0.5.0
prometheus-client                                0.14.1
prompt-toolkit                                   3.0.31
prophet                                          1.0.1
Protego                                          0.1.16
protobuf                                         3.20.3
psutil                                           5.9.2
ptyprocess                                       0.7.0
pure-eval                                        0.2.2
py                                               1.11.0
pyan3                                            1.1.1
pyarrow                                          8.0.0
pyasn1                                           0.4.8
pyasn1-modules                                   0.2.8
pybind11                                         2.10.0
pycallgraph2                                     1.1.3
pycodestyle                                      2.8.0
pycosat                                          0.6.3
pycparser                                        2.21
pycryptodome                                     3.11.0
pycryptodomex                                    3.11.0
pyct                                             0.4.8
pycurl                                           7.45.1
pydantic                                         1.10.2
PyDispatcher                                     2.0.5
pydocstyle                                       6.1.1
pydot                                            1.4.2
pyerfa                                           2.0.0
pyflakes                                         2.4.0
PyGithub                                         1.55
Pygments                                         2.13.0
PyHamcrest                                       2.0.2
PyJWT                                            2.4.0
pylint                                           2.15.3
pyls-spyder                                      0.4.0
pymannkendall                                    1.4.2
PyMeeus                                          0.5.11
PyNaCl                                           1.5.0
pyodbc                                           4.0.34
pyOpenSSL                                        21.0.0
pyparsing                                        3.0.9
PyQt5                                            5.15.7
PyQt5-Qt5                                        5.15.2
PyQt5-sip                                        12.11.0
PyQtChart                                        5.12
PyQtWebEngine                                    5.15.6
PyQtWebEngine-Qt5                                5.15.2
pyreadline                                       2.1
pyrsistent                                       0.18.1
PySocks                                          1.7.1
pystan                                           2.19.1.1
pytest                                           7.1.2
python-crfsuite                                  0.9.8
python-dateutil                                  2.8.2
python-dotenv                                    0.19.2
python-lsp-black                                 1.2.1
python-lsp-jsonrpc                               1.0.0
python-lsp-server                                1.5.0
python-slugify                                   6.1.2
python-snappy                                    0.6.0
python-stdnum                                    1.17
pytoolconfig                                     1.2.2
pytz                                             2022.2.1
pytz-deprecation-shim                            0.1.0.post0
pyviz-comms                                      2.0.2
PyWavelets                                       1.3.0
pywin32                                          305
pywin32-ctypes                                   0.2.0
pywinpty                                         2.0.2
PyYAML                                           6.0
pyzmq                                            24.0.1
QDarkStyle                                       3.0.3
qstylizer                                        0.2.2
QtAwesome                                        1.1.1
qtconsole                                        5.3.2
QtPy                                             2.2.0
queuelib                                         1.5.0
rapidfuzz                                        2.13.2
regex                                            2021.11.10
requests                                         2.28.1
requests-file                                    1.5.1
requests-oauthlib                                1.3.0
requests-toolbelt                                0.9.1
rich                                             12.5.1
rope                                             1.3.0
rsa                                              4.8
Rtree                                            1.0.0
ruamel-yaml-conda                                0.15.100
s3transfer                                       0.6.0
scikit-image                                     0.19.2
scikit-learn                                     1.1.1
scikit-learn-intelex                             2021.20220215.102710
scipy                                            1.9.3
Scrapy                                           2.6.2
seaborn                                          0.11.2
selenium                                         4.1.3
Send2Trash                                       1.8.0
service-identity                                 18.1.0
setuptools                                       59.8.0
setuptools-git                                   1.2
sip                                              4.19.13
six                                              1.16.0
smart-open                                       5.2.1
sniffio                                          1.2.0
snowballstemmer                                  2.2.0
snowflake                                        0.0.3
snowflake-connector-python                       2.9.0
sortedcollections                                2.1.0
sortedcontainers                                 2.4.0
soupsieve                                        2.3.2.post1
Sphinx                                           5.2.2
sphinxcontrib-applehelp                          1.0.2
sphinxcontrib-devhelp                            1.0.2
sphinxcontrib-htmlhelp                           2.0.0
sphinxcontrib-jsmath                             1.0.1
sphinxcontrib-qthelp                             1.0.3
sphinxcontrib-serializinghtml                    1.1.5
spyder                                           5.3.3
spyder-kernels                                   2.3.3
SQLAlchemy                                       1.3.24
sqlglot                                          6.2.6
stack-data                                       0.2.0
statsmodels                                      0.13.2
stringcase                                       1.2.0
strip-hints                                      0.1.10
style                                            1.1.0
sympy                                            1.10.1
tables                                           3.6.1
tabulate                                         0.8.10
TBB                                              0.2
tblib                                            1.7.0
tenacity                                         8.0.1
teradatasql                                      17.10.0.7
teradatasqlalchemy                               17.0.0.3
termcolor                                        1.1.0
terminado                                        0.13.1
testpath                                         0.6.0
text-unidecode                                   1.3
textdistance                                     4.5.0
threadpoolctl                                    2.2.0
three-merge                                      0.1.1
thrift                                           0.16.0
tifffile                                         2021.7.2
tinycss                                          0.4
tinycss2                                         1.1.1
tldextract                                       3.2.0
toml                                             0.10.2
tomli                                            2.0.1
tomlkit                                          0.11.5
toolz                                            0.12.0
tornado                                          6.2
tqdm                                             4.64.0
traitlets                                        5.4.0
trio                                             0.20.0
trio-websocket                                   0.9.2
Twisted                                          22.2.0
twisted-iocpsupport                              1.0.2
typed-ast                                        1.4.3
typer                                            0.4.0
typing_extensions                                4.3.0
tzdata                                           2021.5
tzlocal                                          2.1
ujson                                            5.5.0
Unidecode                                        1.2.0
update                                           0.0.1
uritemplate                                      3.0.1
urllib3                                          1.26.12
varname                                          0.8.3
w3lib                                            1.21.0
waitress                                         2.1.2
watchdog                                         2.1.9
wcwidth                                          0.2.5
webencodings                                     0.5.1
websocket-client                                 1.2.3
Werkzeug                                         2.2.2
whatthepatch                                     1.0.2
wheel                                            0.37.1
widgetsnbextension                               3.5.2
win-inet-pton                                    1.1.0
win-unicode-console                              0.5
wincertstore                                     0.2
wordcloud                                        1.8.2.2
wrapt                                            1.14.1
wsproto                                          1.1.0
xarray                                           0.20.1
xgboost                                          1.7.1
xlrd                                             2.0.1
XlsxWriter                                       3.0.3
xlwings                                          0.24.9
xmltodict                                        0.12.0
yapf                                             0.32.0
yarl                                             1.8.1
zict                                             2.2.0
zipp                                             3.8.1
zope.interface                                   5.4.0
daavoo commented 1 year ago

Thanks for the details, @nosterlu! I am going to try to reproduce it.

cjalmeida commented 1 year ago

Bumping as I'm having the same issue.

Calling fs.rm(path, recursive=True), where path is something like container_name/folder containing 4 files, I get:

File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/adlfs/spec.py", line 1259, in _rm
    await self._rm_files(container_name, files)
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/adlfs/spec.py", line 1288, in _rm_files
    raise ex
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/azure/core/tracing/decorator_async.py", line 77, in wrapper_use_tracer
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/azure/storage/blob/aio/_container_client_async.py", line 1035, in delete_blob
    await blob.delete_blob( # type: ignore
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/azure/core/tracing/decorator_async.py", line 77, in wrapper_use_tracer
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/azure/storage/blob/aio/_blob_client_async.py", line 618, in delete_blob
    process_storage_error(error)
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/azure/storage/blob/_shared/response_handlers.py", line 189, in process_storage_error
    exec("raise error from None")   # pylint: disable=exec-used # nosec
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 1, in <module>
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/azure/storage/blob/aio/_blob_client_async.py", line 616, in delete_blob
    await self._client.blob.delete(**options)
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/azure/core/tracing/decorator_async.py", line 77, in wrapper_use_tracer
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/azure/storage/blob/_generated/aio/operations/_blob_operations.py", line 691, in delete
    map_error(status_code=response.status_code, response=response, error_map=error_map)
  File "/home/cjalmeida/work/myproject/.venv/lib/python3.11/site-packages/azure/core/exceptions.py", line 164, in map_error
    raise error
azure.core.exceptions.ResourceExistsError: This operation is not permitted on a non-empty directory.
RequestId:f3042353-701e-0069-7a68-dc7b41000000
Time:2023-09-01T00:08:11.0327332Z
ErrorCode:DirectoryIsNotEmpty
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>DirectoryIsNotEmpty</Code><Message>This operation is not permitted on a non-empty directory.
RequestId:f3042353-701e-0069-7a68-dc7b41000000
Time:2023-09-01T00:08:11.0327332Z</Message></Error>

Versions:

fsspec==2023.6.0
adlfs==2023.8.0
Python==3.10

Installed from pip. Also, it only happens sometimes, usually when I try to remove the files shortly (e.g., < 5 min) after creating them.

cjalmeida commented 1 year ago

For the record, can't reproduce the bug when on 2022.11.2

nosterlu commented 1 year ago

I still have the same issue too. I've been working around it by removing all files manually for the time being 😄

tspanos commented 1 year ago

I'm running into the same problem with version 2023.10.0 (adlfs/fsspec). Conda environment with Python 3.11.5 and azure-storage-blob version 12.18.3.

If I repeatedly run the command, it eventually works. It seems to delete some of the nested folders/files each time I run it, but errors before it completes all of them.

Tom-Newton commented 12 months ago

I think this problem might only affect storage accounts with hierarchical namespace enabled (ADLS gen2). I can reproduce it with basically any recursive delete on a hierarchical namespace account but not on a flat namespace account.

Is everyone else having issues, also using hierarchical namespace accounts?

This would also explain why it can't be reproduced in any integration test: azurite does not support hierarchical namespace. The Azure error also says "non-empty directory", but flat namespace accounts don't have directories, so it would be strange to receive that error on a flat namespace account.

Tom-Newton commented 12 months ago

I'm pretty sure https://github.com/fsspec/adlfs/pull/383 is the cause. It works correctly on the commit before it but fails with this commit.

I haven't fully understood what this PR does, but its description suggests why this is happening:

Group files by container_name and use asyncio.gather to remove the groups.

When using the Azure blob client, hierarchical namespace directories look a lot like blobs. If we asynchronously delete all the relevant blobs, it's highly likely that we attempt to delete the directory marker blob before we've finished deleting all of its contents.
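
One way to picture the ordering constraint described above is to delete level by level, deepest paths first, so a directory marker is only removed once everything beneath it is gone. The following is a minimal sketch under stated assumptions, not the adlfs implementation: `remove_deepest_first`, `delete_one`, and the in-memory `fake_delete` store are hypothetical names used only for illustration, and paths are assumed to be normalized without trailing slashes.

```python
import asyncio


async def remove_deepest_first(paths, delete_one):
    """Delete paths level by level, deepest first.

    `delete_one` is a hypothetical async callable that deletes one path.
    """
    by_depth = {}
    for path in set(paths):
        by_depth.setdefault(path.count("/"), []).append(path)

    for depth in sorted(by_depth, reverse=True):
        # Paths at the same depth cannot contain one another, so they can be
        # deleted concurrently; shallower levels wait for this level to finish.
        await asyncio.gather(*(delete_one(p) for p in by_depth[depth]))


async def demo():
    # In-memory stand-in for a hierarchical namespace container (assumption).
    remaining = {"c/dir", "c/dir/a", "c/dir/sub", "c/dir/sub/b"}
    order = []

    async def fake_delete(path):
        # Mimic the service refusing to delete a non-empty directory.
        if any(other.startswith(path + "/") for other in remaining):
            raise RuntimeError(f"directory not empty: {path}")
        await asyncio.sleep(0)
        remaining.discard(path)
        order.append(path)

    await remove_deepest_first(sorted(remaining), fake_delete)
    return order


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

This keeps concurrency within each level but gives up some of it across levels, which is the trade-off against a single asyncio.gather over every path.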

nosterlu commented 12 months ago

I think this problem might only affect storage accounts with hierarchical namespace enabled (ADLS gen2). I can reproduce it with basically any recursive delete on a hierarchical namespace account but not on a flat namespace account.

Is everyone else having issues, also using hierarchical namespace accounts?

This would also explain why it can't be reproduced in any integration test: azurite does not support hierarchical namespace. The Azure error also says "non-empty directory", but flat namespace accounts don't have directories, so it would be strange to receive that error on a flat namespace account.

Using hierarchical here!

Tom-Newton commented 12 months ago

I think it should be quite straightforward to fix. I'll give it a try

Tom-Newton commented 12 months ago

Probably not the neatest solution but I think it works. https://github.com/Tom-Newton/adlfs/pull/1

pip install https://github.com/Tom-Newton/adlfs/archive/aec77b00c1fa7fb5bfbbec88e1c9fac45f133e97.zip

I think ideally we would change the way it does listing so that files contains file details not just path strings. That would provide a better option for distinguishing files from directories.
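
To illustrate the idea of carrying file details rather than bare path strings, here is a hedged sketch. It assumes the listing is available as a mapping from path to an info dict with a "type" key of "file" or "directory", which is the general shape fsspec uses for detailed listings; whether directory markers on a hierarchical namespace account are reliably reported as "directory" is an assumption here. `split_files_and_directories` and the sample `details` are hypothetical.

```python
def split_files_and_directories(details):
    """Partition a {path: info} listing into files and directories.

    Directories are returned deepest first so they can be removed after
    the files, children before parents.
    """
    files = []
    directories = []
    for path, info in details.items():
        if info.get("type") == "directory":
            directories.append(path)
        else:
            files.append(path)
    directories.sort(key=lambda p: p.count("/"), reverse=True)
    return sorted(files), directories


# Hypothetical listing: note the empty directory and the extensionless file.
details = {
    "container/data": {"type": "directory"},
    "container/data/empty": {"type": "directory"},
    "container/data/orders.parquet": {"type": "file"},
    "container/data/lines": {"type": "file"},
}
files, directories = split_files_and_directories(details)
print(files)
print(directories)
```

With explicit types, an empty directory and a file without an extension are both classified without any guessing from path prefixes.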

nosterlu commented 11 months ago

Probably not the neatest solution but I think it works. Tom-Newton#1

pip install https://github.com/Tom-Newton/adlfs/archive/aec77b00c1fa7fb5bfbbec88e1c9fac45f133e97.zip

I think ideally we would change the way it does listing so that files contains file details not just path strings. That would provide a better option for distinguishing files from directories.

Nice! I played around a little also when trying to understand how your code worked!

This could maybe make it a little bit clearer?

async def _identify_directory_markers(self, files):
    """
    Identify the files and directory markers from the given list of files.
    A directory marker is identified if another file starts with the marker's name followed by '/'.
    """
    files = sorted(set(files))  # Remove duplicates and sort
    directory_markers = []
    blobs = []

    for i, file in enumerate(files):
        if i + 1 < len(files) and files[i + 1].startswith(file + "/"):
            # If the next file starts with the current file's name followed by '/',
            # consider it a directory marker.
            directory_markers.append(file)
        else:
            # Otherwise, it's a regular file/blob.
            blobs.append(file)

    return blobs, directory_markers

For example, for testing:

if __name__ == "__main__":

    def _identify_directory_markers_test(files):
        """
        Identify the files and directory markers from the given list of files.
        A directory marker is identified if another file starts with the marker's name
        followed by '/'.
        """
        files = sorted(set(files))  # Remove duplicates and sort
        directory_markers = []
        blobs = []

        for i, file in enumerate(files):
            if i + 1 < len(files) and files[i + 1].startswith(file + "/"):
                # If the next file starts with the current file's name followed by '/',
                # consider it a directory marker.
                directory_markers.append(file)
            else:
                # Otherwise, it's a regular file/blob.
                blobs.append(file)

        return blobs, directory_markers

    files = [
        "ptp_parquets/transports_unmapped.parquet",
        "ptp_parquets/transports.parquet",
        "ptp_parquets/road_transports.parquet",
        "ptp_parquets/ptp_parquets/transports_unmapped.parquet",
        "ptp_parquets/ptp_parquets/transports.parquet",
        "ptp_parquets/ptp_parquets/road_transports.parquet",
        "ptp_parquets/ptp_parquets/prod_prc.parquet",
        "ptp_parquets/ptp_parquets/packed_yesterday.parquet",
        "ptp_parquets/ptp_parquets/packed_transports.parquet",
        "ptp_parquets/ptp_parquets/orders.parquet",
        "ptp_parquets/ptp_parquets/lines.parquet",
        "ptp_parquets/ptp_parquets/flight_transports.parquet",
        "ptp_parquets/ptp_parquets/dist_freight.parquet",
        "ptp_parquets/ptp_parquets/container_transports.parquet",
        "ptp_parquets/ptp_parquets",  # folder
        "ptp_parquets/ptp_parquets",  # folder
        "ptp_parquets/prod_prc.parquet",
        "ptp_parquets/packed_yesterday.parquet",
        "ptp_parquets/packed_transports.parquet",
        "ptp_parquets/orders.parquet",
        "ptp_parquets/lines.parquet",
        "ptp_parquets/flight_transports.parquet",
        "ptp_parquets/dist_freight.parquet",
        "ptp_parquets/container_transports",  # file with no file ending!
        "ptp_parquets",   # folder
        "ptp_parquets",  # folder
    ]
    blobs, directory_markers = _identify_directory_markers_test(files)
    print("FILES")
    for blob in blobs:
        print(blob)
    print("\nDIRs")
    for d in directory_markers:
        print(d)

If all the entries are directories, like this:

files = ["ptp_parquets", "orders"]

they will all end up as blobs. I guess it will still work, since there are no dangling files within them, but maybe there is a better way to identify files vs. folders in Azure storage... 😅
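
One more edge case with the adjacent-entry check above: it relies on a directory marker being immediately followed by one of its children in sorted order, but characters such as "." and "-" sort before "/". With "data", "data.bak", and "data/a.parquet", the entry after "data" is "data.bak", so "data" would be reported as a blob. A minimal sketch that avoids depending on sort order, by collecting every parent prefix first, could look like this (`split_blobs_and_directory_markers` is a hypothetical name, and empty directories are still reported as blobs, as noted above):

```python
def split_blobs_and_directory_markers(paths):
    """Split paths into blobs and directory markers using parent prefixes.

    A path counts as a directory marker if it is the parent of any other
    listed path, regardless of where it falls in sorted order.
    """
    unique = set(paths)

    # Collect every proper parent prefix of every listed path.
    parents = set()
    for path in unique:
        parts = path.split("/")
        for i in range(1, len(parts)):
            parents.add("/".join(parts[:i]))

    directory_markers = sorted(unique & parents)
    blobs = sorted(unique - parents)
    return blobs, directory_markers


blobs, markers = split_blobs_and_directory_markers(
    ["data", "data.bak", "data/a.parquet", "data"]
)
print(blobs)
print(markers)
```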

Tom-Newton commented 11 months ago

@daavoo do you have any thoughts on this? https://github.com/fsspec/adlfs/pull/383 seems to remove the _isfile calls, but something along those lines is required to support hierarchical namespace storage accounts (ADLS gen2).

Tom-Newton commented 10 months ago

How would we feel about just reverting https://github.com/fsspec/adlfs/pull/383? I know it provides a big performance advantage for flat namespace accounts but it also breaks hierarchical namespace (ADLS gen2) accounts.

I think something like what @nosterlu and I described could work but there are probably edge cases to consider.