fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

Removing large zarr stores recursively blows up memory #690

Open jbusecke opened 3 years ago

jbusecke commented 3 years ago

I am trying to remove a rather large zarr store from the pangeo scratch bucket:

I am doing this on the pangeo google deployment (production)

import fsspec

fs = fsspec.filesystem('gcs')
print(fs.ls('pangeo-scratch/jbusecke/detrending_test/'))

# delete each store under the scratch prefix, one at a time
for path in fs.ls('pangeo-scratch/jbusecke/detrending_test/'):
    fs.rm(path, recursive=True)
    print(f'Finished with {path}')

This takes quite a while, during which memory usage of the notebook process increases steadily until the kernel eventually crashes.

Is this a misuse of fsspec? (I am very new to all of this.)

Or is this a bug that might have a workaround? Any help debugging this would be greatly appreciated.

martindurant commented 3 years ago

This seems like a very reasonable thing to do. Please post your versions. Do the files in the paths that execute before the crash successfully disappear?

jbusecke commented 3 years ago

Hi @martindurant, sorry for the radio silence. I was on vacation and stuff got a bit hectic before leaving.

This occurs with fsspec version 2021.04.0; which other package versions would be helpful for you to know?

Here is the output of xr.show_versions() just in case:

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.109+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.17.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.2
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.7.1
cftime: 1.4.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.2
cfgrib: 0.9.9.0
iris: None
bottleneck: 1.3.2
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: 3.4.1
cartopy: 0.19.0
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 20.3.4
conda: None
pytest: None
IPython: 7.22.0
sphinx: None

Do the files in the paths that execute before the crash successfully disappear?

I think some of the chunks were successfully deleted, if I remember correctly. I just tried again on a server with more RAM: I was able to delete the smaller of two similarly large zarr stores, but memory usage was still quite high (it went up by 2+ GB), and the larger store blew out the memory again. Is the memory footprint supposed to scale with the size of the deleted folder?

Do you have any suggestions for further debugging? Many thanks.
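
One minimal way to narrow this down (a sketch, assuming the standard-library tracemalloc module and the same scratch path as above) would be to measure the listing and the deletion phases separately; tracemalloc only sees Python-level allocations, but the path lists and pending calls live there, so it should show which phase dominates:

import tracemalloc
import fsspec

fs = fsspec.filesystem('gcs')
tracemalloc.start()

# phase 1: list every key under the store
paths = fs.find('pangeo-scratch/jbusecke/detrending_test/')
current, peak = tracemalloc.get_traced_memory()
print(f'after listing {len(paths)} keys: current={current / 1e6:.0f} MB, peak={peak / 1e6:.0f} MB')

# phase 2: issue the delete calls
fs.rm(paths)
current, peak = tracemalloc.get_traced_memory()
print(f'after deleting: current={current / 1e6:.0f} MB, peak={peak / 1e6:.0f} MB')
tracemalloc.stop()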

martindurant commented 3 years ago

@isidentical , do you think it might help to replace the explicit asyncio.gather in GCSFileSystem._rm with the throttled version?

@jbusecke , it is hard to tell where the memory runs out - it would be useful to know whether it is during the listing of the files or when issuing the delete calls. If some files were indeed deleted, this suggests the latter, in which case my suggestion in the line above might help. Also, you might try increasing the batchsize argument to rm, which makes for fewer, larger calls (and I think it's the number of calls which will pose the problem).
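
For example (a sketch: the keyword is assumed to be spelled batch_size in the installed fsspec/gcsfs, and the value 1000 is only illustrative):

for path in fs.ls('pangeo-scratch/jbusecke/detrending_test/'):
    # fewer, larger delete calls than the default batching
    fs.rm(path, recursive=True, batch_size=1000)
    print(f'Finished with {path}')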

Another improvement we could make is to add an on_error argument, as is done for some other batch methods, so that we can avoid return_exceptions=True in gather; that might save memory for failed calls.

isidentical commented 3 years ago

@isidentical , do you think it might help to replace the explicit asyncio.gather in GCSFileSystem._rm with the throttled version?

If the memory consumption scales with the number of calls to _rm_files, then yes, it might work.

martindurant commented 3 years ago

@jbusecke , the throttled gather is fsspec.asyn._run_coros_in_chunks, where you would specify a chunk size. Would you like to try implementing this and see if it helps?

@isidentical , it occurs to me that it might be nice if that function accepted a generator for coros, not just a list, so that they don't all need to be instantiated ahead of time.
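
A rough sketch of what that swap inside gcsfs could look like, assuming _run_coros_in_chunks accepts a list of coroutines plus batch_size and return_exceptions keywords, and that the async helpers _expand_path and _rm_files behave as their names suggest (the real signatures in fsspec/gcsfs may differ):

from fsspec.asyn import _run_coros_in_chunks

async def _rm(self, path, recursive=False, batch_size=100, **kwargs):
    # expand the prefix into individual keys (this list itself can be large)
    paths = await self._expand_path(path, recursive=recursive)

    # one coroutine per batch of keys, instead of one gather over everything
    coros = [
        self._rm_files(paths[i:i + batch_size])
        for i in range(0, len(paths), batch_size)
    ]

    # throttled execution: only a limited number of coroutines are pending
    # at any one time, rather than asyncio.gather over the full list
    results = await _run_coros_in_chunks(coros, return_exceptions=True)
    errors = [r for r in results if isinstance(r, Exception)]
    if errors:
        raise errors[0]

If _run_coros_in_chunks also accepted a generator, the list comprehension above could become a generator expression, so the batch coroutines would not all need to be instantiated up front.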