Closed maresb closed 1 month ago
By the way, is there some simple and safe way to nuke the contents of a `MemoryFileSystem`?
Answering my own question: https://github.com/fsspec/filesystem_spec/blob/176efbe02179f30b5862bf7444a383b8e62f87df/fsspec/conftest.py#L13-L27
My use case:
I am writing a Zarr to a `MemoryFileSystem`, and that Zarr contains about 100,000 files. Subsequently, I overwrite the original Zarr with a much smaller Zarr, and this takes about half an hour. I was really perplexed that such a seemingly small operation would require so much time.
Profiling the problem:
By profiling, I tracked it down to the `rm(path, recursive=True)` operation triggered by `xr.Dataset.to_zarr(mode='w')`. We iterate over all the files to be deleted,
https://github.com/fsspec/filesystem_spec/blob/176efbe02179f30b5862bf7444a383b8e62f87df/fsspec/implementations/memory.py#L257-L267
but the problem is that `info`, and hence `isfile` and `exists`, all have runtime $O(N)$ in the number of files $N$:
https://github.com/fsspec/filesystem_spec/blob/176efbe02179f30b5862bf7444a383b8e62f87df/fsspec/spec.py#L707-L712
https://github.com/fsspec/filesystem_spec/blob/176efbe02179f30b5862bf7444a383b8e62f87df/fsspec/implementations/memory.py#L149-L169
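The problematic line is 153: in order to rule out that the path is a directory, we iterate over every file to check whether the path is a parent directory of any of them. Since `rm(path, recursive=True)` performs this check once per file, deleting $N$ files costs roughly $O(N^2)$ in total.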
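To make the scaling concrete, here is a minimal toy model of the pattern described above. It is not fsspec's code: `FlatStore`, `is_dir`, `rm_recursive`, and `count_comparisons` are hypothetical names, and the store is assumed to be a flat dict keyed by full path. It counts key comparisons instead of measuring wall time, so the quadratic growth is exact rather than noisy.

```python
class FlatStore:
    """Toy flat in-memory store (hypothetical; NOT fsspec's implementation)."""

    def __init__(self):
        self.files = {}        # full path -> contents
        self.comparisons = 0   # number of key comparisons performed

    def is_dir(self, path):
        # O(N): scan stored keys to see whether `path` is a parent of any file.
        prefix = path.rstrip("/") + "/"
        for key in self.files:
            self.comparisons += 1
            if key.startswith(prefix):
                return True
        return False

    def info(self, path):
        # The directory check runs first, so even a plain file pays the full scan.
        if self.is_dir(path):
            return {"name": path, "type": "directory"}
        if path in self.files:
            return {"name": path, "type": "file"}
        raise FileNotFoundError(path)

    def rm_recursive(self, root):
        prefix = root.rstrip("/") + "/"
        targets = [key for key in self.files if key.startswith(prefix)]
        for path in targets:
            # One O(N) `info` call per file gives O(N^2) overall.
            if self.info(path)["type"] == "file":
                del self.files[path]


def count_comparisons(n):
    """Create n files under 'root/', delete them all, return comparisons made."""
    store = FlatStore()
    for i in range(n):
        store.files[f"root/chunk.{i}"] = b""
    store.rm_recursive("root")
    return store.comparisons


for n in (100, 200, 400):
    print(n, count_comparisons(n))  # n * (n + 1) / 2 comparisons each time
```

Doubling the number of files roughly quadruples the work in this model, which is the same shape as the slope of about 2 measured for deleting all files.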
Quantifying the effect
Let's collect some timing data.
Deleting a single file
Log-log plot, so that a straight line corresponds to a power law:
Doing a linear regression (dropping 1 and 10 files), the log-log slope is 0.973 (very close to linear), and the log10 intercept is -6.26, so time in seconds is about
$$10^{0.973\, \log_{10}(N) - 6.26} \approx \frac{N^{0.973}}{1.8 \times 10^6}.$$
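For reference, a log-log regression of this kind can be sketched with the standard library alone. The helper name `loglog_fit` is hypothetical, and the data below are synthetic (generated from an exact power law $t = N^2 / (3 \times 10^6)$), not the measurements from this issue. The fit recovers the exponent as the slope and $\log_{10}$ of the prefactor as the intercept.

```python
import math


def loglog_fit(ns, seconds):
    """Least-squares fit of log10(seconds) = slope * log10(n) + intercept."""
    xs = [math.log10(n) for n in ns]
    ys = [math.log10(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic timings from an exact power law (NOT measured data).
ns = [100 * 2**k for k in range(9)]          # 100, 200, ..., 25600
synthetic = [n**2 / 3e6 for n in ns]

slope, intercept = loglog_fit(ns, synthetic)
print(f"slope={slope:.3f}, log10 intercept={intercept:.3f}")
```

With real timings, the smallest sizes tend to be dominated by fixed overhead, which is presumably why the 1- and 10-file points were dropped before fitting.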
Deleting all files
Here we observe a painful 2 minutes to delete 25,600 files.
Regression gives the number of seconds to be about
$$10^{1.945\, \log_{10}(N) - 6.5} \approx \frac{N^{1.945}}{3 \times 10^6}.$$
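As a sanity check, the two fitted power laws can be evaluated directly. The sketch below only plugs the slopes and intercepts quoted above into the formula; `predicted_seconds` is a hypothetical helper name, and the value at 100,000 files is an extrapolation beyond the measured range.

```python
import math


def predicted_seconds(n, slope, log10_intercept):
    """Evaluate the fitted power law 10 ** (slope * log10(n) + intercept)."""
    return 10 ** (slope * math.log10(n) + log10_intercept)


# Coefficients as reported by the two regressions above.
single_25600 = predicted_seconds(25_600, 0.973, -6.26)   # one file among 25,600
all_25600 = predicted_seconds(25_600, 1.945, -6.5)       # delete all 25,600
all_100k = predicted_seconds(100_000, 1.945, -6.5)       # extrapolated

print(f"single delete at N=25,600: {single_25600 * 1e3:.1f} ms")
print(f"delete all at N=25,600:    {all_25600:.0f} s")
print(f"delete all at N=100,000:   {all_100k / 60:.0f} min")
```

The second fit gives roughly two minutes at 25,600 files, matching the observation, and a little under half an hour at 100,000 files, which is consistent with the overwrite time in the original use case.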
Thoughts regarding a solution
~Unfortunately I don't see any obvious simple solution that doesn't increase code complexity. After briefly tinkering with the test suite, it seems like there are some delicate edge cases for which some obvious simplifying tweaks fail.~
EDIT: Found one in #1725!