fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

MemoryFileSystem time complexity of rm renders it unusable with large number of files #1724

Closed maresb closed 1 month ago

maresb commented 1 month ago

My use case:

I am writing a Zarr to a MemoryFileSystem, and that Zarr contains about 100,000 files. Subsequently, I overwrite the original Zarr with a much smaller Zarr, and this takes about half an hour. I was really perplexed that such a seemingly small operation would require so much time.

Profiling the problem:

By profiling, I tracked it down to the rm(path, recursive=True) operation triggered by xr.Dataset.to_zarr(mode='w').

We iterate over all the files to be deleted,

https://github.com/fsspec/filesystem_spec/blob/176efbe02179f30b5862bf7444a383b8e62f87df/fsspec/implementations/memory.py#L257-L267

but the problem is that info and hence isfile and exists all have runtime $O(N)$ in the number of files $N$:

https://github.com/fsspec/filesystem_spec/blob/176efbe02179f30b5862bf7444a383b8e62f87df/fsspec/spec.py#L707-L712

https://github.com/fsspec/filesystem_spec/blob/176efbe02179f30b5862bf7444a383b8e62f87df/fsspec/implementations/memory.py#L149-L169

The problematic line is 153, where, in order to rule out that the path is a directory, we iterate over every file to check whether the path is a parent directory of any file.
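To make the quadratic behavior concrete, here is a simplified sketch (illustrative only, not the actual fsspec code) of the pattern: each per-file deletion does its own $O(N)$ scan of the store to rule out that the path is an implicit directory, so deleting all $N$ files costs $O(N^2)$.

```python
# Simplified sketch of the MemoryFileSystem deletion pattern (not the real code).
# The store maps paths to file contents.
store = {f"/file{i}": b"" for i in range(1000)}

def is_implicit_dir(path: str) -> bool:
    # A path is an implicit directory if it is a prefix of some stored path.
    prefix = path.rstrip("/") + "/"
    return any(key.startswith(prefix) for key in store)  # O(N) scan

def rm_file(path: str) -> None:
    if is_implicit_dir(path):  # O(N) check per file...
        raise IsADirectoryError(path)
    del store[path]

for path in list(store):  # ...repeated N times => O(N^2) total
    rm_file(path)
```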

Quantifying the effect

Let's collect some timing data.

Deleting a single file

```python
from time import time

from fsspec.implementations.memory import MemoryFileSystem

fs = MemoryFileSystem("memory://store")

delete_single_file_time: dict[int, float] = {}
for num_files in [10 ** i for i in range(7)]:
    print(f"num_files: {num_files}")
    for i in range(num_files):
        fs.touch(f"file{i}")
    start = time()
    fs.rm("file0")
    end = time()
    delete_single_file_time[num_files] = end - start
    print(f"delete_single_file_time: {delete_single_file_time[num_files]}")
```

```
num_files: 1
delete_single_file_time: 3.361701965332031e-05
num_files: 10
delete_single_file_time: 2.47955322265625e-05
num_files: 100
delete_single_file_time: 5.435943603515625e-05
num_files: 1000
delete_single_file_time: 0.00043773651123046875
num_files: 10000
delete_single_file_time: 0.003747701644897461
num_files: 100000
delete_single_file_time: 0.03829622268676758
num_files: 1000000
delete_single_file_time: 0.42734217643737793
```

A log-log plot, so that a straight line corresponds to a power law:

(figure: log-log plot of single-file deletion time vs. number of files)

Doing a linear regression (dropping the 1- and 10-file data points), the log-log slope is 0.973 (very close to linear) and the $\log_{10}$ intercept is -6.26, so the time in seconds is about

$$10^{0.973\, \log_{10}(N) - 6.26} = N^{0.973} / 1.8 \times 10^6.$$
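As a sanity check, the fitted power law can be evaluated directly (the constants below are the fitted values from the regression above):

```python
import math

def predicted_single_delete_seconds(n: int) -> float:
    """Fitted power law from the regression: 10^(0.973 * log10(N) - 6.26)."""
    return 10 ** (0.973 * math.log10(n) - 6.26)

# For N = 1,000,000 the fit predicts roughly 0.38 s, in the same
# ballpark as the measured 0.427 s.
print(predicted_single_delete_seconds(10**6))
```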

Deleting all files

```python
from time import time

from fsspec.implementations.memory import MemoryFileSystem

fs = MemoryFileSystem("memory://store")

delete_all_files_time: dict[int, float] = {}
for num_files in [100 * 2 ** i for i in range(9)]:
    print(f"num_files: {num_files}")
    for i in range(num_files):
        fs.touch(f"file{i}")
    start = time()
    fs.rm("/", recursive=True)
    end = time()
    delete_all_files_time[num_files] = end - start
    print(f"delete_all_files_time: {delete_all_files_time[num_files]}")
```

```
num_files: 100
delete_all_files_time: 0.0032181739807128906
num_files: 200
delete_all_files_time: 0.010508298873901367
num_files: 400
delete_all_files_time: 0.034881591796875
num_files: 800
delete_all_files_time: 0.14718341827392578
num_files: 1600
delete_all_files_time: 0.4914524555206299
num_files: 3200
delete_all_files_time: 1.9803588390350342
num_files: 6400
delete_all_files_time: 7.475835800170898
num_files: 12800
delete_all_files_time: 29.763574361801147
num_files: 25600
delete_all_files_time: 119.04087710380554
```

Here we observe a painful 2 minutes to delete 25,600 files.

Regression gives the time in seconds as about

$$10^{1.945\, \log_{10}(N) - 6.5} = N^{1.945} / 3\times 10^6.$$

(figure: log-log plot of delete-all time vs. number of files)
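Evaluating this second fit (constants taken from the regression above) shows the exponent of ~1.945 is consistent with the $O(N^2)$ analysis: $N$ per-file deletions, each doing an $O(N)$ scan.

```python
import math

def predicted_delete_all_seconds(n: int) -> float:
    """Fitted power law from the regression: 10^(1.945 * log10(N) - 6.5)."""
    return 10 ** (1.945 * math.log10(n) - 6.5)

# For N = 25,600 the fit predicts roughly 119 s, matching the
# measured 119.04 s.
print(predicted_delete_all_seconds(25_600))
```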

Thoughts regarding a solution

~~Unfortunately I don't see any obvious simple solution that doesn't increase code complexity. After briefly tinkering with the test suite, it seems like there are some delicate edge cases for which some obvious simplifying tweaks fail.~~

EDIT: Found one in #1725!

By the way, is there some simple and safe way to nuke the contents of a MemoryFileSystem?

maresb commented 1 month ago

> By the way, is there some simple and safe way to nuke the contents of a MemoryFileSystem?

Answering my own question: https://github.com/fsspec/filesystem_spec/blob/176efbe02179f30b5862bf7444a383b8e62f87df/fsspec/conftest.py#L13-L27
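For reference, the linked fixture boils down to clearing the class-level attributes. Note that `store` and `pseudo_dirs` are implementation details of `MemoryFileSystem` (shared across all instances), not public API, so this sketch may break in future versions:

```python
from fsspec.implementations.memory import MemoryFileSystem

def nuke_memory_fs() -> None:
    # MemoryFileSystem keeps all data in class-level attributes, so
    # clearing them wipes the contents for every instance at once
    # (this mirrors what the linked conftest fixture does).
    MemoryFileSystem.store.clear()
    MemoryFileSystem.pseudo_dirs.clear()
    MemoryFileSystem.pseudo_dirs.append("")  # the root entry must remain
```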