fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License
1.04k stars 362 forks source link

Fix performance issues with MemoryFileSystem.rm #1725

Closed maresb closed 1 month ago

maresb commented 1 month ago

Closes #1724.

The idea is that the nominal case is deleting an existing file, and verifying that a path corresponds to a file is an extremely fast key-existence check on the underlying store (dict). Thus by checking first if it's a file we escape the expensive exists check, which must scan all keys to check if the path is a prefix directory for some file.

Now the time to write a single file is roughly constant (instead of linear in the number of files), while the time to remove all files scales linearly with the number of files (instead of quadratically).

Before:

image

After:

image

Before:

image

$$10^{1.945\, \log_{10}(N) - 6.5} = N^{1.945} / 3\times 10^6.$$

After:

image

$$10^{0.94\, \log_{10}(N) - 5} = N^{0.945} / \times 10^5.$$

martindurant commented 1 month ago

A consequence of MemoryFS almost exclusively being used in testing for at most 100s of files.