fsspec / gcsfs

Pythonic file-system interface for Google Cloud Storage
http://gcsfs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

`fs.isdir` latency 200x slower beginning with version 2023.09.01 #611

Closed by chotzen 6 months ago

chotzen commented 6 months ago

Hello,

I'm running the following code to measure the latency of isdir:

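import time

from gcsfs import GCSFileSystem
from tqdm import tqdm

# Authenticate with Google default credentials; the project ID is elided here.
fs = GCSFileSystem(
    token="google_default", project="..."
)

datadir = "<path to GCS directory with thousands of files>"
# Take the first 500 entries of the listing and prefix each with the protocol.
filenames = ["gs://" + x["name"] for x in fs.listdir(datadir)][:500]
times = []
for f in tqdm(filenames):
    # perf_counter is a monotonic, high-resolution clock suited to short intervals.
    begin = time.perf_counter()

    fs.isdir(f)

    end = time.perf_counter()
    times.append(end - begin)

print("Average time: ", sum(times) / len(times))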

In version 2023.09.01, the average time per fs.isdir() call is 0.05 seconds. In version 2023.09.00, the average time is 0.0001 seconds. This causes a significant slowdown (from 2 seconds to several minutes) when multiplied by the thousands of files in our GCS directory.
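To compare the two versions on more than the mean, the measurement can be wrapped in a small helper. This is only a sketch using the standard library; `time_calls` and `summarize` are hypothetical names introduced here, not part of gcsfs or fsspec.

```python
import statistics
import time


def time_calls(func, args_list):
    """Call func(arg) once per arg and return the per-call durations in seconds."""
    durations = []
    for arg in args_list:
        start = time.perf_counter()
        func(arg)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return count, mean, median and max of a non-empty list of durations."""
    return {
        "count": len(durations),
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

With the objects from the snippet above, it would be called as `summarize(time_calls(fs.isdir, filenames))`. The median helps tell a uniform per-call cost (as expected if every call hits the network) apart from a few slow outliers.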

Thank you for your help, Devin

martindurant commented 6 months ago

I can confirm that the directory cache is not working correctly - will look into it tomorrow.
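For anyone affected while the cache is broken, one possible workaround is to classify paths from a single listing instead of calling `fs.isdir` per path. The sketch below assumes the entries returned by `fs.listdir` are the usual fsspec detail dicts with `"name"` and `"type"` keys, and that names come back without the `gs://` prefix (as the reproduction code implies). `directories_from_listing` and `strip_protocol` are hypothetical helpers, not gcsfs functions.

```python
def strip_protocol(path):
    """Drop a leading gs:// or gcs:// so the path matches listing names."""
    for prefix in ("gs://", "gcs://"):
        if path.startswith(prefix):
            return path[len(prefix):]
    return path


def directories_from_listing(entries):
    """Return the set of names whose listing entry is marked as a directory."""
    return {
        entry["name"].rstrip("/")
        for entry in entries
        if entry.get("type") == "directory"
    }
```

Usage would look like `dirs = directories_from_listing(fs.listdir(datadir))` followed by `strip_protocol(f) in dirs` for each path. This only covers the direct children of `datadir`, so it is not a general replacement for `fs.isdir`.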