fsspec / adlfs

fsspec-compatible Azure Datalake and Azure Blob Storage access
BSD 3-Clause "New" or "Revised" License

Glob function is slow & inefficient #388

Open JoranDox opened 1 year ago

JoranDox commented 1 year ago

Hi, like the title says, the glob function (https://github.com/fsspec/adlfs/blob/main/adlfs/spec.py#L576) is slow and inefficient in specific use cases (especially when using Azure Data Lake Gen2's hierarchical namespace), because it stops matching at the first * and just requests everything below that point.

In our use case, we've got data structured with a few columns which are often used to filter, then followed by year/month/day(/hour).

So if we have files like these for about 3 years, with the filter columns having a cardinality of 5 values (as an example), with 1 file per hour:

some/common/prefix/filtercol1=somevalue1/filtercol2=somevalue2/year=2023/month=01/day=12/hour=01/0.parquet

and then want to query all files from a specific day and filtercol2 value, but we don't care about the filtercol1 value, like so:

some/common/prefix/filtercol1=*/filtercol2=somevalue2/year=2023/month=01/day=12/hour=*/*.parquet

We would like to only have to check 5 * 1 * 1 * 1 * 24 = 120 files.

But instead, the implementation of the glob function checks 5 * 5 * 3 * 365 * 24 = 657000 files (as far as we understand).
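To make the gap concrete, the two counts from this post can be recomputed in a few lines. The factor lists are copied from the figures in the post; the variable names are only illustrative.

```python
from math import prod

# Factors as given in the post: entries that need checking when every
# path segment is filtered while walking down the tree.
targeted_factors = [5, 1, 1, 1, 24]

# Factors as given in the post: entries enumerated when everything below
# the first wildcard is listed and filtered afterwards.
full_listing_factors = [5, 5, 3, 365, 24]

targeted = prod(targeted_factors)
full_listing = prod(full_listing_factors)

print(f"targeted: {targeted}")
print(f"full listing: {full_listing}")
print(f"ratio: {full_listing // targeted}x")
```

With these example cardinalities the full listing touches several thousand times more entries than the targeted walk.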

We've got a working example that does none of the fancy stuff of the current _glob function, but saves us about 50 euros per day (difference seen in Azure Cost Management) just by not listing all the files all the time, not to mention the difference in speed: https://github.com/mh-data-science/adlfs/blob/master/adlfs/spec.py#L689

Known issue: we don't support ** this way (i.e. * is only valid between two / characters, since we're traversing the actual folders), and probably not a lot of the kwargs stuff or other fancy things either.
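For readers who want the idea without opening the link, here is a minimal sketch of a folder-by-folder glob. It is not the code from the linked fork: `glob_by_segment`, `make_lister`, and the `list_dir` callable are hypothetical names, and the in-memory lister only stands in for a real directory listing call. It assumes `list_dir(path)` returns the child names of `path`, or nothing if `path` is a file or does not exist.

```python
from fnmatch import fnmatchcase
from typing import Callable, Iterable, List, Tuple


def glob_by_segment(pattern: str, list_dir: Callable[[str], Iterable[str]]) -> List[str]:
    """Expand a glob one path segment at a time, listing only surviving folders."""
    if "**" in pattern:
        # Same limitation as described above: wildcards never cross a "/".
        raise ValueError("'**' is not supported by this sketch")
    candidates = [""]  # "" stands for the root of the listing
    for segment in pattern.strip("/").split("/"):
        next_candidates = []
        for parent in candidates:
            # One listing call per surviving parent, filtered immediately.
            for name in list_dir(parent):
                if fnmatchcase(name, segment):
                    next_candidates.append(f"{parent}/{name}" if parent else name)
        candidates = next_candidates
    return candidates


def make_lister(paths: Iterable[str]) -> Tuple[Callable[[str], List[str]], List[str]]:
    """Hypothetical in-memory stand-in for a storage listing call; records each call."""
    children = {}
    for path in paths:
        parts = path.split("/")
        for i, part in enumerate(parts):
            children.setdefault("/".join(parts[:i]), set()).add(part)
    calls: List[str] = []

    def list_dir(path: str) -> List[str]:
        calls.append(path)
        return sorted(children.get(path, ()))

    return list_dir, calls
```

Literal segments are still listed here so that non-existent paths drop out; a real implementation could skip those listings and check existence once at the end. Passing the second value returned by `make_lister` to `len()` shows how many listing calls a given pattern costs.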

So questions:

Tom-Newton commented 8 months ago

I'm pretty sure there has been a big regression in the performance of glob at some point. For one of our use cases:

adlfs==2022.11.0: 2 seconds
adlfs==2024.1.0: After about 3 hours it fails and never completes.