Hi, like the title says, the glob function (https://github.com/fsspec/adlfs/blob/main/adlfs/spec.py#L576) is slow and inefficient in specific use cases (especially when using Azure Data Lake Gen2's hierarchical namespace), because from the first * onward it stops matching per segment and simply lists everything under the prefix, filtering afterwards.
In our use case, the data is partitioned by a few columns that are often used for filtering, followed by year/month/day(/hour).
So suppose we have files like the one below covering about 3 years, with each filter column having a cardinality of 5 values (as an example) and 1 file per hour:
some/common/prefix/filtercol1=somevalue1/filtercol2=somevalue2/year=2023/month=01/day=12/hour=01/0.parquet
and we then want to query all files for a specific day and filtercol2 value, but don't care about the filtercol1 value, like so:
some/common/prefix/filtercol1=*/filtercol2=somevalue2/year=2023/month=01/day=12/hour=*/*.parquet
We would like to only have to check 5 × 1 × 1 × 1 × 24 = 120 files.
But instead, the implementation of the glob function checks 5 × 5 × 3 × 365 × 24 = 657,000 files (as far as we understand).
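For concreteness, the two counts above can be spelled out (using the example cardinalities: 5 values per filter column, 3 years of data, 1 file per hour):

```python
# Segment-wise matching: only the wildcard segments fan out.
# filtercol1=* (5) x filtercol2=somevalue2 (1) x year (1) x month (1) x day (1 day) x hour=* (24)
targeted = 5 * 1 * 1 * 1 * 24

# Listing everything after the first *: every combination under the prefix.
# filtercol1 (5) x filtercol2 (5) x ~3 years of days (3 * 365) x 24 hours
full_listing = 5 * 5 * 3 * 365 * 24

print(targeted, full_listing)  # 120 657000
```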
We've got a working example that does none of the fancy things of the current _glob function, but saves us about 50 euros per day (difference seen in Azure Cost Management) just by not listing all the files all the time: https://github.com/mh-data-science/adlfs/blob/master/adlfs/spec.py#L689, not to mention the difference in speed.
Known issue: we don't support ** this way (i.e. * is only valid between two / characters, since we're traversing the actual folders), and probably not much of the kwargs handling or other fancy features either.
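The traversal idea described above can be sketched roughly like this (against a local filesystem for illustration; `glob_by_segments` is a hypothetical helper, not the adlfs implementation, and it shares the limitation that wildcards are only valid within a single path segment):

```python
import fnmatch
import os


def glob_by_segments(root, pattern):
    """Expand a glob pattern one path segment at a time.

    Only directories matching the current segment are listed, so a
    wildcard early in the pattern does not force listing the whole
    tree. ``*`` is only valid within a single segment; ``**`` is not
    supported.
    """
    candidates = [root]
    for seg in pattern.split("/"):
        next_candidates = []
        for base in candidates:
            if not any(ch in seg for ch in "*?["):
                # Literal segment: no listing needed, just check existence.
                path = os.path.join(base, seg)
                if os.path.exists(path):
                    next_candidates.append(path)
            elif os.path.isdir(base):
                # Wildcard segment: list only this directory and match names.
                for name in os.listdir(base):
                    if fnmatch.fnmatch(name, seg):
                        next_candidates.append(os.path.join(base, name))
        candidates = next_candidates
        if not candidates:
            break
    return sorted(candidates)
```

On a blob store the `os.listdir`/`os.path.exists` calls would become single-level list operations, which is where the cost saving comes from: literal segments need no listing at all, and each wildcard segment triggers one listing per surviving candidate instead of one listing of the entire subtree.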
So questions:
Are we correct in our understanding of the current _glob function?
If yes, what do we need to do to get this conceptual improvement into the main branch?
Feedback on our glob code and on how to add ** support is always welcome, especially if a pull request is expected.
I'm pretty sure there has been a big regression in the performance of glob at some point. For one of our use cases:
adlfs==2022.11.0: 2 seconds
adlfs==2024.1.0: fails after about 3 hours and never completes.