Open lucazanna opened 1 year ago
I can't remember especially clearly, but I think I have had some success using limit on Azure before. One thing to bear in mind is that I think Arrow will only read entire row groups, so depending on the row group layout of your Parquet file that might explain things.
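If it helps, something along these lines can show the row-group layout of the file (a rough sketch; the account name and path are placeholders, and I'm assuming adlfs for access):

```python
import adlfs
import pyarrow.parquet as pq

# Placeholder account/path; anonymous access is assumed for a publicly readable file.
fs = adlfs.AzureBlobFileSystem(account_name="myaccount", anon=True)

with fs.open("container/path/to/file.parquet", "rb") as f:
    meta = pq.ParquetFile(f).metadata
    print("row groups:", meta.num_row_groups)
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        print(f"group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")
```

If the file is laid out as one large row group, head(5) still has to read that whole group, which would be consistent with the identical timings.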
Additionally, I have found that adlfs does some kind of prefetching which I've personally found to be counter-productive. When I tested this I was using a native Arrow filesystem implementation, which is very close to being officially available in pyarrow (https://github.com/apache/arrow/issues/39317). The native filesystem should also be generally faster and more reliable.
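As a rough sketch of what the native filesystem looks like from Python, assuming a pyarrow release where the Azure bindings have landed (account name and path are placeholders, and credentials or anonymous access depend on your setup):

```python
import pyarrow.dataset as ds
from pyarrow.fs import AzureFileSystem  # shipped in recent pyarrow releases

# Placeholder account; pass credentials as appropriate for your storage account.
azure = AzureFileSystem(account_name="myaccount")
dataset = ds.dataset("container/path/to/file.parquet", format="parquet", filesystem=azure)
print(dataset.head(5))
```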
Describe the bug, including details regarding any error messages, version, and platform.
I am accessing a Parquet file on Azure Data Lake with the following code (to make the example reproducible, it uses a file that is publicly accessible on Azure).
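A sketch of the access pattern, assuming adlfs and pyarrow.dataset; the account name, container, and path below are placeholders rather than the actual public file:

```python
import time
import adlfs
import pyarrow.dataset as ds

# Placeholder account/path; the real example points at a publicly readable file.
fs = adlfs.AzureBlobFileSystem(account_name="myaccount", anon=True)
dataset = ds.dataset("container/path/to/file.parquet", format="parquet", filesystem=fs)

start = time.time()
full = dataset.to_table()        # read the whole file
print("full file:", time.time() - start)

start = time.time()
head = dataset.head(5)           # read only the first 5 rows
print("head(5):  ", time.time() - start)
```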
The processing time is the same whether I access the full file or only the first 5 rows. I would expect the 5-row read to be faster. I am also not sure what the difference is between .scanner().head() and .head().
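For reference, the two variants look like this (continuing from the dataset handle in the sketch above; as far as I can tell head() is just a convenience around a default scanner, but I may be wrong):

```python
# Both return a pyarrow.Table with the first 5 rows; head() also accepts
# scanner options such as columns=..., so my assumption is that it simply
# builds a default scanner internally.
t1 = dataset.head(5)
t2 = dataset.scanner().head(5)
```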
Regarding columns: reducing the number of columns retrieved does speed up the query, but the reduction seems small. For example, selecting only 2 columns out of 21 brings the query down to about 2 min 7 s, roughly as sketched below.
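The column-pruned read looks roughly like this (same placeholder setup as above; "col_a" and "col_b" stand in for two of the 21 column names):

```python
import time
import adlfs
import pyarrow.dataset as ds

# Placeholder account/path, as in the earlier sketch.
fs = adlfs.AzureBlobFileSystem(account_name="myaccount", anon=True)
dataset = ds.dataset("container/path/to/file.parquet", format="parquet", filesystem=fs)

start = time.time()
two_cols = dataset.to_table(columns=["col_a", "col_b"])  # placeholder column names
print("2 of 21 columns:", time.time() - start)
```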
I would have expected collecting 10% of the columns (2 columns instead of 21) to reduce the time by more than half, unless there is a fixed overhead per query when reading from Azure?
Sources for the code:
Thank you for the outstanding job on the Arrow library.
Component(s)
Python