apache / arrow


[Python] No limit pushdown for scanning Parquet on Azure #34608

Open lucazanna opened 1 year ago

lucazanna commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

I am accessing a Parquet file on Azure Data Lake with the following code (to make the example reproducible, it uses a file that is publicly accessible on Azure):

import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

abfs_public = AzureBlobFileSystem(
    account_name="azureopendatastorage")

dataset_public = ds.dataset('az://nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet', filesystem=abfs_public)

The processing time is roughly the same whether I access the full file or only the first 5 rows:

dataset_public.head(5)
# 5min 11s

dataset_public.to_table()
# 5min 30s

dataset_public.scanner().head(5)
# 5min 43s

I would expect fetching 5 rows to take much less time. I am also not sure what the difference is between .scanner().head() and .head().

Regarding columns: reducing the number of columns retrieved does speed up the query, but the improvement seems small. For example, selecting only 2 columns out of 21 brings the query down to 2min 7s:

dataset_public.scanner(columns=['vendorID','passengerCount']).to_table()
# 2min 7s

I would have expected that collecting about 10% of the columns (2 instead of 21) would reduce the time by more than half. Or is there a fixed overhead per query when collecting from Azure?
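If the goal is just a quick preview, one thing worth trying is streaming record batches and stopping after the first one, so that in principle only the leading batches are fetched instead of the whole file. A minimal sketch reusing dataset_public from above; whether this actually saves time depends on the file's row-group layout and on how adlfs fetches data:

# Stream record batches lazily and stop after the first one,
# instead of materializing the whole table with to_table().
scanner = dataset_public.scanner(columns=['vendorID', 'passengerCount'])
for batch in scanner.to_batches():
    print(batch.num_rows)  # inspect just the first batch
    break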


Thank you for the outstanding work on the Arrow library.

Component(s)

Python

Tom-Newton commented 9 months ago

I can't remember especially clearly, but I think I have had some success using limits on Azure before. One thing to bear in mind is that, as far as I know, Arrow will only read entire row groups, so depending on the row-group layout of your Parquet file, that might explain things.
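A quick way to check whether row-group granularity explains the timings is to read the file's footer metadata directly. A sketch using pyarrow.parquet with the same adlfs filesystem object from the original report; constructing ParquetFile only reads the footer, and read_row_group fetches a single row group:

import pyarrow.parquet as pq

path = 'az://nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet'
with abfs_public.open(path) as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata.num_row_groups)         # how many row groups the file has
    print(pf.metadata.row_group(0).num_rows)  # rows in the first row group
    # Read only the first row group, projecting two columns.
    first = pf.read_row_group(0, columns=['vendorID', 'passengerCount'])

If the file is a single large row group, head(5) cannot do much better than reading everything.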

Additionally, I have found that adlfs does some kind of prefetching, which in my experience is counterproductive. When I tested this I was using a native Arrow filesystem implementation, which is very close to being officially available in pyarrow: https://github.com/apache/arrow/issues/39317. The native filesystem should also be generally faster and more reliable.
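For reference, once the bindings from that issue are available, using the native filesystem would look roughly like this. This is only a sketch: it assumes a pyarrow build that ships pyarrow.fs.AzureFileSystem, and the constructor arguments (including how anonymous access to a public account is configured) may differ between versions:

import pyarrow.dataset as ds
from pyarrow import fs

# Assumes a pyarrow build with the native Azure filesystem (see the
# issue above); credentials/arguments may differ between versions.
azure_fs = fs.AzureFileSystem(account_name='azureopendatastorage')

# Native-filesystem paths are 'container/path', without the az:// prefix.
dataset_native = ds.dataset(
    'nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet',
    filesystem=azure_fs,
)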