apache / arrow


[Python] No limit pushdown for scanning Parquet on Azure #34608

Open lucazanna opened 1 year ago

lucazanna commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

I am accessing a Parquet file on Azure Data Lake with the following code (to make the example reproducible, it uses a file that is publicly accessible on Azure):

import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

abfs_public = AzureBlobFileSystem(
    account_name="azureopendatastorage")

dataset_public = ds.dataset('az://nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet', filesystem=abfs_public)

The processing time is roughly the same whether I access the full file or only the first 5 rows:

dataset_public.head(5)
# 5min 11s

dataset_public.to_table()
# 5min 30s

dataset_public.scanner().head(5)
# 5min 43s

I would expect fetching 5 rows to take much less time. I am also not sure what the difference is between .scanner().head() and .head().

Regarding columns: reducing the number of columns retrieved does speed up the query, but the improvement seems small. For example, selecting only 2 columns out of 21 brings the query down to 2min 7s:

dataset_public.scanner(columns=['vendorID','passengerCount']).to_table()
# 2min 7s

I would have expected that collecting about 10% of the columns (2 instead of 21) would reduce the time by more than half. Or is there a fixed overhead per query when collecting from Azure?
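If the goal is just a quick preview, one thing worth trying is streaming record batches and stopping after the first one, so that in principle only the leading batches are fetched instead of the whole file. A minimal sketch reusing dataset_public from above; whether this actually saves time depends on the file's row-group layout and on how adlfs fetches data:

# Stream record batches lazily and stop after the first one,
# instead of materializing the whole table with to_table().
scanner = dataset_public.scanner(columns=['vendorID', 'passengerCount'])
for batch in scanner.to_batches():
    print(batch.num_rows)  # inspect just the first batch
    break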


Thank you for the outstanding work on the Arrow library.

Component(s)

Python

Tom-Newton commented 9 months ago

I can't remember especially clearly, but I think I have had some success using limits on Azure before. One thing to bear in mind is that, as far as I know, Arrow will only read entire row groups, so depending on the row-group layout of your Parquet file, that might explain things.
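A quick way to check whether row-group granularity explains the timings is to read the file's footer metadata directly. A sketch using pyarrow.parquet with the same adlfs filesystem object from the original report; constructing ParquetFile only reads the footer, and read_row_group fetches a single row group:

import pyarrow.parquet as pq

path = 'az://nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet'
with abfs_public.open(path) as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata.num_row_groups)         # how many row groups the file has
    print(pf.metadata.row_group(0).num_rows)  # rows in the first row group
    # Read only the first row group, projecting two columns.
    first = pf.read_row_group(0, columns=['vendorID', 'passengerCount'])

If the file is a single large row group, head(5) cannot do much better than reading everything.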

Additionally, I have found that adlfs does some kind of prefetching, which in my experience is counterproductive. When I tested this I was using a native Arrow filesystem implementation, which is very close to being officially available in pyarrow: https://github.com/apache/arrow/issues/39317. The native filesystem should also be generally faster and more reliable.
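For reference, once the bindings from that issue are available, using the native filesystem would look roughly like this. This is only a sketch: it assumes a pyarrow build that ships pyarrow.fs.AzureFileSystem, and the constructor arguments (including how anonymous access to a public account is configured) may differ between versions:

import pyarrow.dataset as ds
from pyarrow import fs

# Assumes a pyarrow build with the native Azure filesystem (see the
# issue above); credentials/arguments may differ between versions.
azure_fs = fs.AzureFileSystem(account_name='azureopendatastorage')

# Native-filesystem paths are 'container/path', without the az:// prefix.
dataset_native = ds.dataset(
    'nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet',
    filesystem=azure_fs,
)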