catalyst-cooperative / pudl-catalog

An Intake catalog for distributing open energy system data liberated by Catalyst Cooperative.
https://catalyst.coop/pudl/
MIT License

Speed up querying of partitioned Parquet data on GCS #8

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

Using pd.read_parquet()

With pd.read_parquet(), reading from a collection of remote Parquet files over the gcs:// protocol takes twice as long as reading from a single Parquet file, but no comparable slowdown occurs locally:

# Select ~1% of the 800M rows in the dataset, from 6 of 1274 row groups: 
filters = [
    [('year', '=', 2019), ('state', '=', 'ID')],
    [('year', '=', 2019), ('state', '=', 'CO')],
    [('year', '=', 2019), ('state', '=', 'TX')],
    [('year', '=', 2020), ('state', '=', 'ID')],
    [('year', '=', 2020), ('state', '=', 'CO')],
    [('year', '=', 2020), ('state', '=', 'TX')]
]

single_file_local = pd.read_parquet("../data/hourly_emissions_epacems.parquet", filters=filters)
# CPU times: user 2.58 s, sys: 778 ms, total: 3.35 s
# Wall time: 2.23 s

multi_file_local = pd.read_parquet("../data/hourly_emissions_epacems", filters=filters)
# CPU times: user 4.57 s, sys: 1.01 s, total: 5.58 s
# Wall time: 2.67 s

single_file_remote = pd.read_parquet("gcs://catalyst.coop/intake/test/hourly_emissions_epacems.parquet", filters=filters)
# CPU times: user 5.33 s, sys: 1.22 s, total: 6.56 s
# Wall time: 25 s

multi_file_remote = pd.read_parquet("gcs://catalyst.coop/intake/test/hourly_emissions_epacems", filters=filters)
# CPU times: user 16.2 s, sys: 2.61 s, total: 18.8 s
# Wall time: 51.7 s

Using intake_parquet

Even ignoring the nearly 12 minutes of apparent network transfer time, the same query that took 25 seconds with pd.read_parquet() took 3 minutes of CPU time here. I really need to be able to toggle caching on and off before I can experiment further.


# Not sure whether passing empty storage_options actually disabled caching.
# It seems to have re-downloaded the whole dataset and put it... where?
single_file_intake = pudl_cat.hourly_emissions_epacems(
    storage_options={}, filters=filters
).to_dask().compute()
# CPU times: user 2min 17s, sys: 44.2 s, total: 3min 1s
# Wall time: 14min 49s
zaneselvans commented 2 years ago

Consolidating this issue with #4 since I'm pretty sure they're both fixable with external metadata.