catalyst-cooperative / pudl-catalog

An Intake catalog for distributing open energy system data liberated by Catalyst Cooperative.
https://catalyst.coop/pudl/
MIT License

Speed up querying of partitioned Parquet data on GCS #8

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

Using pd.read_parquet()

With pd.read_parquet(), reading from a collection of remote Parquet files over the gcs:// protocol takes twice as long as reading from a single Parquet file, but no comparable slowdown occurs locally:

# Select ~1% of the 800M rows in the dataset, from 6 of 1274 row groups: 
filters = [
    [('year', '=', 2019), ('state', '=', 'ID')],
    [('year', '=', 2019), ('state', '=', 'CO')],
    [('year', '=', 2019), ('state', '=', 'TX')],
    [('year', '=', 2020), ('state', '=', 'ID')],
    [('year', '=', 2020), ('state', '=', 'CO')],
    [('year', '=', 2020), ('state', '=', 'TX')]
]

single_file_local = pd.read_parquet("../data/hourly_emissions_epacems.parquet", filters=filters)
# CPU times: user 2.58 s, sys: 778 ms, total: 3.35 s
# Wall time: 2.23 s

multi_file_local = pd.read_parquet("../data/hourly_emissions_epacems", filters=filters)
# CPU times: user 4.57 s, sys: 1.01 s, total: 5.58 s
# Wall time: 2.67 s

single_file_remote = pd.read_parquet("gcs://catalyst.coop/intake/test/hourly_emissions_epacems.parquet", filters=filters)
# CPU times: user 5.33 s, sys: 1.22 s, total: 6.56 s
# Wall time: 25 s

multi_file_remote = pd.read_parquet("gcs://catalyst.coop/intake/test/hourly_emissions_epacems", filters=filters)
# CPU times: user 16.2 s, sys: 2.61 s, total: 18.8 s
# Wall time: 51.7 s

Using intake_parquet

Even ignoring the nearly 12 minutes of apparent network transfer time, the same query that took 25 seconds with pd.read_parquet() took 3 minutes of CPU time here. I really need to be able to toggle caching on and off before I can experiment further.


# Not sure whether passing empty storage_options actually disabled caching.
# It seems to have re-downloaded the whole dataset and put it... where?
single_file_intake = pudl_cat.hourly_emissions_epacems(
    storage_options={}, filters=filters
).to_dask().compute()
# CPU times: user 2min 17s, sys: 44.2 s, total: 3min 1s
# Wall time: 14min 49s
zaneselvans commented 2 years ago

Consolidating this issue with #4 since I'm pretty sure they're both fixable with external metadata.