Closed stuartlynn closed 2 months ago
Some benchmarks
Query plan:

```
SELECT [col("B17021_E006"), col("GEO_ID")] FROM
  Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
  PROJECT */25318 COLUMNS
```

```
Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      3.164 s ±  0.284 s    [User: 0.407 s, System: 0.159 s]
  Range (min … max):    2.684 s …  3.447 s    10 runs
```
Query plan:

```
FILTER col("GEO_ID").is_in([Series[geo_ids]]) FROM
  SELECT [col("B17021_E006"), col("GEO_ID")] FROM
    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS
```

```
Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      7.296 s ±  0.312 s    [User: 4.364 s, System: 0.182 s]
  Range (min … max):    6.866 s …  8.064 s    10 runs
```
This is a bit weird, and I am wondering if the issue is the large header for this file (which has about 7000 columns). Perhaps revisit this once we have the data split into multiple smaller parquet files.
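To see why a large header matters for range-request reads: a parquet file ends with its footer metadata, a 4-byte little-endian footer length, and the magic bytes "PAR1". A remote reader must first fetch the tail of the file, decode the length, then fetch and parse the whole footer before reading any column, so a file with thousands of columns pays a fixed metadata cost on every query. The sketch below is illustrative only (real readers use a parquet library, not hand parsing):

```rust
/// Decode the parquet footer length from the last bytes of a file.
/// Layout at end of file: [footer bytes][u32 LE footer length]["PAR1"].
fn footer_length(tail: &[u8]) -> Option<u32> {
    // Need at least the 4-byte length plus the 4-byte magic.
    if tail.len() < 8 || &tail[tail.len() - 4..] != b"PAR1" {
        return None;
    }
    let len_bytes: [u8; 4] = tail[tail.len() - 8..tail.len() - 4].try_into().ok()?;
    Some(u32::from_le_bytes(len_bytes))
}

fn main() {
    // Synthetic tail: a 10-byte "footer", then its length, then the magic.
    let mut tail = vec![0u8; 10];
    tail.extend_from_slice(&10u32.to_le_bytes());
    tail.extend_from_slice(b"PAR1");
    assert_eq!(footer_length(&tail), Some(10));
}
```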
Broadly: I'm wondering about use cases. Is there a situation where we want to get the same metric for different geometries (e.g. maybe different countries)? In that case, would it be fair to say that it is the user's responsibility to call get_metrics() multiple times, once for each geometry, and concatenate the tables themselves?
This PR adds a function that takes a list of MetricRequests and fetches the data from cloud storage over HTTP range requests in an efficient manner.
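As a rough sketch of the shape such a function might take (the struct fields and helper name here are hypothetical, not the PR's actual API): grouping the requested metrics by source file means each parquet file needs only one scan, which keeps the number of range-request round trips down.

```rust
use std::collections::HashMap;

/// Hypothetical request for a single metric column
/// (illustrative only, not the PR's actual types).
#[derive(Clone, Debug)]
struct MetricRequest {
    /// Column to read, e.g. "B17021_E006".
    column: String,
    /// Parquet file the column lives in.
    file_url: String,
}

/// Group requests by file so each parquet file is scanned once,
/// selecting all requested columns in a single pass.
fn group_by_file(requests: &[MetricRequest]) -> HashMap<String, Vec<String>> {
    let mut by_file: HashMap<String, Vec<String>> = HashMap::new();
    for req in requests {
        by_file
            .entry(req.file_url.clone())
            .or_default()
            .push(req.column.clone());
    }
    by_file
}

fn main() {
    let requests = vec![
        MetricRequest {
            column: "B17021_E006".into(),
            file_url: "tracts_2019_fiveYear.parquet".into(),
        },
        MetricRequest {
            column: "GEO_ID".into(),
            file_url: "tracts_2019_fiveYear.parquet".into(),
        },
    ];
    let grouped = group_by_file(&requests);
    // Both columns map to the same file, so one scan suffices.
    assert_eq!(grouped.len(), 1);
    assert_eq!(grouped["tracts_2019_fiveYear.parquet"].len(), 2);
}
```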
This generates the first query plan and benchmark shown above. There is also a way to filter by GEO_IDs as part of the same fetch, which gives the second query plan and benchmark above.
TODO
It looks like the geo-filtering version of the code is slower than the non-filtering version. This is a bit counterintuitive, so I want to benchmark it properly to see whether that's true and figure out why. Opened a ticket to follow up on this: #17
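One observation for the follow-up: the user-time jump (0.407 s to 4.364 s) suggests the extra cost is CPU spent on the filter itself rather than on the download. As a generic illustration of the filtering step (nothing here is popgetter's actual code; the PR filters via polars' is_in expression), a hash-set membership test keeps the per-row cost constant regardless of how many GEO_IDs were requested:

```rust
use std::collections::HashSet;

/// Keep only rows whose GEO_ID is in the requested set.
/// Using a HashSet makes each row an O(1) lookup, rather than a
/// linear scan over the requested ids per row.
/// (Illustrative sketch only, not the PR's implementation.)
fn filter_by_geo_id<'a>(
    rows: &'a [(String, f64)],
    geo_ids: &[&str],
) -> Vec<&'a (String, f64)> {
    let wanted: HashSet<&str> = geo_ids.iter().copied().collect();
    rows.iter()
        .filter(|(geo_id, _)| wanted.contains(geo_id.as_str()))
        .collect()
}

fn main() {
    let rows = vec![
        ("1400000US01001020100".to_string(), 123.0),
        ("1400000US01001020200".to_string(), 456.0),
        ("1400000US01001020300".to_string(), 789.0),
    ];
    let kept = filter_by_geo_id(&rows, &["1400000US01001020200"]);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].1, 456.0);
}
```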