Urban-Analytics-Technology-Platform / popgetter-cli

A rust library and CLI for accessing popgetter data
0 stars 0 forks source link

Functions for reading metrics remotely from Azure #16

Closed stuartlynn closed 2 months ago

stuartlynn commented 2 months ago

This PR adds a function that takes a list of MetricRequests and fetches the data from cloud storage over http range requests in an efficient manner.

let metrics  = [
    MetricRequest{
        file:"https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet".into(), 
        column:"B17021_E006".into() 
   }];
   let df = get_metrics(&metrics, None);

This generates the following results

test metadata::tests::test_loading_metadata_from_url ... ok
Ok(
    shape: (74_001, 2)
    ┌─────────────┬──────────────────────┐
    │ B17021_E006 ┆ GEO_ID               │
    │ ---         ┆ ---                  │
    │ i64         ┆ str                  │
    ╞═════════════╪══════════════════════╡
    │ 0           ┆ 1400000US01001020100 │
    │ 0           ┆ 1400000US01001020200 │
    │ 11          ┆ 1400000US01001020300 │
    │ 10          ┆ 1400000US01001020400 │
    │ 0           ┆ 1400000US01001020500 │
    │ …           ┆ …                    │
    │ 12          ┆ 1400000US72153750501 │
    │ 0           ┆ 1400000US72153750502 │
    │ 19          ┆ 1400000US72153750503 │
    │ 0           ┆ 1400000US72153750601 │
    │ 0           ┆ 1400000US72153750602 │
    └─────────────┴──────────────────────┘,
)

There is also a way to filter by GEOIDs as we do so

        let metrics  = [
            MetricRequest{
                file:"https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet".into(), 
                column:"B17021_E006".into() 
            }];
        let df = get_metrics(
            &metrics,
            Some(&["1400000US01001020100".into(), "1400000US01001020300".into()]),
        );

which gives the result

Ok(
    shape: (2, 2)
    ┌─────────────┬──────────────────────┐
    │ B17021_E006 ┆ GEO_ID               │
    │ ---         ┆ ---                  │
    │ i64         ┆ str                  │
    ╞═════════════╪══════════════════════╡
    │ 0           ┆ 1400000US01001020100 │
    │ 11          ┆ 1400000US01001020300 │
    └─────────────┴──────────────────────┘,
)

TODO

stuartlynn commented 2 months ago

Some benchmarks

Without geo filtering

Query plan

 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS
Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      3.164 s ±  0.284 s    [User: 0.407 s, System: 0.159 s]
  Range (min … max):    2.684 s …  3.447 s    10 runs

With geo filtering

Query plan

FILTER col("GEO_ID").is_in([Series[geo_ids]]) FROM
 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS
Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      7.296 s ±  0.312 s    [User: 4.364 s, System: 0.182 s]
  Range (min … max):    6.866 s …  8.064 s    10 runs

This is a bit weird and I am wondering if the issue is the large header for this file (which has about 7000 columns). Perhaps revisit this once we have the data split in to multiple smaller parquet files.

yongrenjie commented 2 months ago

Broadly: I'm wondering about use cases. Is there a situation where we want to get the same metric for different geometries (e.g. maybe different countries)? In that case would it be fair to say that it is the user's responsibility to call get_metrics() multiple times for each geometry and concatenate the tables themselves?