stuartlynn commented 2 months ago

This PR adds a function that takes a list of MetricRequests and fetches the data from cloud storage over http range requests in an efficient manner.

let metrics  = [
    MetricRequest{
        file:"https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet".into(), 
        column:"B17021_E006".into() 
   }];
   let df = get_metrics(&metrics, None);

This generates the following results

test metadata::tests::test_loading_metadata_from_url ... ok
Ok(
    shape: (74_001, 2)
    ┌─────────────┬──────────────────────┐
    │ B17021_E006 ┆ GEO_ID               │
    │ ---         ┆ ---                  │
    │ i64         ┆ str                  │
    ╞═════════════╪══════════════════════╡
    │ 0           ┆ 1400000US01001020100 │
    │ 0           ┆ 1400000US01001020200 │
    │ 11          ┆ 1400000US01001020300 │
    │ 10          ┆ 1400000US01001020400 │
    │ 0           ┆ 1400000US01001020500 │
    │ …           ┆ …                    │
    │ 12          ┆ 1400000US72153750501 │
    │ 0           ┆ 1400000US72153750502 │
    │ 19          ┆ 1400000US72153750503 │
    │ 0           ┆ 1400000US72153750601 │
    │ 0           ┆ 1400000US72153750602 │
    └─────────────┴──────────────────────┘,
)

There is also a way to filter by GEOIDs as we do so

        let metrics  = [
            MetricRequest{
                file:"https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet".into(), 
                column:"B17021_E006".into() 
            }];
        let df = get_metrics(
            &metrics,
            Some(&["1400000US01001020100".into(), "1400000US01001020300".into()]),
        );

which gives the result

Ok(
    shape: (2, 2)
    ┌─────────────┬──────────────────────┐
    │ B17021_E006 ┆ GEO_ID               │
    │ ---         ┆ ---                  │
    │ i64         ┆ str                  │
    ╞═════════════╪══════════════════════╡
    │ 0           ┆ 1400000US01001020100 │
    │ 11          ┆ 1400000US01001020300 │
    └─────────────┴──────────────────────┘,
)

TODO

[x] There is one place where we use an unwrap in the code. This could potentially lead to a panic but I am like 99% sure that panic is impossible from the previous code. It's probably worth trying to remove the unwrap and handle the result explicitly even if it is impossible.
[ ] The looks like the geo filtering version of the code is slower than the non geo filtering version. This is a bit counter intuitive so I want to properly benchmark it to see if that's true and try and figure out why. Opened ticket to follow up on this #17
[x] Some code ergonomics could probably be improved here. Lots of .into's for Strings which I can probably make go away

stuartlynn commented 2 months ago

Some benchmarks

Without geo filtering

Query plan

 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS

Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      3.164 s ±  0.284 s    [User: 0.407 s, System: 0.159 s]
  Range (min … max):    2.684 s …  3.447 s    10 runs

With geo filtering

Query plan

FILTER col("GEO_ID").is_in([Series[geo_ids]]) FROM
 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS

Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      7.296 s ±  0.312 s    [User: 4.364 s, System: 0.182 s]
  Range (min … max):    6.866 s …  8.064 s    10 runs

This is a bit weird and I am wondering if the issue is the large header for this file (which has about 7000 columns). Perhaps revisit this once we have the data split in to multiple smaller parquet files.

yongrenjie commented 2 months ago

Broadly: I'm wondering about use cases. Is there a situation where we want to get the same metric for different geometries (e.g. maybe different countries)? In that case would it be fair to say that it is the user's responsibility to call get_metrics() multiple times for each geometry and concatenate the tables themselves?

Urban-Analytics-Technology-Platform / popgetter-cli

Functions for reading metrics remotely from Azure #16

TODO

Without geo filtering

With geo filtering