bwlewis / duckdb_and_r

My thoughts and examples on DuckDB and R

Example of this point. #3

Open iangow opened 8 months ago

iangow commented 8 months ago

https://github.com/bwlewis/duckdb_and_r/blob/d4de43f682719fae54274887d7d01993f53c3e4c/tpch/tpch.rmd#L111-L112

I think this is very much true. I have an example here where, even in expert hands, the SQL is "inaccurate" but the dplyr-generated query is not.

(I apologise for polluting your repository with issues. Read these as comments on your blog posts.)
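
For concreteness, a minimal sketch of what "dplyr-generated query" means in practice (mtcars and DuckDB are just stand-ins, not the example from the linked notes; assumes duckdb and dbplyr are installed): write the pipeline in dplyr, let dbplyr emit the SQL, and inspect it with show_query().

library(dplyr)
library(DBI)

con <- dbConnect(duckdb::duckdb())
mtcars_db <- copy_to(con, mtcars)

# dbplyr translates the pipeline to SQL; show_query() prints the generated query
mtcars_db |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  show_query()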

bwlewis commented 8 months ago

I enjoyed your notes on the Tanimura book (which I do not know)! I found all the code (SQL and R) to be a bit over my head. I think that the question itself is very badly worded, which makes answering it consistently difficult :(

Here are some notes starting from scratch as best I could, using the question as you quoted it and base R. Of course I get yet another different answer!

# Notes on https://github.com/iangow/notes/blob/main/ctes.pdf
#
# Here is the question as quoted. (I don't have access to the original version.)
#
# "What share of [legislators] start as representatives and go on to become
# senators?  (Some senators later become representatives, but that is much less
# common.) Since relatively few make this transition, we’ll cohort legislators
# by the century in which they first became a representative"
#
# The question is awkwardly worded, even aside from the bizarre use of "cohort"
# as a verb.  The main question I had after reading this is share of what? That
# is, what does the word "legislators" mean? Does it mean share of US
# representatives? Of all congressmen (representatives and senators)?
#
# Since "legislator" cohorts are to be defined by the first year in which they
# become a *representative* (that is, the word "legislators" is implicitly
# identified as "representative" in that sentence above), then it seems to me
# that this questions is really asking:
#
# Group representatives by the century of the first year they became a
# representative.  For each group, compute the percentage of representatives
# that went on to become senators.
#
# That may not be what the author had in mind, but it's the clearest reading
# I can make out of the word salad. The following R code does that.

url = paste0("https://raw.githubusercontent.com/cathytanimura/", "sql_book/master/Chapter%204%3A%20Cohorts/")
terms = read.csv(paste0(url, "legislators_terms.csv"))[, c("id_bioguide", "term_start", "term_type")]
terms[["term_start"]] = as.Date(terms[["term_start"]])

# order by id and start date
terms = terms[order(terms[["id_bioguide"]], terms[["term_start"]]),]

# add a variable that indicates if the legislator was ever a senator
terms[["ever_a_senator"]] = ave(terms[["term_type"]], terms[["id_bioguide"]], FUN=function(x) any(x == "sen")) == TRUE

# find first term per id
n1 = c(1, diff(as.integer(factor(terms[["id_bioguide"]])))) != 0

# first term type and start, ever in senate by id:
first_terms = terms[n1, c("id_bioguide", "term_type", "term_start", "ever_a_senator")]

# Based on above discussion, only interested in legislators starting in the US House:
first_terms = first_terms[first_terms[["term_type"]] == "rep",]

# Add century...
first_terms[["first_term_century"]] = 1 + floor(as.numeric(format(first_terms[["term_start"]], "%Y"))/100)

# All that's left is a trivial aggregation:
ans = aggregate(list("Ratio_of_reps_that_went_on_to_senate" = first_terms[["ever_a_senator"]]),
                by = list("Fist_term_century" = first_terms[["first_term_century"]]),
                FUN=function(x) sum(x) / length(x))

print(ans)
    Fist_term_century Ratio_of_reps_that_went_on_to_senate
1                18                           0.19191919
2                19                           0.05675487
3                20                           0.05678516
4                21                           0.03660322
iangow commented 8 months ago

From a quick look at your code, I think one difference would be the date filter (likely imposed because there's not enough data on recent congresspeople).

The Tanimura book is a nice one for example queries (though nothing really "big" for benchmarking).

I've been dabbling a bit with Ibis recently. It offers a number of backends, including polars (see here).

bwlewis commented 8 months ago

Yeah, I saw those complex date windows in the queries -- but not in the quoted question! That was confusing.

I don't know if that question is representative, but its poor wording does not bode well for the book for me anyway.

iangow commented 8 months ago

It's a complex topic (see here), though I too noted that the writing could use some work. (I first read the book through an institutional subscription to O'Reilly.)

bwlewis commented 8 months ago

Here is another base-R quick take on:

Group representatives by the century of the first year they became a representative. For each group, compute the percentage of representatives that went on to become senators within 5, 10, and 15 years, or ever.

The results are different yet again, although consistent with the previous unbounded ("ever") version:

url = paste0("https://raw.githubusercontent.com/cathytanimura/", "sql_book/master/Chapter%204%3A%20Cohorts/")
terms = read.csv(paste0(url, "legislators_terms.csv"))[, c("id_bioguide", "term_start", "term_type")]
terms[["term_start"]] = as.Date(terms[["term_start"]])
terms[["id_bioguide"]] = factor(terms[["id_bioguide"]])

# order by id and start date
terms = terms[order(terms[["id_bioguide"]], terms[["term_start"]]),]

# find first term per id
n1 = c(1, diff(as.integer(terms[["id_bioguide"]]))) != 0

# *minimum* days-to-senate per person (a bit janky, even for me)
d = ave(as.integer(factor(terms$term_type)), terms$id_bioguide, FUN=function(x) cumsum(c(0,diff(x))!=0))
d[n1] = 1
terms[["dts"]] = ave(
  (as.numeric(terms[["term_start"]]) - as.numeric(as.Date("1789-1-1"))) * d,
  terms[["id_bioguide"]],
  FUN =function(x) {
    delta = (x-x[1])
    delta[delta <= 0 ] = Inf      # this is pretty icky, surely a better way
    m = min(delta, na.rm = TRUE)
    m[is.infinite(m)] = 0
    m
  }
)

# first term type and start, ever in senate, dts by id:
first_terms = terms[n1, c("id_bioguide", "term_type", "term_start", "dts")]

# Based on above discussion, only interested in legislators starting in the US House:
first_terms = first_terms[first_terms[["term_type"]] == "rep",]

# Add century...
first_terms[["first_term_century"]] = 1 + floor(as.numeric(format(first_terms[["term_start"]], "%Y"))/100)

# All that's left is a trivial aggregation:
ans = aggregate(list(
    "on_to_senate_within_5yrs" =  0 < first_terms[["dts"]] & first_terms[["dts"]] <= 5 * 365,
    "on_to_senate_within_10yrs" =  0 < first_terms[["dts"]] & first_terms[["dts"]] <= 10 * 365,
    "on_to_senate_within_15yrs" =  0 < first_terms[["dts"]] & first_terms[["dts"]] <= 15 * 365,
    "on_to_senate_ever" = 0 < first_terms[["dts"]]),
  by = list("Fist_term_century" = first_terms[["first_term_century"]]),
  FUN=function(x) sum(x) / length(x))

print(ans)
  First_term_century on_to_senate_within_5yrs on_to_senate_within_10yrs on_to_senate_within_15yrs on_to_senate_ever
1                 18              0.050505051                0.09764310                0.14478114        0.19191919
2                 19              0.008704735                0.02437326                0.04073816        0.05675487
3                 20              0.010060362                0.03286385                0.04761905        0.05678516
4                 21              0.016105417                0.03074671                0.03513909        0.03660322

It's surprising how tricky it is to do this stuff! I'm not really knocking the book specifically; it's just that subtle ambiguity in language can lead to data analysis errors.

The topic of how to precisely phrase the intent of an analysis without being too pedantic is a really worthwhile thing for everyone to think about.

But circling back to your original point, the dplyr stuff is just so much more readable (for me anyway) than SQL.
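
For comparison, a rough dplyr sketch of the first aggregation above (same `terms` data frame as read in earlier; this is my own translation, not the book's query):

library(dplyr)

terms |>
  group_by(id_bioguide) |>
  arrange(term_start, .by_group = TRUE) |>
  summarise(
    first_type     = first(term_type),
    first_start    = first(term_start),
    ever_a_senator = any(term_type == "sen")
  ) |>
  filter(first_type == "rep") |>
  mutate(first_term_century = 1 + floor(as.numeric(format(first_start, "%Y")) / 100)) |>
  group_by(first_term_century) |>
  summarise(ratio_to_senate = mean(ever_a_senator))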

iangow commented 8 months ago

Cool. Your base R is impressive (and clear)! I started with R (and RStudio) early in 2011 (I recall meeting a lonely Joe Cheng at the RStudio booth at R/Finance in May 2011) along with PostgreSQL and moved to dplyr almost as soon as it came out, so my base R is rusty. In some ways, your base R reads a lot like pandas (something else I'm rusty with).

I'll bet the difference is due to my interpretation of the word 'legislator' in the original question.

To be fair to Cathy (apart from the thought-provoking queries), she does have a nice discussion of the US system of government in a small number of words (e.g., senators serve for 6 years, representatives for 2).

I view the dplyr (mostly dbplyr) code as close enough to "pseudo-code" (not sure what my scare quotes are doing here) to be about right for the purposes of my book. (My view is that clear thinking about the question and the data is often more important than the [computing] language.)

bwlewis commented 8 months ago

The quick-n-dirty R code with the time horizons had a bug, of course: it was using the maximum time to Senate instead of the minimum (the unbounded ratio was still correct).

I've adjusted that janky part of the code to use the first time to Senate per id. The numbers are now closer to the numbers in the reference but still don't agree, especially in the 21st century.

Again, it's amazing how hard it can be to do something as simple as adding up some numbers by group.
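
For what it's worth, a less janky sketch of that part -- first House start versus first Senate start per id (the variable names here are mine, and unlike the code above this does not restrict attention to people whose first term was in the House):

# first "rep" start and first "sen" start per person, then the gap in days
reps = subset(terms, term_type == "rep")
sens = subset(terms, term_type == "sen")
first_rep = aggregate(list(first_rep = as.numeric(reps[["term_start"]])),
                      by = list(id_bioguide = reps[["id_bioguide"]]), FUN = min)
first_sen = aggregate(list(first_sen = as.numeric(sens[["term_start"]])),
                      by = list(id_bioguide = sens[["id_bioguide"]]), FUN = min)
m = merge(first_rep, first_sen, by = "id_bioguide", all.x = TRUE)
# days from first House term to first Senate term; NA if never a senator,
# negative if they were a senator before they were a representative
m[["dts"]] = m[["first_sen"]] - m[["first_rep"]]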

bwlewis commented 8 months ago

One last thing: polars is pretty well supported in R now, thanks to the astounding work of the R/Rust folks:

https://rpolars.github.io/index.html

There are native, DBI, and dbplyr/tidyverse-style incantations available.

For in-memory work, IMO polars is about as good as it gets, and you can use the nice dplyr syntax.
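
A tiny sketch of what that looks like, assuming the polars and tidypolars packages are installed (mtcars is just a stand-in):

library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# native polars syntax
as_polars_df(mtcars)$
  filter(pl$col("cyl") == 4)$
  select(c("mpg", "wt"))

# the same thing with dplyr verbs, via tidypolars
as_polars_df(mtcars) |>
  filter(cyl == 4) |>
  select(mpg, wt)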

iangow commented 8 months ago

The code below was adapted from here. For this use case, polars does beat DuckDB. (A more representative case might involve reading data from disk.)

For my purposes, I wonder how polars performs when SQL runs out of runway and the benchmark becomes in-memory calculation (e.g., multivariate regression).
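
Not a benchmark, but the pattern I would expect to use once the engine runs out of runway: reduce the data in DuckDB (or a polars lazy frame), collect() into R, and hand the result to lm(). A minimal sketch with iris as a stand-in:

library(dplyr)
library(DBI)

con <- dbConnect(duckdb::duckdb())
iris_db <- copy_to(con, iris)

# do the row/column reduction in the engine, then pull into R for the model fit
dat <- iris_db |>
  filter(Species != "setosa") |>
  select(Sepal.Length, Sepal.Width, Petal.Length) |>
  collect()

fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = dat)
summary(fit)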

install.packages(
  'tidypolars', 
  repos = c('https://etiennebacher.r-universe.dev', getOption("repos"))
)
#> 
#> The downloaded binary packages are in
#>  /var/folders/nx/q9ytjhs52_z1pnspvfh_9n180000gn/T//RtmpZBxEI3/downloaded_packages

library(tidypolars)
#> Registered S3 method overwritten by 'tidypolars':
#>   method                 from  
#>   print.RPolarsDataFrame polars
library(polars)
library(collapse, warn.conflicts = FALSE)
#> collapse 2.0.7, see ?`collapse-package` or ?`collapse-documentation`
library(dplyr, warn.conflicts = FALSE)
library(DBI)
#> 
#> Attaching package: 'DBI'
#> The following object is masked from 'package:tidypolars':
#> 
#>     fetch
# library(bench)
# library(data.table)
# library(duckdb)

db <- dbConnect(duckdb::duckdb())

large_iris <- data.table::rbindlist(rep(list(iris), 100000))
large_iris_pl <- as_polars_lf(large_iris)
large_iris_db <- copy_to(db, large_iris) |> compute()

format(nrow(large_iris), big.mark = ",")
#> [1] "15,000,000"

bench::mark(
  polars = {
    large_iris_pl$
      select(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))$
      with_columns(
        pl$when(
          (pl$col("Petal.Length") / pl$col("Petal.Width") > 3)
        )$then(pl$lit("long"))$
          otherwise(pl$lit("large"))$
          alias("petal_type")
      )$
      filter(pl$col("Sepal.Length")$is_between(4.5, 5.5))$
      collect()
  },
  tidypolars = {
    large_iris_pl |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |> 
      filter(between(Sepal.Length, 4.5, 5.5)) |> 
      collect()
  },
  dplyr = {
    large_iris |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5))
  },
  `DuckDB/dbplyr` = {
    large_iris_db |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = if_else((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5)) |>
      collect()
  },
  collapse = {
    large_iris |>
      fselect(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")) |>
      fmutate(
        petal_type = data.table::fifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      fsubset(Sepal.Length >= 4.5 & Sepal.Length <= 5.5)
  },
  check = FALSE,
  iterations = 40
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 5 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars         54.36ms  74.23ms    12.1    377.37KB     0   
#> 2 tidypolars     64.13ms  70.72ms    11.4    915.87KB     1.14
#> 3 dplyr            1.36s    1.41s     0.704    1.79GB     2.04
#> 4 DuckDB/dbplyr 159.23ms 169.14ms     5.88   212.51MB     2.65
#> 5 collapse      207.32ms 230.58ms     4.31   745.96MB     4.20

# NOTE: do NOT take the "mem_alloc" results into account.
# `bench::mark()` doesn't report the accurate memory usage for packages calling
# Rust code.

Created on 2024-01-10 with reprex v2.0.2

iangow commented 8 months ago

And again, but from disk:

Updated to use pl$scan_parquet() (which produces a lazy frame) rather than pl$read_parquet(). With the update, things look similar to the above in terms of relative timing.

install.packages(
  'tidypolars', 
  repos = c('https://etiennebacher.r-universe.dev', getOption("repos"))
)
#> 
#> The downloaded binary packages are in
#>  /var/folders/nx/q9ytjhs52_z1pnspvfh_9n180000gn/T//RtmprPZZjo/downloaded_packages

library(tidypolars)
#> Registered S3 method overwritten by 'tidypolars':
#>   method                 from  
#>   print.RPolarsDataFrame polars
library(polars)
library(collapse, warn.conflicts = FALSE)
#> collapse 2.0.7, see ?`collapse-package` or ?`collapse-documentation`
library(dplyr, warn.conflicts = FALSE)
library(DBI)
#> 
#> Attaching package: 'DBI'
#> The following object is masked from 'package:tidypolars':
#> 
#>     fetch
# library(bench)
# library(data.table)
# library(duckdb)

db <- dbConnect(duckdb::duckdb())

large_iris <- data.table::rbindlist(rep(list(iris), 100000))
large_iris_pl <- as_polars_lf(large_iris)
large_iris_pl$sink_parquet("large_iris.parquet")

format(nrow(large_iris), big.mark = ",")
#> [1] "15,000,000"

res <-
  bench::mark(
  polars = {
    pl$scan_parquet("large_iris.parquet")$
      select(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))$
      with_columns(
        pl$when(
          (pl$col("Petal.Length") / pl$col("Petal.Width") > 3)
        )$then(pl$lit("long"))$
          otherwise(pl$lit("large"))$
          alias("petal_type")
      )$
      filter(pl$col("Sepal.Length")$is_between(4.5, 5.5)) |>
      collect()
  },
  tidypolars = {
    pl$scan_parquet("large_iris.parquet") |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |> 
      filter(between(Sepal.Length, 4.5, 5.5)) |>
      collect()
  },
  dplyr = {
    arrow::read_parquet("large_iris.parquet") |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5))
  },
  `DuckDB/dbplyr` = {
    tbl(db, "read_parquet('large_iris.parquet')",
        check_from = FALSE) |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = if_else((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5)) |>
      collect()
  },
  collapse = {
    arrow::read_parquet("large_iris.parquet") |>
      fselect(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")) |>
      fmutate(
        petal_type = data.table::fifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      fsubset(Sepal.Length >= 4.5 & Sepal.Length <= 5.5)
  },
  check = FALSE,
  iterations = 40
)
#> ℹ Using `ident_q()` for a table identifier is intended as fallback in case of
#>   bugs.
#> ℹ If you need it to work around a bug please open an issue
#>   <https://github.com/tidyverse/dbplyr/issues>.
#> This message is displayed once every 8 hours.
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.

# NOTE: do NOT take the "mem_alloc" results into account.
# `bench::mark()` doesn't report the accurate memory usage for packages calling
# Rust code.

res |>
  select(expression, median, total_time)
#> # A tibble: 5 × 2
#>   expression      median
#>   <bch:expr>    <bch:tm>
#> 1 polars        116.21ms
#> 2 tidypolars    101.26ms
#> 3 dplyr            1.52s
#> 4 DuckDB/dbplyr 231.95ms
#> 5 collapse      305.41ms

Created on 2024-01-10 with reprex v2.0.2

iangow commented 8 months ago
> res |>
+   select(expression, median, total_time)
# A tibble: 5 × 3
  expression      median total_time
  <bch:expr>    <bch:tm>   <bch:tm>
1 polars        101.07ms      4.26s
2 tidypolars    133.09ms       6.4s
3 dplyr            1.44s      58.2s
4 DuckDB/dbplyr 234.96ms      9.55s
5 collapse      271.25ms     11.07s
bwlewis commented 8 months ago

This is great; I will have to revisit these notes soon (one more thing in the queue). Both DuckDB and polars have matured a lot.

My intuition is that polars is hard to beat in memory, but for large out-of-core joins and similar things, DuckDB should prevail. Spilling gracefully across memory regimes has always been a strength of databases.
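
A hypothetical sketch of what I mean (the file names, column names, and limits are made up): cap DuckDB's memory and give it a spill directory, then join two on-disk Parquet files without ever materializing them in R.

library(dplyr)
library(DBI)

con = dbConnect(duckdb::duckdb())
dbExecute(con, "SET memory_limit = '1GB'")            # force spilling sooner
dbExecute(con, "SET temp_directory = 'duckdb_spill'") # where spilled data goes

big_a = tbl(con, "read_parquet('big_a.parquet')")
big_b = tbl(con, "read_parquet('big_b.parquet')")

big_a |>
  inner_join(big_b, by = "id") |>
  count(grp) |>
  collect()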

bwlewis commented 8 months ago

At the risk of belaboring the CTE example, here is a different implementation that frames the problem as a Kaplan-Meier survival model (time to Senate) using R's survival package and plots cumulative incidence curves for the fraction of representatives who go on to the Senate, by century of their first term (kinda like the plot in your note)...

library(survival)

url = paste0("https://raw.githubusercontent.com/cathytanimura/", "sql_book/master/Chapter%204%3A%20Cohorts/")
terms = read.csv(paste0(url, "legislators_terms.csv"))[, c("id_bioguide", "term_start", "term_type")]
terms[["term_start"]] = as.Date(terms[["term_start"]])
terms[["id_bioguide"]] = factor(terms[["id_bioguide"]])

# Set up a date origin
origin = as.numeric(as.Date("1789-1-1"))

# We only need a few variables...
terms = terms[, c("id_bioguide", "term_type", "term_start")]

# order by id and start date
terms = terms[order(terms[["id_bioguide"]], terms[["term_start"]]),]

# add days_from_first_term by id
terms[["d"]] = ave(as.numeric(terms[["term_start"]]) - origin, terms[["id_bioguide"]], FUN = function(x) c(0, diff(x)))

# logical version of term_type FALSE/rep TRUE/sen
terms[["senate"]] = terms[["term_type"]] == "sen"

# cut-off all rows per id after the first entry into the senate
terms = terms[ave(terms[["senate"]], terms[["id_bioguide"]], FUN=function(x) cumsum(cumsum(x)) <= 1), ]

# Add first-term century per id
terms[["first_term_century"]] = ave(1 + floor(as.numeric(format(terms[["term_start"]], "%Y"))/100), terms[["id_bioguide"]], FUN=min)

# find first and last row per id (we need the last row)...
n1 = c(1, diff(as.integer(terms[["id_bioguide"]]))) != 0
n2 = c(n1[-1], TRUE)
terms = terms[n2, ]

# omit legislators who started in the senate (see the discussion of the original question)
terms = terms[!(terms[["d"]] == 0 & terms[["senate"]]), ]

# do not right-censor data (legislators never in the senate are not right/early-censored)
terms[["d"]][!terms[["senate"]]] = max(terms[["d"]])

# A non-censored Kaplan-Meier survival model by first-term-century
s = survfit(Surv(d, senate) ~ first_term_century, data=terms)

# Survival models normally associate the event (went on to Senate in this case) with death.
# We invert that meaning (although some Senators might dispute this)...

s[["surv"]] = 1 - s[["surv"]]
# convert days to approximate years
s[["time"]] = s[["time"]] / 365

# plot the cumulative incidence by years for each group
ylim = range(s[["surv"]])
xlim = range(s[["time"]])
plot.new()
plot.window(xlim = range(s[["time"]]), ylim = range(s[["surv"]]))
axis(1, at = pretty(xlim))
axis(2, at = round(c(max(ylim), pretty(ylim)),2))
title(main = "Cumulative incidence of representatives that went on to the Senate",
     xlab = "Years after serving first term as a representative", cex.main=2, cex.lab=1.5)
cs = cumsum(s[["strata"]])
Map(function(j) {
  i = seq(max(0, cs[j-1]) + 1, cs[j])
  lines(s[["time"]][i], s[["surv"]][i], lwd=4, type="s", col=j)
}, seq_along(cs))
legend("bottomright", legend=names(cs), fill=seq_along(cs), cex=2, bty="n")
bwlewis commented 8 months ago

[Plot: cte-surv -- cumulative incidence of representatives that went on to the Senate, by first-term century]

bwlewis commented 8 months ago

I guess one nice thing gained by equipping the analysis with a model, even a simple one like Kaplan-Meier, is that you can think about questions involving inference like "are the first-term-century groups significantly different from each other or not?"

The survival package also spits out some nice estimates of confidence intervals (Breslow or otherwise) for each curve above, which would be easy to plot.

I think this kind of model is a fruitful way to think about that original question.
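
For instance, a minimal sketch reusing the `terms` data frame built above (the output is on the survival scale, so it needs the same 1 - surv flip as before):

library(survival)

# log-rank test: are the first-term-century curves different?
survdiff(Surv(d, senate) ~ first_term_century, data = terms)

# survfit() already carries confidence intervals for each curve
s2 = survfit(Surv(d, senate) ~ first_term_century, data = terms, conf.type = "log-log")
summary(s2, times = c(5, 10, 15) * 365)  # estimates with lower/upper bounds at ~5/10/15 years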

iangow commented 8 months ago

> I think this kind of model is a fruitful way to think about that original question.

Agreed. The SQL is (or should be) building up the same curves in a different way (the motivating example I use for my "fake" chapter here is literally survival). I feel there might be value in understanding the "data steps" behind the model.

I've seen survival curves that go down (in the sense of the "upside-down" curves here) in parts, suggesting some kind of data glitch.

bwlewis commented 8 months ago

I'm plotting cumulative incidence, which is (1 - survival) -- that is, for non-right-censored data. So these are technically cumulative incidence models (as indicated in the plot). Survival models can be more specialized but use the same machinery.
