DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/
Other
45 stars 5 forks source link

dbplyr summarise translation to SQL #153

Closed ellmanj closed 2 years ago

ellmanj commented 2 years ago

Hello again! I've found what I think is another issue around translating to SQL.

When I try to use the summarise function to get the median value for a column, I get the following error.

> query <- dplyr::tbl(db_connection, "table_name") %>% summarise(median_value=median(column_name))
> query %>% collect()
Error: InvalidRequestException (HTTP 400). line 1:42: mismatched input '('. Expecting: 'BY'

Here is the SQL being executed:

> query %>% show_query()
<SQL>
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "column_name") AS "median_value"
FROM "table_name"

So I believe the issue is that PERCENTILE_CONT is not a valid Presto function. Perhaps we could use one of the approx_percentile functions here?

DyfanJones commented 2 years ago

Hi @ellmanj thanks for highlighting this issue. I will check this out :)

DyfanJones commented 2 years ago

So percentile_cont is coming from dbplyr::base_win, dbplyr::base_agg and dbplyr::base_no_win. Will replace these going forward.

DyfanJones commented 2 years ago

Branch sql_tran_median seems to have fixed the issue:

remotes::install_github("DyfanJones/noctua", ref = "sql_tran_median")
library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

# Create iris data set in AWS Athena
dbWriteTable(con, "iris", iris)

tbl(con, "iris") %>%
  summarise(lower_percentile=quantile(petal_length,0.25),
            median = median(petal_length),
            upper_percentile = quantile(petal_length,0.75))

# Source:   lazy query [?? x 3]
# Database: Athena 0.1.11 [default@eu-west-1/default]
  lower_percentile median upper_percentile
             <dbl>  <dbl>            <dbl>
1             1.6    4.4              5.1

tbl(con, "iris") %>%
  group_by(species) %>%
  summarise(lower_percentile=quantile(petal_length,0.25),
            median = median(petal_length),
            upper_percentile = quantile(petal_length,0.75))

# Source:   lazy query [?? x 4]
# Database: Athena 0.1.11 [default@eu-west-1/default]
  species    lower_percentile median upper_percentile
  <chr>                 <dbl>  <dbl>            <dbl>
1 virginica               5.1    5.6              5.9
2 setosa                  1.4    1.5              1.6
3 versicolor              4      4.4              4.6

@ellmanj Please let me know if this fixes the issue.

ellmanj commented 2 years ago

thanks @DyfanJones! I tried it out. It works if I don't supply the na.rm parameter to the median function. But in my current project in some places we do supply that parameter.

Example:

> query <- dplyr::tbl(db_connection, "person") %>% summarise(median_value=median(age, na.rm = TRUE))
> query %>% collect()
Error in median(age, na.rm = TRUE) : unused argument (na.rm = TRUE)

Any thoughts?

DyfanJones commented 2 years ago

@ellmanj Ah thanks, I have that behaviour missing. I will add it so that it mimics dbplyr

DyfanJones commented 2 years ago

@ellmanj add to parameter na.rm to median sql_translate_env`. Please have a go.

library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

# Create iris data set in AWS Athena
dbWriteTable(con, "iris", iris)

tbl(con, "iris") %>%
  summarise(lower_percentile=quantile(petal_length,0.25),
            median = median(petal_length),
            upper_percentile = quantile(petal_length,0.75))

# Source:   lazy query [?? x 3]
# Database: Athena 0.1.11 [default@eu-west-1/default]
lower_percentile median upper_percentile
<dbl>  <dbl>            <dbl>
  1             1.6    4.4              5.1
# Warning message:
# Missing values are always removed in SQL.
# Use `median(x, na.rm = TRUE)` to silence this warning
# This warning is displayed only once per session. 

tbl(con, "iris") %>%
  group_by(species) %>%
  summarise(lower_percentile=quantile(petal_length,0.25),
            median = median(petal_length, na.rm = TRUE),
            upper_percentile = quantile(petal_length,0.75))

# Source:   lazy query [?? x 4]
# Database: Athena 0.1.11 [default@eu-west-1/default]
species    lower_percentile median upper_percentile
<chr>                 <dbl>  <dbl>            <dbl>
1 virginica               5.1    5.6              5.9
2 setosa                  1.4    1.5              1.6
3 versicolor              4      4.4              4.6
ellmanj commented 2 years ago

@DyfanJones that worked! Thanks for the quick turnaround on this.

DyfanJones commented 2 years ago

@ellmanj PR #154 has been merged to master. Will release to the cran roughly on the 27/08 due to cran policy of monthly releases.

DyfanJones commented 2 years ago

Will push these changes to cran tomorrow. Sorry around the delay I had a short holiday so these changes got delayed abit

ellmanj commented 2 years ago

No worries at all. Thank you!

On Thu, Sep 9, 2021, 9:23 AM Larefly @.***> wrote:

Will push these changes to cran tomorrow. Sorry around the delay I had a short holiday so these changes got delayed abit

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/DyfanJones/noctua/issues/153#issuecomment-916145748, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4FPLT4PD4KTGAGOCGXHNLUBC7MTANCNFSM5BNA7TMA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

DyfanJones commented 2 years ago

Sorry the release was delayed by #156 and how to best implement it. I will update RAthena and continue with the release.

ellmanj commented 2 years ago

Ok thanks for the update.

DyfanJones commented 2 years ago

noctua v-2.2.0 has now been released to the cran, thanks for your wait :D