WinVector / rquery

Data Wrangling and Query Generating Operators for R. Distributed under choice of GPL-2 or GPL-3 license.
https://winvector.github.io/rquery/
Other
109 stars 15 forks source link

sql_node not passing mods to query #6

Closed prescient closed 5 years ago

prescient commented 5 years ago

First thanks for the really cool package. Second I'm not sure if I'm not using the sql_node correctly (most likely) or if it is indeed not passing the mods to the query.

df <- db_td(con, 'df') %.>%
    select_rows_se(., glue("trans_date <=  str_to_date('{date_cutoff}', '%Y-%m-%d')")) %.>%
    sql_node(.,
             c("max_date" %:=% "max(trans_date)"),
             mods = "GROUP BY cust",
             orig_columns = F)
  cat(to_sql(df, con))

The resulting query is missing the group by (see below). I've tried a few different ways of specifying the group by but I couldn't find a way to make it work as I intended. If i am indeed using the mods argument incorrectly I would appreciate being pointed in the right direction.

SELECT max(trans_date) AS "max_date" FROM ( SELECT * FROM ( SELECT "cust", "trans_date", "sales", "gross_profit", "magic_col" FROM "df" ) tsql_81462090590956461376_0000000000 WHERE "trans_date" <= str_to_date ( '2017-04-02' , '%Y-%m-%d' ) ) tsql_81462090590956461376_0000000001 GROUP BY cust

JohnMount commented 5 years ago

Sorry you are having trouble. Thank you for taking the time to file an issue.

I'll try to look into this soon (and also get you a non-SQL solution as rquery has evolved quite a bit since the original instructions.

prescient commented 5 years ago

Edit: John I will close this issue. The problem is that I didn't include "cust" in the select statement as part of the SQL node. As I suspected this was a user error. Apologies for opening a non-issue! Once again thanks for your work on this!

prescient commented 5 years ago

Needed to include group by fields in the sql body.

JohnMount commented 5 years ago

Glad you got where you wanted. We have a couple of newer (easier to use) notations in the pipe. One can build up the SQL using relational operations (such as extend and project) without having to use the SQL node. Examples:

library("rquery")
library("glue")

date_cutoff <- '2017-04-02'

td <- mk_td("df", 
            c("cust",
              "trans_date",
              "sales",
              "gross_profit",
              "magic_col"))

ops <- td %.>%
  select_rows_se(., 
                 glue("trans_date <=  str_to_date('{date_cutoff}', '%Y-%m-%d')")) %.>%
  project_nse(., 
              max_date %:=% max(trans_date),
              groupby = "cust")

cat(to_sql(ops, rquery::rquery_default_db_info))

# or new rquery 1.2.1 notation (not yet up on CRAN)

ops <- td %.>%
  select_rows(., 
              trans_date <=  str_to_date(.(date_cutoff), '%Y-%m-%d'))) %.>%
  project(., 
          max_date = max(trans_date),
          groupby = "cust")

cat(to_sql(ops, rquery::rquery_default_db_info))
prescient commented 5 years ago

Thanks John that is helpful. I like the .() notation as it is a little easier than using glue. I do have one question for more idiomatic queries from a database like MonetDB. I'd note that I'm aware MonetDB is not yet supported, but I've been using it and rquery with success.

Subtracting dates in MonetDB might take the form in a select statement of: max_date - INTERVAL '1' YEAR as min_date

I've noticed that the standard evaluation nodes don't seem to like this sort of syntax but it works OK in the sql_node. Is there a good way to specify these types of extend/project statements or should I default to the sql_node for these?

JohnMount commented 5 years ago

MonetDB is on the list of nice to haves (but isn't something we have the extra time for in the near term. Until then one would have to use the sql nodes for custom syntax (sorry, I know those nodes are not as nice as the others).

prescient commented 5 years ago

Thanks John. I understand monetdb isn't supported and wanted to make sure I wasn't missing a feature that would simplify the process. I really like the flexibility of the sql_node so your work is very much appreciated.

JohnMount commented 5 years ago

Well thanks for your kind words. Definitely want to apologize if I have made anything harder than it has to be.

JohnMount commented 5 years ago

I still feel I may have made things too hard by not teaching enough (sorry!). The wrapr qe() (quote expression) and qae() operators are supposed to make working with sql_node() a bit more bearable and now include bquote() functionality.

library("rquery")

date_cutoff <- '2017-04-02'

td <- mk_td("df", 
            c("cust",
              "trans_date",
              "sales",
              "gross_profit",
              "magic_col"))

ops <- td %.>%
  select_rows_se(., 
                 qe(trans_date <= str_to_date(.(date_cutoff), '%Y-%m-%d'))) %.>%
  sql_node(.,
           qae(max_date = max(trans_date)),
           mods = "GROUP BY cust",
           orig_columns = F)

cat(to_sql(ops, rquery::rquery_default_db_info))

# SELECT
#  max(trans_date) AS "max_date"
# FROM (
#  SELECT * FROM (
#   SELECT
#    "cust",
#    "trans_date",
#    "sales",
#    "gross_profit",
#    "magic_col"
#   FROM
#    "df"
#  ) tsql_13720642024446599047_0000000000
#  WHERE "trans_date" <= str_to_date ( '2017-04-02' , '%Y-%m-%d' )
# ) tsql_13720642024446599047_0000000001 GROUP BY cust

I am trying to improve more of the pipeable SQL documentation to point this way.

JohnMount commented 5 years ago

The next version of rquery (1.2.1 should go up in late December 2018) will include a variation of the notation above where one can add ".[]" to tell rquery that a symbol is a column name and it will be an error for it not to be present (that is what all the rigmarole with quoted symbols is doing in the current version). That means the above could be written as follows.


library("rquery")

date_cutoff <- '2017-04-02'

td <- mk_td("df", 
            c("cust",
              "trans_date",
              "sales",
              "gross_profit",
              "magic_col"))

ops <- td %.>%
  select_rows_se(
    ., 
    qe(trans_date <=  str_to_date(.(date_cutoff), '%Y-%m-%d'))) %.>%
  sql_node(
    .,
    qae(max_date = max(.[trans_date])),
    mods = "GROUP BY cust",
    orig_columns = F)

cat(to_sql(ops, rquery::rquery_default_db_info))

# SELECT
#  max( "trans_date" ) AS "max_date"
# FROM (
#  SELECT * FROM (
#   SELECT
#    "cust",
#    "trans_date",
#    "sales",
#    "gross_profit",
#    "magic_col"
#   FROM
#    "df"
#  ) tsql_86277774531948168209_0000000000
#  WHERE "trans_date" <= str_to_date ( '2017-04-02' , '%Y-%m-%d' )
# ) tsql_86277774531948168209_0000000001 GROUP BY cust

Basically I usually use rquery for the relational operator nodes (I like them). But it looks like we can also improve the sql_node() notation a bit with the qae() and qe() helper functions.

prescient commented 5 years ago

Thanks for the tip on the qe and qae functions. I agree that the nodes are quite handy as they make for great building blocks when constructing large queries.

Additionally, I've noticed on some of our data sets and data transformations that we are seeing significant speedups by switching from data.table to MonetDBLite (~3x faster) while leveraging rquery. Let me know if there is something I can do to help with getting monetdb supported/integrated.

JohnMount commented 5 years ago

Thanks. I could use some testers when I get around to MonetDB (probably in January).