WinVector / rquery

Data Wrangling and Query Generating Operators for R. Distributed under choice of GPL-2 or GPL-3 license.
https://winvector.github.io/rquery/
Other
109 stars 15 forks source link

Feature Request: Natural_Join on columns with different names. #12

Closed NompuK closed 5 years ago

NompuK commented 5 years ago

When I try to do this:

dbopts <- rq_connection_tests(con)
db <- rquery_db_info(connection = con,
                     is_dbi = TRUE,
                     connection_options = dbopts)

table1_td <- db_td(db, "table1 ", qualifiers = c(schema = "schema1"))
table2_td <- db_td(db, "table2 ", qualifiers = c(schema = "schema2"))

join_sql  <- table1_td %.>%
                  natural_join(. , table2_td , by = c("id1" = "id2")) 

I get this error:

Error in natural_join.relop(., table2_td , by = c(id1 = "id2")) : rquery::natural_join.relop all tables must have all join keys, the following keys are not in some tables: id2

From the error message I take the insight that it is not possible to join two tables on columns with different names. Trying with theta_join() didn't work either, when using the = predicate (using any other it worked). My workaround was so far to rename the columns in one table for them to match the column names in the other table.

If I'm wrong, I'd be happy to be shown a solution. If that's indeed the case then this is probably a feature request.

JohnMount commented 5 years ago

You are correct: for natural join the keys must have the same name in both tables (also true for theta-join). I'll think on this as a feature request, I have some ideas how to fit it in naturally.

JohnMount commented 5 years ago

Thanks for your help and patience with this project.

I am liking a solution where the software inserts an explicit rename() node for the right table if needed. Here is an example of the new (rquery 1.3.6) capability.

library("rquery")
library("wrapr")

raw_connection <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
RSQLite::initExtension(raw_connection)
db <- rquery_db_info(
  connection = raw_connection,
  is_dbi = TRUE,
  connection_options = rq_connection_tests(raw_connection))

t1 <- rq_copy_to(db, "t1", data.frame(a = c(1, 2), b = c(3, 4)))
t2 <- rq_copy_to(db, "t2", data.frame(c = c(2, 1), d = c(6, 5)))

ops <- natural_join(t1, t2, by = c("a" = "c"))
cat(format(ops))
#> table(`t1`; 
#>   a,
#>   b) %.>%
#>  natural_join(.,
#>   table(`t2`; 
#>     c,
#>     d) %.>%
#>    rename(.,
#>     c('a' = 'c')),
#>   j= INNER, by= a)

execute(db, ops)
#>   a b d
#> 1 1 3 5
#> 2 2 4 6

DBI::dbDisconnect(raw_connection)
NompuK commented 5 years ago

Awesome. I think that is a good solution. It's intuitive and people know it from dplyr.

JohnMount commented 5 years ago

Great, the feature is committed and will go to CRAN in the next release (probably a month).