IMSMWU / RClickhouse

A 'DBI' Interface to the Yandex Clickhouse Database Providing Basic 'dplyr' Support
GNU General Public License v2.0
92 stars 25 forks source link

Semi join #103

Open lucasxteixeira opened 11 months ago

lucasxteixeira commented 11 months ago

Hi,

I recently encountered an issue while using dplyr::semi_join with Clickhouse. The default code generated by dplyr produces a subquery with dependencies, and this isn't supported in Clickhouse (or am I wrong?). However, I noticed that Clickhouse does support LEFT SEMI JOIN. Consequently, I've wrote the following function to address this:

#' @export
#' @importFrom dbplyr sql_query_semi_join
sql_query_semi_join.ClickhouseConnection <- function(con, x, y, anti, by, where, vars, ..., lvl = 0) {

  x <- dbplyr:::dbplyr_sql_subquery(con, x, name = by$x_as, lvl = lvl)
  y <- dbplyr:::dbplyr_sql_subquery(con, y, name = by$y_as, lvl = lvl)

  on <- dbplyr:::sql_join_tbls(con, by)

  JOIN <- ifelse(anti, dplyr::sql("ANTI LEFT JOIN"), dplyr::sql("SEMI LEFT JOIN"))

  # Wrap with SELECT since callers assume a valid query is returned
  clauses <- list(
    dbplyr:::sql_clause_select(con, vars),
    dbplyr:::sql_clause_from(x),
    dbplyr:::sql_clause(JOIN, y),
    dbplyr:::sql_clause("ON", on, sep = " AND", parens = TRUE, lvl = 1)
  )
  dbplyr:::sql_format_clauses(clauses, lvl, con)
}

Nonetheless, I'm aware that my function uses some internal dbplyr functions, and I'm uncertain about the permissibility of this approach. Could someone provide some directions on how to refine this function for a potential PR?

Thank you in advance.

inkrement commented 10 months ago

Hi Lucas! We appreciate your help! The package comprises two parts: (1) the Clickhouse-Driver and (2) the dbplyr interface. Your code does dplyr-code generation and, thus, targets the second part. We started to develop the package at a time when Clickhouse's Join-support was super basic, not in line with the standards, and highly restricted, but that changed a lot. Consequently, you can do a much better job than we did.

I am thrilled that you want to invest time here, and a pull request is welcome. Ideally, you would adapt the R/dplyr.R file, as this is where the dplyr-specific code is located. But here are a few more settings to keep in mind: