danzafar / tidyspark

tidyspark: a tidyverse implementation of SparkR built for simplicity, elegance, and ease of use.

Implement repartition and coalesce #40

Closed danzafar closed 4 years ago

danzafar commented 4 years ago

A good beginner task, at least I think....

danzafar commented 4 years ago

OK, now I’m seeing where some of the coalesce confusion is coming from. It turns out that dplyr::coalesce is going to cause a namespace conflict. dplyr::coalesce looks like:

> dplyr::coalesce
function (...) 
{
    if (missing(..1)) {
        abort("At least one argument must be supplied")
    }
    values <- list2(...)
    x <- values[[1]]
    values <- values[-1]
    for (i in seq_along(values)) {
        x <- replace_with(x, is.na(x), values[[i]], glue("Argument {i + 1}"), 
            glue("length of {fmt_args(~x)}"))
    }
    x
}

which is a plain function rather than an S3 generic, so it is definitely not going to be flexible enough to accept a spark_tbl. We can either try to hack the API somehow or just accept the namespace conflict for coalesce and write this method:

coalesce.data.frame <- function(...) {
  dplyr::coalesce(...)
}

which will dispatch a data.frame (or tbl) to the right place, assuming our own coalesce is a generic that dispatches on its first argument.
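To make the dispatch idea concrete, here is a minimal sketch of how an S3 generic could route a spark_tbl to a partition-style coalesce and everything else to dplyr. It is not tidyspark's actual implementation: `coalesce.spark_tbl`, the `n` argument, the `partitions` attribute, and `fake_tbl` are all hypothetical stand-ins so the example runs without a Spark session, and it uses a `default` method rather than the `data.frame` method above so that plain vectors are forwarded too.

```r
# Sketch only: coalesce.spark_tbl and the "partitions" attribute are
# hypothetical stand-ins, not tidyspark's real implementation.

# Generic that dispatches on the class of its first argument.
coalesce <- function(x, ...) UseMethod("coalesce")

# Anything that is not a spark_tbl is forwarded to dplyr's NA-filling coalesce.
coalesce.default <- function(x, ...) dplyr::coalesce(x, ...)

# Mock spark_tbl method: lower the partition count, never raise it.
coalesce.spark_tbl <- function(x, n, ...) {
  attr(x, "partitions") <- min(attr(x, "partitions"), n)
  x
}

# A fake spark_tbl with 8 partitions, for illustration only.
fake_tbl <- structure(list(), class = "spark_tbl", partitions = 8L)

# Partition-style call: dispatches to coalesce.spark_tbl.
shrunk <- coalesce(fake_tbl, 2L)
```

With this in place, a call such as `coalesce(c(1, NA, 3), 0)` would fall through to `dplyr::coalesce`, so loading the package masks dplyr's function without breaking its usual behaviour.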

danzafar commented 4 years ago

@jcamstan3370 as we discussed, for the dplyr-style coalesce the first value will have to be a Column for it to work on spark_tbls. In case the first value needs to be a constant (not sure how that's possible), I added an as.Column function (also lit and as_Column) to help out with this. See #47.