danzafar / tidyspark

tidyspark: a tidyverse implementation of SparkR built for simplicity, elegance, and ease of use.

Issue with `filter(col = max(col))` #17

Closed: danzafar closed this issue 4 years ago

danzafar commented 4 years ago

Running the following type of windowed filter:

flights_tbl %>% 
    select(flight, dep_time, dep_delay) %>% 
    group_by(flight) %>% 
    filter(dep_delay == max(dep_delay))

Results in error:

Error in handleErrors(returnStatus, conn) : 
  org.apache.spark.sql.AnalysisException: filter expression '`agg_col0`' of type double is not a boolean.;;
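For reference, the intended semantics work fine on a local data frame with plain dplyr, which is the behavior tidyspark is trying to mirror. A minimal local sketch (illustrative data, not the actual flights_tbl):

```r
# Local reproduction of the intended semantics with plain dplyr.
# Keeps, within each flight, the row(s) whose dep_delay equals the
# per-group maximum.
library(dplyr)

flights_local <- tibble(
  flight    = c(1L, 1L, 2L, 2L),
  dep_time  = c(517L, 533L, 542L, 544L),
  dep_delay = c(2, 4, 2, -1)
)

flights_local %>%
  group_by(flight) %>%
  filter(dep_delay == max(dep_delay))
```

The Spark error arises because tidyspark translates the grouped `max(dep_delay)` into a pre-computed aggregate column (`agg_col0`), and the filter expression ends up referring to that double column directly instead of a boolean comparison against it.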
danzafar commented 4 years ago

Looks like this is being caused by some unexpected tidy eval behavior:

Browse[1]> quo_sub
<quosure>
expr: ^(dep_delay = agg_col0)
env:  0x7febe1c4a7b0
Browse[1]> df_cols_update
$flight
Column flight 

$dep_time
Column dep_time 

$dep_delay
Column dep_delay 

$agg_col0
Column agg_col0 

Browse[1]> rlang::eval_tidy(quo_sub, df_cols_update)
Column agg_col0 
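The `eval_tidy` result makes sense once you notice that `(dep_delay = agg_col0)` is an assignment, not a comparison: evaluating an assignment returns the right-hand side, so the quosure yields the `agg_col0` Column rather than a boolean Column. A hedged standalone illustration with rlang (illustrative names, not tidyspark internals):

```r
# Why the malformed quosure evaluates to a value rather than a logical:
# `(a = b)` is assignment (returns b's value); `a == b` is the comparison.
library(rlang)

mask <- list(a = 1, b = 2)

eval_tidy(quo(a == b), mask)   # comparison: FALSE
eval_tidy(quo((a = b)), mask)  # assignment: returns the value of b, not TRUE/FALSE
```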
danzafar commented 4 years ago

Scratch that, the quosure itself seems to be malformed. For arbitrary columns being compared, the quosure looks right:

<quosure>
expr: ^Sepal_Length == Sepal_Width
env:  0x7f8d7ed1f300

but for our operation:

<quosure>
expr: ^(Sepal_Length = agg_col0)
env:  0x7f8d995dc2b8

This is being caused by `fix_dot` returning the string representation of the valid Spark operation. I have fixed a number of unrelated issues while investigating this, in case there is going to be a drastic change to `fix_dots`.

danzafar commented 4 years ago

In `fix_dots` I added two lines that should have solved this:

    } else if (identical(op, `==`)) {
      paste(fix_dot(args[[1]], env), "==", fix_dot(args[[2]], env))

But unfortunately now we hit a deeper error when Spark's `collect` runs on this:

java.lang.UnsupportedOperationException: Cannot evaluate expression: max(input[0, double, false])
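This `UnsupportedOperationException` is Spark's way of saying an aggregate like `max(...)` cannot be evaluated row-by-row inside a filter predicate. The usual Spark-side fix is to compute the per-group max as a window function column and filter against that. A hedged sketch using SparkR's API (`sdf` is assumed to be a SparkDataFrame with the columns used above):

```r
# Sketch of the standard Spark workaround: materialize the per-group max
# via a window function, then filter on the resulting column.
library(SparkR)

w   <- windowPartitionBy("flight")
sdf <- withColumn(sdf, "max_delay", over(max(sdf$dep_delay), w))
res <- where(sdf, sdf$dep_delay == sdf$max_delay)
```

This mirrors what the `agg_col0` machinery appears to be doing internally; the remaining bug is in how the comparison expression is rebuilt, not in the window strategy itself.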
danzafar commented 4 years ago

This seems to be the same issue as #18.

danzafar commented 4 years ago

Solved with the same solution as #19.