etiennebacher / tidypolars

Get the power of polars with the syntax of the tidyverse
https://tidypolars.etiennebacher.com

Better information about handling of missing values #81

Closed: etiennebacher closed 4 weeks ago

etiennebacher commented 5 months ago

polars has no equivalent of the na.rm argument that is common in R functions such as sum(), mean(), etc. This can lead to different results:

library(tidypolars)
#> Registered S3 method overwritten by 'tidypolars':
#>   method                 from  
#>   print.RPolarsDataFrame polars
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c(1, 2, NA))

df |> 
  mutate(y = mean(x))
#> # A tibble: 3 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1    NA
#> 2     2    NA
#> 3    NA    NA

df |> 
  as_polars_df() |> 
  mutate(y = mean(x))
#> shape: (3, 2)
#> ┌──────┬─────┐
#> │ x    ┆ y   │
#> │ ---  ┆ --- │
#> │ f64  ┆ f64 │
#> ╞══════╪═════╡
#> │ 1.0  ┆ 1.5 │
#> │ 2.0  ┆ 1.5 │
#> │ null ┆ 1.5 │
#> └──────┴─────┘

polars omits this kind of argument on purpose (the idea is that we should know our data and use describe() to check the null_count). Still, this can surprise R users who switch to polars. Should I raise a message every time? na.rm = FALSE is the default in R, so I need to see whether I can capture it in ... when I translate the expression to polars.
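
As a rough illustration of what "capturing na.rm" could look like, here is a minimal base R sketch that inspects an unevaluated call. The helper name has_na_rm_true() is hypothetical and is not part of tidypolars; it only handles a literal na.rm = TRUE and ignores cases where na.rm is passed as a variable.

```r
# Hypothetical helper (not tidypolars API): checks whether an unevaluated
# call explicitly sets `na.rm = TRUE`.
has_na_rm_true <- function(call) {
  # Drop the function name, keep the (possibly named) arguments
  args <- as.list(call)[-1]
  # `[[` returns NULL when no argument is named "na.rm"
  na_rm <- args[["na.rm"]]
  # Only a literal TRUE counts; a symbol or FALSE gives FALSE
  isTRUE(na_rm)
}

has_na_rm_true(quote(mean(x)))                 # FALSE
has_na_rm_true(quote(mean(x, na.rm = TRUE)))   # TRUE
has_na_rm_true(quote(mean(x, na.rm = FALSE)))  # FALSE
```

A translation layer could use a check like this to decide whether to emit a message or to change the generated expression.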

Should I also emphasize this in the docs of tidypolars or should this be done in upstream polars?

etiennebacher commented 5 months ago

> Should I also emphasize this in the docs of tidypolars or should this be done in upstream polars?

In any case, I should mention something here because not everyone will look at the polars documentation.

etiennebacher commented 4 months ago

https://github.com/pola-rs/polars/issues/10016

eitsupi commented 4 months ago

I suggest you look into dbplyr. dbplyr will generate a warning, I think.

etiennebacher commented 4 months ago

Thanks for the info. Reprex for later:

suppressPackageStartupMessages({
  library(dplyr)
  library(dbplyr)
})

df <- tibble(x = c(1, 2, NA))
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, df, "df", temporary = FALSE)

tbl(con, "df") |> 
  mutate(y = mean(x))
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source:   SQL [3 x 2]
#> # Database: sqlite 3.45.0 [:memory:]
#>       x     y
#>   <dbl> <dbl>
#> 1     1   1.5
#> 2     2   1.5
#> 3    NA   1.5

Created on 2024-02-12 with reprex v2.1.0.9000

etiennebacher commented 1 month ago

Maybe I could just use the has_nulls() method in a pl$when(), e.g.:

pl_mean <- function(x, na.rm = FALSE, ...) {
    if (isTRUE(na.rm)) {
        x$mean()
    } else {
        pl$when(x$has_nulls())$then(NA)$otherwise(x$mean())
    }
}

I need to see how this would play with rowwise(), and $has_nulls() would first need to be implemented in polars.
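
For reference, the semantics that the when/then/otherwise idea aims for can be sketched on plain R vectors. This is only an illustration of the intended behaviour, not polars code; the function name mean_with_na_rm() is made up for this example.

```r
# Hypothetical reference implementation on plain vectors
# (not polars expressions, not tidypolars API).
mean_with_na_rm <- function(x, na.rm = FALSE) {
  if (isTRUE(na.rm)) {
    # polars-like behaviour: missing values are skipped
    mean(x[!is.na(x)])
  } else if (anyNA(x)) {
    # R-like default: any missing value makes the result missing
    NA_real_
  } else {
    mean(x)
  }
}

x <- c(1, 2, NA)
mean_with_na_rm(x)                # NA
mean_with_na_rm(x, na.rm = TRUE)  # 1.5
```

The polars translation would need to produce the same two results for this input.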

etiennebacher commented 4 weeks ago

library(dplyr, warn.conflicts = FALSE)
library(tidypolars)

df <- tibble(x = c(1, 2, NA))

df |> 
  mutate(y = mean(x))
#> # A tibble: 3 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1    NA
#> 2     2    NA
#> 3    NA    NA

df |> 
  as_polars_df() |> 
  mutate(y = mean(x))
#> shape: (3, 2)
#> ┌──────┬──────┐
#> │ x    ┆ y    │
#> │ ---  ┆ ---  │
#> │ f64  ┆ f64  │
#> ╞══════╪══════╡
#> │ 1.0  ┆ null │
#> │ 2.0  ┆ null │
#> │ null ┆ null │
#> └──────┴──────┘