Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.58k stars 977 forks source link

revdep neonPlantEcology example/vignette failures with new dcast coercion #5980

Closed tdhock closed 4 months ago

tdhock commented 6 months ago

looks like #4586 caused a revdep to start failing checks:

* checking examples ... ERROR
Running examples in 'neonPlantEcology-Ex.R' failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: npe_summary
> ### Title: Get plant biodiversity information for NEON plots
> ### Aliases: npe_summary
> 
> ### ** Examples
> 
> data("D14")
> plot_level <- neonPlantEcology::npe_summary(neon_div_object = D14, scale = "plot")
Loading required namespace: vegan
Error in dcast.data.table(`_DT14`[, .(site, plotID, subplotID, eventID,  : 
  'x' is not atomic
Calls: <Anonymous> ... as_tibble -> dt_eval -> eval_tidy -> dcast -> dcast.data.table
Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in 'inst/doc' ... OK
* checking running R code from vignettes ...
  'neonPlantEcology.Rmd' using 'UTF-8'... OK
  'using_npe.Rmd' using 'UTF-8'... OK
 NONE
* checking re-building of vignette outputs ... ERROR
Error(s) in re-building vignettes:
  ...
--- re-building 'neonPlantEcology.Rmd' using rmarkdown

Quitting from lines 107-122 [nspp] (neonPlantEcology.Rmd)
Error: processing vignette 'neonPlantEcology.Rmd' failed with diagnostics:
'x' is not atomic
--- failed re-building 'neonPlantEcology.Rmd'

--- re-building 'using_npe.Rmd' using rmarkdown

Quitting from lines 107-122 [nspp] (using_npe.Rmd)
Error: processing vignette 'using_npe.Rmd' failed with diagnostics:
'x' is not atomic
--- failed re-building 'using_npe.Rmd'

SUMMARY: processing the following files failed:
  'neonPlantEcology.Rmd' 'using_npe.Rmd'

Error: Vignette re-building failed.
Execution halted
tdhock commented 6 months ago

@MichaelChirico any idea why this may be happening?

MichaelChirico commented 6 months ago

It looks like here's an MRE:

DT <- data.table(a = c(1,2,1), b = 1:3, c = 0)
dcast(DT, a~b, value.var="c", fill=list(0))

## ON master
# Error in dcast.data.table(DT, a ~ b, value.var = "c", fill = list(0)) : 
#   'x' is not atomic

## ON 1.15.2
dcast(DT, a~b, value.var="c", fill=list(0))
# Key: <a>
#        a     1     2     3
#    <num> <num> <num> <num>
# 1:     1     0     0     0
# 2:     2     0     0     0

FYI @hadley it looks like this is coming from a {dtplyr} translation of pivot_wider providing fill=list() to the dcast() call, which is a bit unusual, I wonder if the translation is WAI? Sorry I haven't dived into this aspect any further.

https://github.com/admahood/neonPlantEcology/blob/2dd814157e5804174ded6c022100b1903ca883c2/R/diversity_data_prep.R#L1438-L1447

MichaelChirico commented 6 months ago

@jangorecki should we (1) extend coerceAs() to behave more like coerceVector() for VECSXP input (2) special-case the handling of !isVectorAtomic(thisfill) or (3) make the breaking change not to allow fill = list() to get coerced as before?

tdhock commented 6 months ago

I would have expected some kind of error message like "value.var is numeric but fill is list, so please change fill type to match value.var type"

related: with data.table release it is OK for fill=list when value.var is VECSXP, see below

> (dt=data.table(i=1:2, j=c("a","b"), L=list(4, 5:6), num=7:8))
       i      j      L   num
   <int> <char> <list> <int>
1:     1      a      4     7
2:     2      b    5,6     8
> dcast(dt, i ~ j, value.var="L")
Key: <i>
       i      a      b
   <int> <list> <list>
1:     1      4       
2:     2           5,6
> dcast(dt, i ~ j, value.var="L", fill=NA)
Key: <i>
       i      a      b
   <int> <list> <list>
1:     1      4     NA
2:     2     NA    5,6
> dcast(dt, i ~ j, value.var="L", fill=list(c("miss","ing"),"value"))
Key: <i>
       i        a        b
   <int>   <list>   <list>
1:     1        4 miss,ing
2:     2 miss,ing      5,6

and it is ok for fill to be list with elements of length=1, see below

> dcast(dt, i ~ j, value.var="num")
Key: <i>
       i     a     b
   <int> <int> <int>
1:     1     7    NA
2:     2    NA     8
> dcast(dt, i ~ j, value.var="num", fill=0)
Key: <i>
       i     a     b
   <int> <int> <int>
1:     1     7     0
2:     2     0     8
> dcast(dt, i ~ j, value.var="num", fill=list(0,1))
Key: <i>
       i     a     b
   <int> <int> <int>
1:     1     7     0
2:     2     0     8

but not OK for list element to be a vector of length>1, see below.

> dcast(dt, i ~ j, value.var="num", fill=list(0:1))
Error in dcast.data.table(dt, i ~ j, value.var = "num", fill = list(0:1)) : 
  'list' object cannot be coerced to type 'integer'

also in the above examples it seems that if fill=list with more than one element, only the first element is used, and the others are ignored, so I would have expected a warning in this case, like "fill is a list with 2 elements, but only the first element is used, so please remove the other elements of fill"

MichaelChirico commented 6 months ago

IINM all of the existing behavior is just inherited from base R coercion, c.f.

as.integer(list(0))      # 0L
as.integer(list(0,1))    # 0:1
as.integer(list(0:1))    # 'cannot be coerced' error
as.integer(list(0, 1:2)) # 'cannot be coerced' error
jangorecki commented 6 months ago

@jangorecki should we (1) extend coerceAs() to behave more like coerceVector() for VECSXP input (2) special-case the handling of !isVectorAtomic(thisfill) or (3) make the breaking change not to allow fill = list() to get coerced as before?

It wasn't in its original scope but could be added.

jangorecki commented 6 months ago

If it is regression (as per first post) then it qualifies for patch release.

MichaelChirico commented 6 months ago

If it is regression (as per first post) then it qualifies for patch release.

This is still in dev, not any CRAN release

hadley commented 6 months ago

Sorry, I don't remember enough about this function or dtplyr to offer any useful feedback 😞

MichaelChirico commented 6 months ago

@TysonStanley from {tidyfast} I think you may be most familiar with {tidyr} <-> {data.table} translation, if you wouldn't mind checking how pivot_wider() is being translated to dcast() to see if there's any improvement we can recommend.

TysonStanley commented 6 months ago

Yep, can take a closer look.

TysonStanley commented 6 months ago

A relatively quick review suggests the list option isn't working in dtplyr::pivot_wider() as intended. The way pivot_wider() normally works is you can give the values_fill arg a named list that applies the values differentially to each named variable. However, in the current dtplyr it just uses the first value encountered.

library(data.table)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(dtplyr)
library(tibble)

df <- lazy_dt(tibble(g = c(1, 2), var = c("x", "y"), val = c(1, 2), val2 = c(1, 2)))
dt = as.data.table(df)

# shows where NAs are and what the call looks like
df %>%
  pivot_wider(names_from = var, values_from = val)
#> Source: local data table [2 x 4]
#> Call:   dcast(`_DT1`, formula = g + val2 ~ var, value.var = "val")
#> 
#>       g  val2     x     y
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     1     1    NA
#> 2     2     2    NA     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

# should fill NA values for val with 0 and val2 with 10 but fills both with 0
df %>%
  pivot_wider(names_from = var, values_from = c(val, val2), values_fill = list(val = 0, val2 = 10))
#> Source: local data table [2 x 5]
#> Call:   dcast(`_DT1`, formula = g ~ var, value.var = c("val", "val2"), 
#>     fill = list(val = 0, val2 = 10))
#> 
#>       g val_x val_y val2_x val2_y
#>   <dbl> <dbl> <dbl>  <dbl>  <dbl>
#> 1     1     1     0      1      0
#> 2     2     0     2      0      2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

# does the same thing as above
df %>%
  pivot_wider(names_from = var, values_from = c(val, val2), values_fill = 0)
#> Source: local data table [2 x 5]
#> Call:   dcast(`_DT1`, formula = g ~ var, value.var = c("val", "val2"), 
#>     fill = 0)
#> 
#>       g val_x val_y val2_x val2_y
#>   <dbl> <dbl> <dbl>  <dbl>  <dbl>
#> 1     1     1     0      1      0
#> 2     2     0     2      0      2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

# does the same thing as above
dcast(dt, formula = g ~ var, value.var = c("val", "val2"), fill = 0)
#> Key: <g>
#>        g val_x val_y val2_x val2_y
#>    <num> <num> <num>  <num>  <num>
#> 1:     1     1     0      1      0
#> 2:     2     0     2      0      2

Created on 2024-03-10 with reprex v2.1.0

I assume we could add functionality for a list to work like dplyr::pivot_wider() likely somewhere around here

Can also PR a fix in the documentation for dtplyr if we aren't going to add this functionality.

MichaelChirico commented 6 months ago

Thanks for the investigation! I wondered if that might be what's going on. So, probably {dtplyr} should for now warn if length(values_fill) > 1L, and actually send values_fill[[1L]] (hand-waving away the case where the target type is list where we need to be more careful), since {data.table} only supports one fill= value for the full output.

That said, offering support for multiple fill= values seems reasonable enough, though there's not been any request for that behavior thus far. Filed #5990 as an FR to that end.

tdhock commented 6 months ago

this seems to be fixed now on the revdep server at least. there was a release of dtplyr recently, but its NEWS does not seem to be relevant.

dtplyr 1.3.1
    Fix for failing R CMD check.
    dtplyr no longer directly depends on crayon.

did somebody fix this somehow? should we close?

MichaelChirico commented 6 months ago

there was a release of dtplyr recently

I guess that was a mis-read, last {dtplyr} release was March 22, 2023

MichaelChirico commented 6 months ago

I see a more recent release (late Feb) of {neonPlantEcology}, but also no relevant change I can see at a glance.

tdhock commented 5 months ago

I double checked and I still see the error on my laptop, using data.table master, so I guess this is a revdep check server issue, sorry for the noise.