Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.63k stars 987 forks source link

Merge fails on columns with trailing space in names #5313

Open bwlewis opened 2 years ago

bwlewis commented 2 years ago

Minimal example, tested on data.table Version: 1.14.0 and also 1.14.3 (from GitHub master, GithubSHA1: eed712ef45fd9198de6aa1ac1b672a7347253d18):

library(data.table)
x <- data.frame("a "=1:2, check.names=FALSE)
y <- data.frame("a "=1:2, b=3:4, check.names=FALSE)
merge(x, y, by=names(x))
#   a  b
# 1  1 3
# 2  2 4

merge(data.table(x), y, by=names(x))
# Error in colnamesInt(x, names(on), check_dups = FALSE) : 
# argument specifying columns specify non existing column(s): cols[1]='a'

names(data.table(x))
# [1] "a "

# The merge also fails in the data.table vernacular:
data.table(x)[y,on=names(x)]
# Error in colnamesInt(x, names(on), check_dups = FALSE) : 
#  argument specifying columns received non-existing column(s): cols[1]='a'

R session info:

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Devuan GNU/Linux 3 (beowulf)

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.3

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0   
MichaelChirico commented 2 years ago

Confirmed the bug -- the issue is here:

https://github.com/Rdatatable/data.table/blob/eed712ef45fd9198de6aa1ac1b672a7347253d18/R/data.table.R#L3238

trimws() is there for a case like on = .(a >= a).

I don't have a computer with me to play around with a fix for this... maybe the whitespace should be included in the regex for >= instead?

bwlewis commented 2 years ago

Hmmm. I do understand spaces in names is bad-form but sometimes it is not easy to control.

In the non-equi join list syntax on = .(x >= y) case, I think variable names with spaces should be escaped by back-ticks anyway (that is, a fix by documentation)? Then I wonder if trimws is really necessary there?

MichaelChirico commented 2 years ago

it's necessary in the current form but definitely incorrect. I think that whole function could go for a refactor to operate on the parse tree rather than string representation.