Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 986 forks source link

multiple column update leaves factors -> date as NA #3465

Closed billcgold closed 5 years ago

billcgold commented 5 years ago

Long time data.table user. It is an awesome package, thank you. first time issue post.

library(data.table)
# create dummy dt
nasdaq.data <- data.table (Date = seq(as.Date("1/1/1995",format='%m/%d/%Y'), as.Date("12/31/2005",format='%m/%d/%Y'), "days")
                           , open = runif(n=4018,min=1000,max=2000)
                           , close = runif(n=4018,min=1000,max=2000))
# simulate factor read in from CSV file
nasdaq.data [ , Date := as.Date ( as.character (Date), format='%Y-%m-%d' ) ]

# convert factor to date + create other new variables
nasdaq.data [ , `:=` (  Date = as.Date ( as.character (Date), format='%m/%d/%y' )
                        , Year = year(Date)
                        , Month = month(Date)
                        , Quarter = quarter(Date)
                        , n = 1
) ]

# first column Date contains <NA>
head(nasdaq.data)

The first column (Date) contains values NA

> head(nasdaq.data)
   Date     open    close Year Month Quarter n
1: <NA> 1124.131 1055.144 1995     1       1 1
2: <NA> 1218.287 1856.855 1995     1       1 1
3: <NA> 1105.849 1505.512 1995     1       1 1
4: <NA> 1444.034 1731.447 1995     1       1 1
5: <NA> 1890.109 1877.172 1995     1       1 1
6: <NA> 1427.534 1580.483 1995     1       1 1

Update with the same data and only the factor -> updated of Date works as expected.

# create dummy dt, again
nasdaq.data <- data.table (Date = seq(as.Date("1/1/1995",format='%m/%d/%Y'), as.Date("12/31/2005",format='%m/%d/%Y'), "days")
                           , open = runif(n=4018,min=1000,max=2000)
                           , close = runif(n=4018,min=1000,max=2000))

# simulate factor read in from CSV file, again
nasdaq.data [ , Date := as.factor(Date)]

# convert factor to date, again
nasdaq.data [ , Date := as.Date ( as.character (Date), format='%Y-%m-%d' ) ]

# this update of Date works
head(nasdaq.data)

> head(nasdaq.data)
         Date     open    close
1: 1995-01-01 1066.070 1836.182
2: 1995-01-02 1783.369 1554.454
3: 1995-01-03 1850.504 1316.359
4: 1995-01-04 1763.994 1228.194
5: 1995-01-05 1330.163 1849.493
6: 1995-01-06 1560.824 1362.692

sessionInfo follows

> print(sessionInfo())
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] animation_2.5     jpeg_0.1-8        data.table_1.12.0 bindrcpp_0.2      stringr_1.3.1     dplyr_0.7.4       purrr_0.2.3       readr_1.0.0      
 [9] tidyr_0.7.1       tibble_2.0.1      tidyverse_1.1.1   scales_0.5.0      MLmetrics_1.1.1   lubridate_1.6.0   ggplot2_2.2.1     forecast_8.5     
[17] shiny_1.1.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17      lattice_0.20-35   zoo_1.8-0         assertthat_0.2.0  digest_0.6.15     lmtest_0.9-34     psych_1.7.5       mime_0.5         
 [9] R6_2.2.2          plyr_1.8.4        httr_1.2.1        pillar_1.3.1      rlang_0.3.1       lazyeval_0.2.0    curl_2.8.1        readxl_0.1.1     
[17] rstudioapi_0.6    fracdiff_1.4-2    TTR_0.23-2        DT_0.4            labeling_0.3      foreign_0.8-70    htmlwidgets_1.2   munsell_0.4.3    
[25] broom_0.4.2       compiler_3.4.4    httpuv_1.4.3      modelr_0.1.1      pkgconfig_2.0.2   mnormt_1.5-5      urca_1.3-0        htmltools_0.3.6  
[33] nnet_7.3-12       quadprog_1.5-5    crayon_1.3.4      later_0.7.2       nlme_3.1-137      jsonlite_1.6      xtable_1.8-2      gtable_0.2.0     
[41] magrittr_1.5      quantmod_0.4-10   stringi_1.2.2     reshape2_1.4.3    promises_1.0.1    tseries_0.10-46   timeDate_3012.100 xml2_1.0.0       
[49] xts_0.10-0        tools_3.4.4       forcats_0.2.0     glue_1.2.0        crosstalk_1.0.0   hms_0.2           parallel_3.4.4    yaml_2.1.19      
[57] colorspace_1.3-2  rvest_0.3.2       bindr_0.1         haven_1.1.0    

3/21 minor word edits for clarity

franknarf1 commented 5 years ago

Because the Date column is overwritten so many times, maybe you got lost in the transformations? Retracing your steps in the first code chunk (just for the first row), I see:

library(data.table)
# create dummy dt
nasdaq.data <- data.table (Date = seq(as.Date("1/1/1995",format='%m/%d/%Y'), as.Date("12/31/2005",format='%m/%d/%Y'), "days")
                           , open = runif(n=4018,min=1000,max=2000)
                           , close = runif(n=4018,min=1000,max=2000))

# simplify
(DT <- head(nasdaq.data, 1))
#          Date     open    close
# 1: 1995-01-01 1928.344 1934.387

# do nothing, copying from OP's first chunk
DT[ , Date2 := as.Date ( as.character (Date), format='%Y-%m-%d' ) ][]
#          Date     open    close      Date2
# 1: 1995-01-01 1928.344 1934.387 1995-01-01

# accidentally mal-format, copying from OP's first code chunk
DT[, Date3 := as.Date ( as.character (Date2), format='%m/%d/%y' )][]
#          Date     open    close      Date2 Date3
# 1: 1995-01-01 1928.344 1934.387 1995-01-01  <NA>

If there's something else going on, and it's data.table-specific, it might help to provide a clearer example.

billcgold commented 5 years ago

Frank thanks for the response. I can simplify the code. The issue occurs when multiple columns (Date. Year, Month, Quarter, n) are all updated in one DT command.

DT [ , `:=` (  Date = as.Date ( as.character (Date), format='%m/%d/%y' )
                        , Year = year(Date)
                        , Month = month(Date)
                        , Quarter = quarter(Date)
                        , n = 1 ]
billcgold commented 5 years ago

As suggested by @franknarf1 here is a clearer and simpler reproducible code sample

library(data.table)

DT <- data.table (date = seq(as.Date("1/1/1995",format='%m/%d/%Y'), as.Date("12/31/2005",format='%m/%d/%Y'), "days"))

# simulate factor as read in from CSV file
DT [ , date := as.Date ( as.character (date), format='%Y-%m-%d' ) ]

# convert factor to date + create other new variables
DT [ , `:=` (  date = as.Date ( as.character (date), format='%m/%d/%y' )
                        , year = year(date)

) ]

# Date contains <NA>
head(DT)

> head(DT)

   date year
1: <NA> 1995
2: <NA> 1995
3: <NA> 1995
4: <NA> 1995
5: <NA> 1995
6: <NA> 1995
billcgold commented 5 years ago

reopening, was closed unintentionally

jangorecki commented 5 years ago

I can reproduce same behaviour in base R

date = seq(as.Date("1/1/1995",format='%m/%d/%Y'), as.Date("12/31/2005",format='%m/%d/%Y'), "days")
date = as.Date ( as.character (date), format='%Y-%m-%d' )
as.Date ( as.character (date), format='%m/%d/%y' )

if NAs are the problem for you then I think it is a because of incorrect use of format argument and note there are no factors there, only characters

MichaelChirico commented 5 years ago

I agree with Jan. Please post on StackOverflow if there's anything still unclear, I don't think this is a data.table issue.