config-i1 / greybox

Regression model building and forecasting in R

lmCombine() error when training data gets to a certain size, even with bruteForce = FALSE #33

Closed leungi closed 5 years ago

leungi commented 5 years ago

As per subject; example shown below.

Using greybox_0.4.1.

> dim(data)
[1] 1474   27

> head(data)
          y       x1       x2       x3       x4       x5       x6       x7       x8       x9      x10      x11      x12      x13      x14      x15      x16      x17      x18
1 1.1342020 1031.898 25.33311 169.4480 106.6904 26.70383 2715.493 10.43662 1280.008 84.91353 604.4532 176.3360 192.9604 108.7097 1046.748 34.57609 150.1286 105.3070 37.58345
2 1.0490129 1028.735 29.79364 159.7876 104.9702 25.10849 2712.597 10.97205 1238.919 85.18453 604.2857 170.3515 186.3013 108.7397 1031.898 25.33311 169.4480 106.6904 26.70383
3 1.1446238 1044.398 27.95415 150.3061 104.4126 31.20778 2693.927 10.83171 1204.779 85.84771 592.2333 157.8295 174.7717 107.6142 1028.735 29.79364 159.7876 104.9702 25.10849
4 0.9843351 1038.720 30.00345 145.5703 105.0761 28.00348 2675.840 10.90912 1218.317 87.26625 597.8358 156.0761 171.6895 110.0136 1044.398 27.95415 150.3061 104.4126 31.20778
5 0.9213088 1043.121 30.22741 148.4236 105.0712 29.02609 2656.345 10.97350 1212.080 88.17706 597.5838 159.8943 170.6431 110.4755 1038.720 30.00345 145.5703 105.0761 28.00348
6 0.8701861 1052.820 23.28180 160.5087 104.3836 23.47458 2704.388 10.16553 1287.438 87.92422 595.7473 169.7416 181.6210 106.4576 1043.121 30.22741 148.4236 105.0712 29.02609
       x19      x20      x21      x22      x23      x24      x25      x26
1 2704.123 10.81376 1222.988 79.57888 604.1358 164.6718 186.6565 105.6083
2 2715.493 10.43662 1280.008 84.91353 604.4532 176.3360 192.9604 108.7097
3 2712.597 10.97205 1238.919 85.18453 604.2857 170.3515 186.3013 108.7397
4 2693.927 10.83171 1204.779 85.84771 592.2333 157.8295 174.7717 107.6142
5 2675.840 10.90912 1218.317 87.26625 597.8358 156.0761 171.6895 110.0136
6 2656.345 10.97350 1212.080 88.17706 597.5838 159.8943 170.6431 110.4755

> lmCombine(data[1:50, ], bruteForce = FALSE)
Call:
lmCombine(data = data[1:50, ], bruteForce = FALSE, formula = y ~ 
    .)

Coefficients:
  (Intercept)            x2           x12            x6           x16            x1 
 1.3146239493  0.0225953692  0.0037014722 -0.0004804042  0.0012738700 -0.0004843076 

> lmCombine(data[1:1500, ], bruteForce = FALSE)
Error in cbind(y, model.matrix(cl$formula, data = data[rowsSelected, ])[,  : 
  number of rows of matrices must match (see arg 2)

> lmCombine(data, bruteForce = FALSE)
Error in logLik(ourModel) : object 'ourModel' not found

config-i1 commented 5 years ago

Just to clarify:

  1. Have you tried this on greybox 0.5.0? It was submitted to CRAN a few days ago and should be available soon. The other option would be to install it from GitHub.
  2. Does this happen all the time with any type of data (so, would I be able to reproduce this with randomly generated variables)?
  3. Is there any chance that this has something to do with the fact that you are selecting rows 1:1500 from data that has only 1474 rows?
leungi commented 5 years ago

Thanks for the prompt reply, @config-i1.

My apologies regarding Q3; that was my mistake on the indexing 😅
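
For reference, the row-count error from the `data[1:1500, ]` call can be reproduced in base R without greybox. The sketch below is only an illustration of the likely mechanism, not a trace of the package internals: `small`, `oversized`, `X` and `safe` are hypothetical names, and the 5-row data frame stands in for the 1474-row data set.

```r
# Hypothetical 5-row data frame standing in for the 1474-row data set
set.seed(42)
small <- data.frame(y = rnorm(5), x1 = rnorm(5))

# Asking for rows 1:8 does not fail; rows 6 to 8 come back filled with NA
oversized <- small[1:8, ]
nrow(oversized)                   # 8
sum(!complete.cases(oversized))   # 3

# model.matrix() drops incomplete rows under the default na.action (na.omit),
# so the design matrix ends up shorter than the response from the same data
X <- model.matrix(y ~ ., data = oversized)
nrow(X)                           # 5
length(oversized$y)               # 8

# Clamping the index to the available rows avoids the mismatch
safe <- small[seq_len(min(8, nrow(small))), ]
nrow(safe)                        # 5
```

Binding an 8-element response to a 5-row matrix with `cbind()` would then fail with a "number of rows of matrices must match" error, which is consistent with the message above.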

Regarding Q1: I just upgraded to 0.5.0, but the issue persists.

Reproducible example provided below.

library(greybox)
#> Warning: package 'greybox' was built under R version 3.5.3
#> Package "greybox", v0.5.0 loaded.
url <- 'https://raw.githubusercontent.com/leungi/datasets/master/greybox_debug_data.csv'

data <- readr::read_csv(url)
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#>   .default = col_double()
#> )
#> See spec(...) for full column specifications.

# drop first index column
data <- data[ ,-1]

# check for NA
sum(is.na(data))
#> [1] 0

GreyboxModel <- function(data, model) {
  switch(model,
         stepwise = stepwise(data),
         lmCombine = lmCombine(data, bruteForce = FALSE),
         lmDynamic = lmDynamic(data, bruteForce = FALSE),
         lmCombineBF = lmCombine(data, bruteForce = TRUE),
         lmDynamicBF = lmDynamic(data, bruteForce = TRUE))
}

# OK
GreyboxModel(data, 'stepwise')
#> Call:
#> alm(formula = y ~ x2 + x6 + x3 + x8 + x22 + x5 + x15 + x26 + 
#>     x12 + x14 + x21 + x20 + x9 + x19 + x4 + x13 + x11, data = data, 
#>     distribution = "dnorm")
#> 
#> Coefficients:
#>   (Intercept)            x2            x6            x3            x8 
#>  2.582548e-01  9.186865e-03 -4.949338e-04 -2.322507e-03  1.800258e-04 
#>           x22            x5           x15           x26           x12 
#> -5.668228e-04  7.809159e-03  3.078416e-03  2.716742e-03  8.008150e-04 
#>           x14           x21           x20            x9           x19 
#>  3.360904e-04  5.892103e-06  3.963442e-05  2.614113e-03  2.833173e-04 
#>            x4           x13           x11 
#> -3.453556e-03 -4.095078e-03  4.601871e-03

# OK
GreyboxModel(data, 'lmDynamic')
#> Call:
#> lmDynamic(data = data, bruteForce = FALSE, formula = y ~ .)
#> 
#> Coefficients:
#>   (Intercept)            x1            x2            x3            x4 
#>  2.275600e-01  1.437598e-10  9.056750e-03 -2.242035e-03 -3.163102e-03 
#>            x5            x6            x7            x8            x9 
#>  7.583379e-03 -4.518065e-04  1.844128e-07  1.955973e-04  2.565518e-03 
#>           x10           x11           x12           x13           x14 
#>  9.390505e-10  4.292303e-03  1.049070e-03 -3.451540e-03  3.436896e-04 
#>           x15           x16           x17           x18           x19 
#>  3.153094e-03 -1.651838e-09  4.594412e-05  4.423451e-08  2.403370e-04 
#>           x20           x21           x22           x23           x24 
#>  4.107177e-05 -8.106787e-06 -4.584408e-04 -5.177786e-08  2.545252e-10 
#>           x25           x26 
#>  1.061938e-10  1.910200e-03

# OK
GreyboxModel(data[1:1000,], 'lmCombine')
#> Call:
#> lmCombine(data = data, bruteForce = FALSE, formula = y ~ .)
#> 
#> Coefficients:
#>   (Intercept)            x2           x11            x6            x8 
#>  0.4141842032  0.0103885068  0.0031874725 -0.0004520585  0.0002475515 
#>            x9            x5           x15           x13           x19 
#>  0.0025322978  0.0074380731  0.0031228157 -0.0033085031  0.0002062875

# error
GreyboxModel(data, 'lmCombine')
#> Error in logLik(ourModel): object 'ourModel' not found

# error
GreyboxModel(data[1:1300,], 'lmCombine')
#> Error in logLik(ourModel): object 'ourModel' not found

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] greybox_0.5.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1         urca_1.3-0         nloptr_1.2.1      
#>  [4] compiler_3.5.1     pillar_1.3.1       plyr_1.8.4        
#>  [7] highr_0.8          xts_0.11-2         tseries_0.10-45   
#> [10] tools_3.5.1        digest_0.6.18      nlme_3.1-137      
#> [13] evaluate_0.13      tibble_2.1.1       gtable_0.3.0      
#> [16] lattice_0.20-38    pkgconfig_2.0.2    rlang_0.3.4       
#> [19] curl_3.3           yaml_2.2.0         parallel_3.5.1    
#> [22] xfun_0.5.2         stringr_1.4.0      dplyr_0.8.0.1     
#> [25] knitr_1.22         hms_0.4.2          uroot_2.0-9       
#> [28] lmtest_0.9-36      grid_3.5.1         nnet_7.3-12       
#> [31] forecast_8.4       tidyselect_0.2.5   glue_1.3.1        
#> [34] R6_2.4.0           lamW_1.3.0         rmarkdown_1.12    
#> [37] readr_1.3.1        TTR_0.23-4         ggplot2_3.1.0     
#> [40] purrr_0.3.2        magrittr_1.5       scales_1.0.0      
#> [43] htmltools_0.3.6    quantmod_0.4-13    assertthat_0.2.1  
#> [46] timeDate_3043.102  colorspace_1.4-1   numDeriv_2016.8-1 
#> [49] fracdiff_1.4-2     quadprog_1.5-5     stringi_1.4.3     
#> [52] RcppParallel_4.4.2 lazyeval_0.2.2     munsell_0.5.0     
#> [55] crayon_1.3.4       zoo_1.8-4

Created on 2019-04-23 by the reprex package (v0.2.1)
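
To narrow down the sample size at which the failure starts (it works at 1000 rows and fails at 1300 in the output above), one option is to probe several subset sizes and collect the error messages. This is a rough sketch under stated assumptions: `probe_sizes` and `toy_fit` are hypothetical helpers, and `toy_fit` only imitates the failure so that the code runs without greybox installed.

```r
# Hypothetical helper: run a fitting function on the first n rows for each n
# and record either "ok" or the error message it raised.
probe_sizes <- function(fit, data, sizes) {
  outcome <- vapply(sizes, function(n) {
    tryCatch({
      fit(data[seq_len(n), , drop = FALSE])
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1))
  data.frame(n = sizes, outcome = outcome, stringsAsFactors = FALSE)
}

# Stand-in fit that fails above 1200 rows, purely so the sketch is runnable;
# the 1200 threshold is made up for illustration.
toy_fit <- function(d) {
  if (nrow(d) > 1200) stop("object 'ourModel' not found")
  lm(y ~ ., data = d)
}

set.seed(1)
toy <- data.frame(y = rnorm(1474), x1 = rnorm(1474))
result <- probe_sizes(toy_fit, toy, c(1000, 1300, 1474))
print(result)
```

With greybox loaded, `toy_fit` could be swapped for `function(d) lmCombine(d, bruteForce = FALSE)` and `toy` for the data set from the reprex.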

config-i1 commented 5 years ago

Thanks for the detailed explanation! This is now fixed in v0.5.041002 on GitHub, in commit 78bfb9e24a987b9841bad37d3edb091fde4ac963.

leungi commented 5 years ago

Awesome; thanks for the quick fix! 👍