ShichenXie / scorecard

Scorecard Development in R, 评分卡
http://shichen.name/scorecard

weird results of the `woebin_ply()` function #23

Closed Leo-Lee15 closed 5 years ago

Leo-Lee15 commented 5 years ago

Hi, I recently ran into a serious problem with the woebin_ply() function. Here is my code:

model_woe_set <- woebin_ply(select(mod_data, -user, -creation_date), bins = model_woe, print_step = 1)

The output in the RStudio console is:

[INFO] Woe transformating on 88120 rows and 904 columns in 00:05:08

However, when I inspect the data.frame model_woe_set, I get the following results,

model_woe_set %>% dim()
[1] 88120 89023

Furthermore, the column names of the model_woe_set data.frame become the following:

[1] "mon_woe"                          "age_woe"                                        
......
[961] "V962"                                                   "V963"                                                   "V964"                                                  
 [964] "V965"                                                   "V966"                                                   "V967"                                                  
 [967] "V968"                                                   "V969"                                                   "V970"                                                  
 [970] "V971"                                                   "V972"                                                   "V973"                                                  
 [973] "V974"                                                   "V975"                                                   "V976"                                                  
 [976] "V977"                                                   "V978"                                                   "V979"                                                  
 [979] "V980"                                                   "V981"                                                   "V982"                                                  
 [982] "V983"                                                   "V984"                                                   "V985"                                                  
 [985] "V986"                                                   "V987"                                                   "V988"                                                  
 [988] "V989"                                                   "V990"                                                   "V991"                                                  
 [991] "V992"                                                   "V993"                                                   "V994"                                                  
 [994] "V995"                                                   "V996"                                                   "V997"                                                  
 [997] "V998"                                                   "V999"                                                   "V1000"                                                 
[1000] "V1001"                                                 
 [ reached getOption("max.print") -- omitted 88023 entries ]

Materializing model_woe_set then crashes RStudio, presumably because there is not enough memory.
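
One way to inspect such a result without printing it in full is to compare its shape with the input and list only the unexpected columns. The sketch below is an illustration under stated assumptions, not part of scorecard: `check_woe_shape()` is a hypothetical helper, and it assumes the spurious columns follow the `V<number>` pattern seen above (the default names data.table gives to unnamed columns). The toy tables stand in for the real data.

```r
# Hypothetical helper (not part of scorecard): summarise the shape of a
# WOE-transformed table without printing or viewing the whole object.
check_woe_shape <- function(input, output) {
  out_names <- names(output)
  # Columns named like V962, V963, ...: auto-generated names, assumed spurious.
  auto_named <- grep("^V[0-9]+$", out_names, value = TRUE)
  list(
    n_col_input      = ncol(input),
    n_col_output     = ncol(output),
    n_extra          = ncol(output) - ncol(input),
    n_auto_named     = length(auto_named),
    first_auto_named = head(auto_named, 5)
  )
}

# Toy data standing in for the real tables, which cannot be shared.
toy_input  <- data.frame(mon = 1:3, age = c(20, 30, 40))
toy_output <- data.frame(mon_woe = c(0.1, 0.2, 0.3),
                         age_woe = c(-0.1, 0, 0.1),
                         V3 = 1:3, V4 = 4:6)
res <- check_woe_shape(toy_input, toy_output)
str(res)
```

Applied to the real objects, the call would presumably look like `check_woe_shape(select(mod_data, -user, -creation_date), model_woe_set)`; it only reads column names and counts, so it should stay cheap even when the output is very wide.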

All in all, this problem is very strange. Sorry, I cannot provide a minimal reproducible example because the data cannot be shared.

My session info,

sessioninfo::session_info()
- Session info --------------------------------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                                              
 version  R version 3.5.3 (2019-03-11)                       
 os       Windows 7 x64 SP 1                                 
 system   x86_64, mingw32                                    
 ui       RStudio                                            
 language (EN)                                               
 collate  Chinese (Simplified)_People's Republic of China.936
 ctype    Chinese (Simplified)_People's Republic of China.936
 tz       Asia/Taipei                                        
 date     2019-03-31                                         

- Packages ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 package       * version    date       lib source                               
 assertthat      0.2.1      2019-03-21 [1] CRAN (R 3.5.3)                       
 backports       1.1.3      2018-12-14 [1] CRAN (R 3.5.1)                       
 bit             1.1-14     2018-05-29 [1] CRAN (R 3.5.0)                       
 bit64           0.9-7      2017-05-08 [1] CRAN (R 3.5.0)                       
 blob            1.1.1      2018-03-25 [1] CRAN (R 3.5.1)                       
 broom           0.5.1      2018-12-05 [1] CRAN (R 3.5.1)                       
 cellranger      1.1.0      2016-07-27 [1] CRAN (R 3.5.1)                       
 cli             1.1.0      2019-03-19 [1] CRAN (R 3.5.3)                       
 clipr         * 0.5.0      2019-01-11 [1] CRAN (R 3.5.2)                       
 codetools       0.2-16     2018-12-24 [1] CRAN (R 3.5.3)                       
 colorspace      1.4-1      2019-03-18 [1] CRAN (R 3.5.2)                       
 crayon          1.3.4      2017-09-16 [1] CRAN (R 3.5.1)                       
 data.table      1.12.0     2019-01-13 [1] CRAN (R 3.5.3)                       
 DBI             1.0.0      2018-05-02 [1] CRAN (R 3.5.1)                       
 digest          0.6.18     2018-10-10 [1] CRAN (R 3.5.1)                       
 doParallel      1.0.14     2018-09-24 [1] CRAN (R 3.5.1)                       
 dplyr         * 0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)                       
 DT              0.5        2018-11-05 [1] CRAN (R 3.5.1)                       
 forcats       * 0.4.0      2019-02-17 [1] CRAN (R 3.5.2)                       
 foreach         1.4.4      2017-12-12 [1] CRAN (R 3.5.1)                       
 furrr           0.1.0      2018-05-16 [1] CRAN (R 3.5.1)                       
 future          1.12.0     2019-03-08 [1] CRAN (R 3.5.3)                       
 generics        0.0.2      2018-11-29 [1] CRAN (R 3.5.1)                       
 ggplot2       * 3.1.0      2018-10-25 [1] CRAN (R 3.5.1)                       
 globals         0.12.4     2018-10-11 [1] CRAN (R 3.5.1)                       
 glue            1.3.1      2019-03-12 [1] CRAN (R 3.5.3)                       
 gridExtra       2.3        2017-09-09 [1] CRAN (R 3.5.1)                       
 gtable          0.3.0      2019-03-25 [1] CRAN (R 3.5.3)                       
 haven           2.1.0      2019-02-19 [1] CRAN (R 3.5.2)                       
 hms             0.4.2.9001 2018-09-04 [1] Github (tidyverse/hms@979286f)       
 htmltools       0.3.6.9003 2018-12-11 [1] Github (rstudio/htmltools@99a78d0)   
 htmlwidgets     1.3        2018-09-30 [1] CRAN (R 3.5.1)                       
 httr            1.4.0      2018-12-11 [1] CRAN (R 3.5.1)                       
 iterators       1.0.10     2018-07-13 [1] CRAN (R 3.5.1)                       
 janitor       * 1.1.1      2018-07-31 [1] CRAN (R 3.5.1)                       
 jsonlite        1.6        2018-12-07 [1] CRAN (R 3.5.1)                       
 lattice         0.20-38    2018-11-04 [1] CRAN (R 3.5.3)                       
 lazyeval        0.2.2      2019-03-15 [1] CRAN (R 3.5.3)                       
 listenv         0.7.0      2018-01-21 [1] CRAN (R 3.5.1)                       
 lubridate       1.7.4      2018-04-11 [1] CRAN (R 3.5.1)                       
 magrittr        1.5        2014-11-22 [1] CRAN (R 3.5.1)                       
 modelr          0.1.4      2019-02-18 [1] CRAN (R 3.5.2)                       
 munsell         0.5.0      2018-06-12 [1] CRAN (R 3.5.1)                       
 nlme            3.1-137    2018-04-07 [1] CRAN (R 3.5.3)                       
 odbc          * 1.1.6      2018-06-09 [1] CRAN (R 3.5.1)                       
 openxlsx        4.1.0      2018-05-26 [1] CRAN (R 3.5.1)                       
 patchwork     * 0.0.1      2018-09-04 [1] Github (thomasp85/patchwork@7fb35b1) 
 pillar          1.3.1      2018-12-15 [1] CRAN (R 3.5.1)                       
 pkgconfig       2.0.2      2018-08-16 [1] CRAN (R 3.5.1)                       
 plyr            1.8.4      2016-06-08 [1] CRAN (R 3.5.1)                       
 ppdai         * 0.1.2      2018-11-11 [1] local                                
 ppdai.extra   * 0.2.3.9999 2019-03-13 [1] local                                
 purrr         * 0.3.2      2019-03-15 [1] CRAN (R 3.5.3)                       
 qs              0.14.1     2019-03-02 [1] CRAN (R 3.5.3)                       
 R6              2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                       
 RApiSerialize   0.1.0      2014-04-19 [1] CRAN (R 3.5.2)                       
 Rcpp            1.0.1      2019-03-17 [1] CRAN (R 3.5.3)                       
 readr         * 1.3.1      2018-12-21 [1] CRAN (R 3.5.1)                       
 readxl          1.3.1      2019-03-13 [1] CRAN (R 3.5.3)                       
 rlang           0.3.3      2019-03-29 [1] CRAN (R 3.5.3)                       
 rstudioapi      0.10       2019-03-19 [1] CRAN (R 3.5.3)                       
 rvest           0.3.2      2016-06-17 [1] CRAN (R 3.5.1)                       
 scales          1.0.0      2018-08-09 [1] CRAN (R 3.5.1)                       
 scorecard     * 0.2.4      2019-03-29 [1] Github (ShichenXie/scorecard@5b45fb8)
 sessioninfo     1.1.1      2018-11-05 [1] CRAN (R 3.5.1)                       
 stringi         1.4.3      2019-03-12 [1] CRAN (R 3.5.3)                       
 stringr       * 1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                       
 tibble        * 2.1.1      2019-03-16 [1] CRAN (R 3.5.3)                       
 tidyr         * 0.8.3      2019-03-01 [1] CRAN (R 3.5.2)                       
 tidyselect      0.2.5      2018-10-11 [1] CRAN (R 3.5.1)                       
 tidyverse     * 1.2.1      2017-11-14 [1] CRAN (R 3.5.3)                       
 withr           2.1.2      2018-03-15 [1] CRAN (R 3.5.1)                       
 writexl         1.1        2018-12-02 [1] CRAN (R 3.5.1)                       
 xml2            1.2.0      2018-01-24 [1] CRAN (R 3.5.1)                       
 yaml            2.2.0      2018-07-25 [1] CRAN (R 3.5.1)                       
 zip             2.0.1      2019-03-11 [1] CRAN (R 3.5.3)                       

[1] C:/Program Files/R/R-3.5.3/library

Hoping for a quick fix. Thanks!

ShichenXie commented 5 years ago

If it can't be reproduced, there is nothing I can do to fix it.

Leo-Lee15 commented 5 years ago

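Today I found that the problem comes from a single variable: dropping that variable entirely before applying the bin mapping makes everything work. I suspect it may be an issue in the data.table package, because applying the bin mapping to that variable on its own throws an error right away. Thanks anyway!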
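
A per-variable check of this kind can be scripted rather than done by hand. The following is a minimal sketch, not scorecard's own API: `isolate_bad_vars()` is a hypothetical helper, `ply_fun` is assumed to behave like `scorecard::woebin_ply(dt, bins)`, `bins` is assumed to be a list named by variable (as `woebin()` returns), and a healthy variable is assumed to yield exactly one output column. The toy `fake_ply()` only imitates a failing and a misbehaving variable.

```r
# Hypothetical helper: apply a WOE-mapping function one variable at a time and
# report the variables that either raise an error or return more than one column.
isolate_bad_vars <- function(dt, bins, ply_fun) {
  vars <- intersect(names(bins), names(dt))
  if (length(vars) == 0L) return(NULL)
  rows <- lapply(vars, function(v) {
    # Build a one-column data.frame; `[[` works for data.frame and data.table.
    one_col <- setNames(data.frame(dt[[v]], stringsAsFactors = FALSE), v)
    res <- tryCatch(ply_fun(one_col, bins[v]), error = function(e) e)
    failed <- inherits(res, "error")
    data.frame(
      variable = v,
      error    = if (failed) conditionMessage(res) else NA_character_,
      n_col    = if (failed) NA_integer_ else ncol(res),
      stringsAsFactors = FALSE
    )
  })
  report <- do.call(rbind, rows)
  # Keep only suspicious variables: an error, or not exactly one output column.
  report[!is.na(report$error) | report$n_col != 1L, , drop = FALSE]
}

# Toy stand-ins: "b" errors, "c" returns an extra unnamed-style column.
toy_dt   <- data.frame(a = 1:3, b = 4:6, c = 7:9)
toy_bins <- list(a = 1, b = 2, c = 3)
fake_ply <- function(dt, bins) {
  v <- names(dt)
  if (v == "b") stop("boom")
  if (v == "c") return(data.frame(c_woe = 1:3, V2 = 1:3))
  data.frame(a_woe = 1:3)
}
report <- isolate_bad_vars(toy_dt, toy_bins, fake_ply)
print(report)
```

With the real objects the call would presumably be `isolate_bad_vars(mod_data, model_woe, scorecard::woebin_ply)`; whether `woebin_ply()` accepts a one-element bins list in every version is an assumption worth checking first.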

ShichenXie commented 5 years ago

You could send me that column (x) together with y and the bins so I can take a look.

Leo-Lee15 commented 5 years ago

I've been too busy at work these past few days; I was just planning to send them to you this weekend 😂😂