AdrianAntico / AutoQuant

R package for automation of machine learning, forecasting, model evaluation, and model interpretation
GNU Affero General Public License v3.0

Model Fails to Build AutoBanditSarima #63

Closed: spsanderson closed this issue 4 years ago

spsanderson commented 4 years ago
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RemixAutoML_0.2.4          feasts_0.1.4               fable_0.2.1               
 [4] fabletools_0.2.0           funModeling_1.9.4          Hmisc_4.4-0               
 [7] Formula_1.2-3              forecast_8.12              caret_6.0-86              
[10] lattice_0.20-41            sweep_0.2.2                tibbletime_0.1.5          
[13] ggridges_0.5.2             anomalize_0.2.1            timetk_2.1.0              
[16] forcats_0.5.0              stringr_1.4.0              dplyr_1.0.0               
[19] purrr_0.3.4                readr_1.3.1                tidyr_1.1.0               
[22] tibble_3.0.1               ggplot2_3.3.2              tidyverse_1.3.0           
[25] gamlss.add_5.1-6           rpart_4.1-15               nnet_7.3-14               
[28] mgcv_1.8-31                gamlss_5.1-6               nlme_3.1-148              
[31] gamlss.dist_5.1-6          gamlss.data_5.1-4          fitdistrplus_1.1-1        
[34] survival_3.1-12            MASS_7.3-51.6              patchwork_1.0.1           
[37] tsibble_0.9.1              tidyquant_1.0.1            quantmod_0.4.17           
[40] TTR_0.23-6                 PerformanceAnalytics_2.0.4 xts_0.12-0                
[43] zoo_1.8-8                  lubridate_1.7.9            pacman_0.5.1              

loaded via a namespace (and not attached):
 [1] readxl_1.3.1         backports_1.1.7      fastmatch_1.1-0      plyr_1.8.6          
 [5] lazyeval_0.2.2       entropy_1.2.1        digest_0.6.25        foreach_1.5.0       
 [9] htmltools_0.5.0      arules_1.6-6         fansi_0.4.1          magrittr_1.5        
[13] checkmate_2.0.0      cluster_2.1.0        doParallel_1.0.15    ROCR_1.0-11         
[17] recipes_0.1.13       modelr_0.1.8         gower_0.2.2          anytime_0.3.7       
[21] tseries_0.10-47      jpeg_0.1-8.1         colorspace_1.4-1     blob_1.2.1          
[25] install.load_1.2.3   rvest_0.3.5          haven_2.3.1          xfun_0.15           
[29] crayon_1.3.4         jsonlite_1.7.0       iterators_1.0.12     glue_1.4.1          
[33] gtable_0.3.0         ipred_0.9-9          distributional_0.1.0 Quandl_2.10.0       
[37] scales_1.1.1         DBI_1.1.0            Rcpp_1.0.4.6         htmlTable_2.0.1     
[41] foreign_0.8-80       stats4_4.0.2         lava_1.6.7           prodlim_2019.11.13  
[45] htmlwidgets_1.5.1    httr_1.4.1           RColorBrewer_1.1-2   acepack_1.4.1       
[49] ellipsis_0.3.1       pkgconfig_2.0.3      farver_2.0.3         dbplyr_1.4.4        
[53] utf8_1.1.4           tidyselect_1.1.0     labeling_0.3         rlang_0.4.6         
[57] reshape2_1.4.4       munsell_0.5.0        cellranger_1.1.0     tools_4.0.2         
[61] cli_2.0.2            generics_0.0.2       moments_0.14         broom_0.5.6         
[65] ModelMetrics_1.2.2.2 knitr_1.29           fs_1.4.2             pander_0.6.3        
[69] packrat_0.5.0        xml2_1.3.2           compiler_4.0.2       rstudioapi_0.11     
[73] curl_4.3             png_0.1-7            reprex_0.3.0         stringi_1.4.6       
[77] Matrix_1.2-18        urca_1.3-0           vctrs_0.3.1          pillar_1.4.4        
[81] lifecycle_0.2.0      lmtest_0.9-37        data.table_1.12.8    R6_2.4.1            
[85] latticeExtra_0.6-29  gridExtra_2.3        codetools_0.2-16     assertthat_0.2.1    
[89] withr_2.2.0          fracdiff_1.5-1       hms_0.5.3            quadprog_1.5-8      
[93] grid_4.0.2           timeDate_3043.102    class_7.3-17         prophet_0.6.1       
[97] pROC_1.16.2          base64enc_0.1-3   

I issued the following commands in order:

install.load::install_load(
  "tidyquant"
  ,"timetk"
  , "tibbletime"
  , "tsibble"
  , "sweep"
  , "anomalize"
  , "caret"
  , "forecast"
  , "funModeling"
  # , "xts"
  # , "fpp"
  , "lubridate"
  , "tidyverse"
  # , "urca"
  # , "prophet"
  , "fable"
  , "feasts"
  , "RemixAutoML"
)

# Data ----
url <- "https://cci30.com/ajax/getIndexHistory.php"
destfile <- "data/cci30_OHLCV.csv"
download.file(url, destfile = destfile)
df <- read.csv("data/cci30_OHLCV.csv")
class(df)

# Get month end of file - last day of previous month
# Format Date ####
df$Date <- ymd(df$Date)
df <- df %>%
  mutate(month_start = floor_date(Date, unit = "month") - period(1, units = "day"))

df_tbl <- as_tsibble(df, index = Date) %>%
  filter(Date <= max(month_start)) %>%
  select(Date, Open, High, Low, Close, Volume)

# Coerce df to tibble ####
df_tbl <- as_tibble(df_tbl)
featurePlot(
  x = df_tbl[,c("Open","High","Low","Volume")]
  , y = df_tbl$Close
  , plot = "pairs"
  , auto.key = list(columns = 4)
  , na.action = na.omit
)

# Time Parameter ----
time_param <- "weekly"

# Make a log returns of close object
df.ts <- df_tbl %>%
  tq_transmute(
    select = Close
    , periodReturn
    , period = time_param
    , type = "log"
    , col_rename = str_c(str_to_title(time_param),"Log_Returns", sep = "_")
  )

> AutoBanditSarima(data = df.ts, TargetVariableName = "Weekly_Log_Returns", DateColumnName = "Date")
[1] "Model was not able to be built"

Data attached

weekly_log_returns.xlsx

AdrianAntico commented 4 years ago

@spsanderson That's quite the dependency list. You should tell Dancho to set up his functions so that users don't have to load all the extra libraries; I remember a CRAN maintainer yelling at me over something similar. Nonetheless, I couldn't get all of your functions to run, so I'm kind of at a loss. Can you let me know what the final function, tq_transmute(), is doing? That was the only one I couldn't figure out how to run. Here's the code I ran up to the error:

install.load::install_load(
  "tidyquant"
  ,"timetk"
  , "tibbletime"
  , "tsibble"
  , "sweep"
  , "anomalize"
  , "caret"
  , "forecast"
  , "funModeling"
  # , "xts"
  # , "fpp"
  , "lubridate"
  , "tidyverse"
  # , "urca"
  # , "prophet"
  , "fable"
  , "feasts"
  , "RemixAutoML"
)

# Data ----
url <- "https://cci30.com/ajax/getIndexHistory.php"
destfile <- "data/cci30_OHLCV.csv"
data <- data.table::fread(url)
class(data)

# Get month end of file - last day of previous month
# Format Date ####
# library(magrittr)
# df$Date <- lubridate::ymd(df$Date)
# df <- df %>% dplyr::mutate(month_start = floor_date(Date, unit = "month") - period(1, units = "day"))
data.table::set(data, j = "Date", value = as.Date(data$Date))
data[, month_start := lubridate::floor_date(x = Date, unit = "month")][, month_start := month_start - lubridate::days(1)]

# df_tbl <- tsibble::as_tsibble(df, index = Date) %>%
#   filter(Date <= max(month_start)) %>%
#   select(Date, Open, High, Low, Close, Volume)
data <- data[Date <= max(month_start), .SD, .SDcols = names(data)[1:(ncol(data) - 1L)]]

# Coerce df to tibble ####
# df_tbl <- as_tibble(df_tbl)
# featurePlot(
#   x = df_tbl[,c("Open","High","Low","Volume")]
#   , y = df_tbl$Close
#   , plot = "pairs"
#   , auto.key = list(columns = 4)
#   , na.action(na.omit)
# )

caret::featurePlot(
  x = data[, c("Open", "High", "Low", "Volume")]  # same columns as the original call
  , y = data$Close
  , plot = "pairs"
  , auto.key = list(columns = 4)
  , na.action = na.omit
)

# Time Parameter ----
time_param <- "weekly"

# Make a log returns of close object
# df.ts <- df_tbl %>%
#   tq_transmute(
#     select = Close
#     , periodReturn
#     , period = time_param
#     , type = "log"
#     , col_rename = str_c(str_to_title(time_param),"Log_Returns", sep = "_")
#   )

data <- dplyr::as_tibble(data)

#### I was forced to load these to attempt to run the function below
library(zoo)
library(xts)
library(quantmod)
library(PerformanceAnalytics)
install.packages("time_param");library(time_param)
##### Warning in install.packages : package ‘time_param’ is not available (for R version 4.0.0)

#### Let me know what this does and maybe I can replicate it real quick
data <- data %>%
  tidyquant::tq_transmute(
    select = Close
    , periodReturn
    , period = time_param
    , type = "log"
    , col_rename = str_c(str_to_title(time_param),"Log_Returns", sep = "_")
  )

# I didn't look into this yet...
RemixAutoML::AutoBanditSarima(data = df.ts, TargetVariableName = "Weekly_Log_Returns", DateColumnName = "Date")
AdrianAntico commented 4 years ago

@spsanderson Have you tried filling out all the function arguments for AutoBanditSarima()? That would be where I would start.
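
For reference, a fully specified call looks roughly like the sketch below. The argument names are from memory of the README and may not match the version you have installed, so verify with args(RemixAutoML::AutoBanditSarima) or ?AutoBanditSarima before running:

# Sketch only: argument names/values are illustrative, taken from memory
# of the README; check args(RemixAutoML::AutoBanditSarima) against your
# installed version before running.
Output <- RemixAutoML::AutoBanditSarima(
  data = df.ts,
  TargetVariableName = "Weekly_Log_Returns",
  DateColumnName = "Date",
  TimeAggLevel = "week",          # match the weekly aggregation
  EvaluationMetric = "MAE",
  NumHoldOutPeriods = 12L,        # backtest window
  NumFCPeriods = 12L,             # forecast horizon
  MaxLags = 5L,
  MaxSeasonalLags = 0L,
  MaxMovingAverages = 5L,
  MaxSeasonalMovingAverages = 0L,
  MaxFourierPairs = 2L,
  TrainWeighting = 0.50,
  MaxConsecutiveFails = 12L,
  MaxNumberModels = 100L,
  MaxRunTimeMinutes = 10L
)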

spsanderson commented 4 years ago

time_param is a variable that equals "weekly"; I thought I posted that:

time_param <- "weekly"


spsanderson commented 4 years ago

No, I did not fill out all the params for the function; I will try that.

AdrianAntico commented 4 years ago

@spsanderson The issue was getting the function tq_transmute() to run. I think it required me to upgrade to a different R version. Nevertheless, I'm not sure what the output should look like after I run that function. Is there a way to upload that data somewhere for me to download?

spsanderson commented 4 years ago

The data I uploaded is the final data: weekly_log_returns.xlsx.

When I run that data through AutoBanditSarima() the model fails to build; maybe there are not enough data points. When I run the model on the data without any mutation, a model builds. The tq_transmute() call computes the log return of the index, aggregated by week.
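
If it helps you replicate without tidyquant, here's a rough data.table equivalent, a sketch only: it takes the last Close of each calendar week, so week endpoints may not line up exactly with quantmod::periodReturn(period = "weekly"). It assumes the `data` data.table from your fread() call above:

# Sketch: weekly log returns from the daily Close, data.table style.
library(data.table)
library(lubridate)

weekly <- data[order(Date),
               .(Date = last(Date), Close = last(Close)),   # last close of each week
               by = .(Week = floor_date(Date, unit = "week"))]
weekly[, Weekly_Log_Returns := log(Close) - shift(log(Close))]  # log return vs prior week
weekly <- weekly[!is.na(Weekly_Log_Returns), .(Date, Weekly_Log_Returns)]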

AdrianAntico commented 4 years ago

@spsanderson I was able to run your data through, and I did find a glitch. The best model found came directly from a default auto.arima(), which hadn't really happened before. It called attention to some downstream code that assumed Fourier terms were being used when they in fact weren't (that only happens when the default auto.arima() produces the winner, which is not typically the case). I made a fix for that. I am still getting the message that no suitable model was found, so there's some more digging for me to do. I would like to note that financial asset pricing data is probably not the best dataset for testing time series models. I would imagine you'd want to include other variables in the model, and when that is the case my guess is that machine learning models will do a better job. Ideally, for measuring the efficacy of time series models, you want several data sets, some with trend and some without, but all with patterns that make time series models suitable. Just my two cents.

AdrianAntico commented 4 years ago

@spsanderson FYI - I was able to run your data set without error. Feel free to reinstall and give it another attempt.

spsanderson commented 4 years ago

Glad you were able to find the error. For this data, ARIMA actually isn't horrible; it's typically not the best performer, but it's not horrible. The data is aggregated at the weekly level and consists of the log returns of an index, so we're not really forecasting the price, but more or less whether the future log returns will be positive or negative. Luckily, the density of the log returns is fairly normal.
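
A quick way to eyeball that normality claim (a sketch; it assumes the attached weekly_log_returns.xlsx has been read into a data frame `ret`):

# Sketch: compare the weekly log return density to a fitted normal.
# Assumes ret was read from the attached xlsx, e.g.
# ret <- readxl::read_excel("weekly_log_returns.xlsx")
r <- ret$Weekly_Log_Returns
hist(r, breaks = 40, probability = TRUE,
     main = "Weekly log returns", xlab = "Log return")
curve(dnorm(x, mean = mean(r), sd = sd(r)), add = TRUE, lwd = 2)
qqnorm(r); qqline(r)  # heavy tails show up as curvature at the ends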


AdrianAntico commented 4 years ago

From a financial perspective, time series models are essentially technical analysis versus fundamental analysis. To me, the best way to beat the market would be through insider trading (I know: unless you are in Congress, you aren't doing it). But it does lead to interesting questions about the type of information that drives price changes. Given that, if I really wanted to build a model for statistical arbitrage, I would want to incorporate other variables into the model. Nonetheless, I have been using data sets from the fpp package and the fpp2 package to analyze model performance and feature upgrades to existing models. I'm pretty sure you know the book "Forecasting: Principles and Practice" (https://otexts.com/fpp2/), which Rob J Hyndman wrote; he uses data from those packages throughout the book. I also like the Walmart data set for testing forecasting models that can build forecasts by grouping variables, so I can see if there is merit to running a single model versus generating a separate model for each grouping level.

In terms of the error, the fix was to include a simple tryCatch around an if statement in one of the subfunctions that gets called. Apparently, NextGrid doesn't always exist at the point where it's referenced.

# Define lambda ----
if(run != 1L) {
  tryCatch({if(NextGrid$BoxCox[1L] == "skip") {
    lambda <- NULL
  } else {
    lambda <- "auto"
  }}, error = function(x) lambda <- NULL)
}
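
One subtlety worth noting with that pattern: the `lambda <- NULL` inside the error handler only assigns within the handler's own scope, so on error lambda may still be unset. A sketch of an equivalent that assigns the tryCatch() result directly (not necessarily what the package's final code does):

# Sketch of an alternative: assign the tryCatch() result, so the
# error branch actually sets lambda (a handler-local `<-` does not).
if (run != 1L) {
  lambda <- tryCatch(
    if (NextGrid$BoxCox[1L] == "skip") NULL else "auto",
    error = function(e) NULL
  )
}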
spsanderson commented 4 years ago

Easy fix, and yes, I know the book; it's basically the Bible.
