kennis222 commented 2 years ago

Hello contributors,

When I use a multivariate time series model such as VAR or SARIMAX, fit it, and print model.summary(), the output suggests that the model performed feature selection automatically. However, after reading the source code, such as build_var and __init__, I could not figure out which method the package uses to select features. For example, with 6 time series as inputs, the model summary only displays the relationship between one input and the dependent variable (in other words, the output).

Does the package implement any feature selection when using time series models such as VAR or SARIMAX? If the answer is "Yes", where is it implemented? Thank you.

AutoViML commented 2 years ago

Hi @kennis222 👍 Could you please illustrate by means of a couple of screenshots as to what you are inputting and what you are getting back? That would be really helpful to understand what you mean here. AutoViML

kennis222 commented 2 years ago

For example, I used the AirQualityUCI dataset and set the target variable to CO(GT). After fitting the model, I printed the model summary, which you can see in the screenshots below. The dependent variables displayed were only 'CO(GT)' and 'NO2(GT)'. It seemed that the model did the feature selection automatically, but I was not sure. I read the source code, such as build_var and __init__, but I could not figure out which method the package uses to select features.

[Screenshots: Screen Shot 2022-01-14 at 3 57 29 PM, Screen Shot 2022-01-14 at 3 58 14 PM]

AutoViML commented 2 years ago

Hi @kennis222 👍 You are correct - the VAR model does feature selection automatically - it is not something that is encoded. It is part of the VAR modeling process. I hope that this screenshot clarifies how it works.

You can see that it selects the best variable automatically here

I hope this answers your question. If so, please close the issue. Thanks AutoViML

kennis222 commented 2 years ago

Hi contributors, I tried using the statsmodels VARMAX model directly, as the source code does. However, the results displayed are not the same as the results from the package.

[Screenshot: Screen Shot 2022-01-17 at 1 53 34 PM]

Let me provide the code I used for this example.

    import statsmodels.api as sm
    from auto_ts import auto_timeseries
    import pandas as pd
    import numpy as np

    data = pd.read_excel("AirQualityUCI.xlsx")
    data.drop(['Unnamed: 15', 'Unnamed: 16'], axis=1, inplace=True)
    df = data.groupby(['Date']).mean().reset_index()
    df['Date'] = df['Date'].astype('str')
    length = int(len(df) * 0.9)
    train_data = df[:length]
    test_data = df[length:]
    print(train_data.shape)
    print(test_data.shape)

    # Auto_TS
    ts_column = 'Date'
    target = 'CO(GT)'
    model = auto_timeseries(score_type='rmse', model_type=['VAR'], verbose=2)
    model.fit(traindata=train_data, ts_column=ts_column, target=target, cv=3, sep=',')
    var_model = model.get_best_model()
    var_model.summary()

    # varmax
    train_data_test = train_data.copy()
    train_data_test.index = train_data_test['Date']
    train_data_test.drop(['Date'], axis=1, inplace=True)
    # select all variables as endog
    endog = train_data_test.loc[train_data_test.index, list(train_data_test.columns)]
    mod = sm.tsa.VARMAX(endog=endog, order=(1, 1))
    res = mod.fit(maxiter=1000, disp=False)
    print(res.summary())

P.S. When developing the package, did you implement any procedure to determine the optimal parameters or features?

kennis222 commented 2 years ago

I think I have figured out the issue. No offense intended; I am just curious whether this could be improved. In build_var.py, under the find_best_parameters function:

'''
for d_val in range(1, dmax):
        # Takes the target column and one other endogenous column at a time
        # and makes a prediction based on that. Then selects the best
        # exogenous column at the end.
'''

"Takes the target column and one other endogenous column at a time" is the reason why the output reports "best variable selected for VAR: AH". However, the best VAR model may include more than one other endogenous variable.
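
To make the one-at-a-time behaviour concrete, here is a minimal sketch of that kind of search. It is not the code in build_var.py: it assumes a plain least-squares VAR(1) in NumPy instead of VARMAX, scores each pair by one-step-ahead RMSE on a holdout window, and runs on synthetic data. The function names (var1_holdout_rmse, best_single_partner) and the columns x1 and x2 are hypothetical.

```python
import numpy as np

def var1_holdout_rmse(data, target_idx=0, holdout=20):
    """Fit a VAR(1) by least squares on the training rows and return the
    one-step-ahead RMSE of the target column on the last `holdout` rows."""
    train = data[:-holdout]
    test = data[-holdout - 1:]  # one extra row to supply the first lag
    X = np.column_stack([np.ones(len(train) - 1), train[:-1]])  # intercept + lag 1
    Y = train[1:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    X_test = np.column_stack([np.ones(len(test) - 1), test[:-1]])
    pred = X_test @ coef
    err = pred[:, target_idx] - test[1:, target_idx]
    return float(np.sqrt(np.mean(err ** 2)))

def best_single_partner(columns, target, holdout=20):
    """Pair the target with one other column at a time and keep the pair
    with the lowest holdout RMSE (stand-in for the search described above)."""
    scores = {}
    for name, series in columns.items():
        if name == target:
            continue
        pair = np.column_stack([columns[target], series])
        scores[name] = var1_holdout_rmse(pair, target_idx=0, holdout=holdout)
    return min(scores, key=scores.get), scores

# Synthetic data (assumption): the target depends on lagged x1 only.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = np.zeros(n)
y[1:] = 0.8 * x1[:-1] + 0.1 * rng.normal(size=n - 1)

best, scores = best_single_partner({"y": y, "x1": x1, "x2": x2}, target="y")
print("best single partner:", best)
```

Because only pairs are ever fitted, a model that needs two or more extra columns to predict the target well is never evaluated, which is the limitation raised above.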

AutoViML commented 2 years ago

Hi @kennis222 👍 You are correct that we choose at most one variable. The reason is that VARMAX is very slow even for small datasets. If we were to try every possible combination of variables, we could be running for a very long time even for tiny datasets. Hence a choice was made to limit it to one. If you have a better way or suggestions to make it faster and better, let us know. AutoViML
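
The cost argument can be put in numbers with a short count of how many model fits each strategy needs for k candidate columns besides the target. This is only arithmetic, not Auto_TS code. The greedy forward selection row is an assumption: it is one common middle ground, not something the package is stated to do, and the function names are hypothetical.

```python
from math import comb

def fits_exhaustive(k):
    """One fit per non-empty subset of the k candidate columns (2**k - 1)."""
    return sum(comb(k, r) for r in range(1, k + 1))

def fits_one_at_a_time(k):
    """One fit per candidate column paired with the target."""
    return k

def fits_greedy_forward(k):
    """Upper bound for greedy forward selection: k + (k - 1) + ... + 1 fits."""
    return k * (k + 1) // 2

for k in (5, 12, 20):
    print(k, fits_one_at_a_time(k), fits_greedy_forward(k), fits_exhaustive(k))
```

The exhaustive count doubles with every added column, while the other two grow linearly and quadratically, so even a slow model stays tractable under them.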

kennis222 commented 2 years ago

Thank you.

AutoViML / Auto_TS

feature selection in time series model #71
