Stock index data is generally noisy and non-stationary, which makes it a major challenge to predict the index with machine learning (ML) methods. The wavelet transform, which extends Fourier analysis by localizing a signal in both time and frequency, can serve as an effective filter that reduces the noise in stock index data and smooths the series, making it easier to focus on the main trend of the index.
In the figure, H, L and H', L' are the high-pass and low-pass filters for wavelet decomposition and reconstruction, respectively. In the decomposition phase, the low-pass filter removes the higher-frequency components of the signal and the high-pass filter picks up the remaining parts. The filtered signals are then downsampled by two; the results are called the approximation coefficients (low-pass branch) and the detail coefficients (high-pass branch). Reconstruction is the reverse of decomposition, and for perfect-reconstruction filter banks we have x = x'. A signal can be further decomposed by the cascade algorithm, as shown in the following equation:
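In the usual notation, writing $A_j$ and $D_j$ for the level-$j$ approximation and detail components reconstructed to the length of $x$ (symbols assumed here, not taken from the figure), an $n$-level cascade takes the form

$$x = A_1 + D_1 = A_2 + D_2 + D_1 = \cdots = A_n + D_n + D_{n-1} + \cdots + D_1$$

The NumPy sketch below illustrates this with the Haar wavelet, chosen only because its filters are the simplest. The choice of wavelet and all function names are assumptions for illustration, not necessarily what this project uses.

```python
import numpy as np

def haar_decompose(x):
    """One level: low-pass / high-pass filtering followed by downsampling by two."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        raise ValueError("this sketch needs an even-length signal")
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)  # low-pass branch (L)
    detail = (even - odd) / np.sqrt(2)  # high-pass branch (H)
    return approx, detail

def haar_reconstruct(approx, detail):
    """Inverse of haar_decompose: upsample and recombine the two branches."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def cascade(x, levels):
    """Decompose the approximation repeatedly; returns [cA_n, cD_n, ..., cD_1]."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, detail = haar_decompose(approx)
        details.append(detail)
    return [approx] + details[::-1]

def inverse_cascade(coeffs):
    """Rebuild the signal from [cA_n, cD_n, ..., cD_1]."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        approx = haar_reconstruct(approx, detail)
    return approx

def component(coeffs, index):
    """Reconstruct from one coefficient array, with all the others set to zero."""
    kept = [c if i == index else np.zeros_like(c) for i, c in enumerate(coeffs)]
    return inverse_cascade(kept)

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
coeffs = cascade(signal, levels=2)
restored = inverse_cascade(coeffs)  # perfect reconstruction: x = x'

a2 = component(coeffs, 0)  # A_2, the smoothed low-frequency part
d2 = component(coeffs, 1)  # D_2
d1 = component(coeffs, 2)  # D_1
```

Here `a2` is the smoothed series: keeping only the approximation discards the high-frequency detail, which is the filtering effect described above, and `a2 + d2 + d1` recovers the original signal.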
After the wavelet transform, the stock index data is split into two parts: low-frequency and high-frequency. The ARMA-ML model uses an ARMA model to predict the high-frequency part (the detail coefficients), since the high-frequency part can be treated as stationary, while ML methods such as SVR (Support Vector Regression) and GBR (Gradient Boosting Regression) predict the low-frequency part (the approximation coefficients). Finally, the two sets of predictions are combined to reconstruct the stock index. In short, the ARMA-ML model approaches the prediction from a time-series perspective.
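The data flow of this two-branch design can be sketched as follows. This is only a structural illustration under loud assumptions: a one-level Haar transform stands in for the wavelet step, and a plain least-squares model on lagged values stands in for both the ARMA branch and the SVR/GBR branch. All function names are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt(x):
    """One-level Haar transform of an even-length series."""
    pairs = np.asarray(x, dtype=float).reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / SQRT2  # low-frequency part
    detail = (pairs[:, 0] - pairs[:, 1]) / SQRT2  # high-frequency part
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse one-level Haar transform."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / SQRT2
    x[1::2] = (approx - detail) / SQRT2
    return x

def lagged_ls_forecast(series, n_lags):
    """One-step forecast from a least-squares fit on lagged values.

    Stand-in model only: the document uses ARMA for the detail branch
    and SVR/GBR for the approximation branch.
    """
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - n_lags
    X = np.column_stack([series[i:i + n_rows] for i in range(n_lags)])
    X = np.column_stack([X, np.ones(n_rows)])  # intercept column
    y = series[n_lags:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    latest = np.append(series[-n_lags:], 1.0)
    return float(latest @ coef)

def forecast_next_pair(close, n_lags=4):
    """Forecast each branch separately, then reconstruct the next two prices."""
    approx, detail = haar_dwt(close)
    next_approx = lagged_ls_forecast(approx, n_lags)  # ML branch (stand-in)
    next_detail = lagged_ls_forecast(detail, n_lags)  # ARMA branch (stand-in)
    return haar_idwt(np.array([next_approx]), np.array([next_detail]))

close = 100.0 + np.arange(64)  # synthetic, perfectly linear series
next_two = forecast_next_pair(close)
```

Because one pair of coefficients at level 1 corresponds to two samples, a one-step forecast in the coefficient domain reconstructs to two future prices.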
Appropriate values of p and q in the ARMA(p,q) model can be found by plotting the partial autocorrelation function (PACF) to estimate p and the autocorrelation function (ACF) to estimate q. Further information can be gleaned by examining the same functions for the residuals of a model fitted with an initial choice of p and q. Brockwell & Davis recommend using AICc to select p and q.
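As a rough illustration of these diagnostics, the sketch below computes the sample ACF, the sample PACF (via the Durbin-Levinson recursion) and the AICc score in plain NumPy. The function names are hypothetical, the AR(1) series is simulated, and the AICc parameter count assumes a zero-mean model with k = p + q + 1.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    lags = [np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)]
    return np.array([1.0] + lags)

def sample_pacf(x, max_lag):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    rho = sample_acf(x, max_lag)
    pacf = np.zeros(max_lag + 1)
    pacf[0] = 1.0
    phi_prev = np.zeros(0)
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            phi = np.array([phi_kk])
        else:
            num = rho[k] - np.dot(phi_prev, rho[k - 1:0:-1])
            den = 1.0 - np.dot(phi_prev, rho[1:k])
            phi_kk = num / den
            phi = np.append(phi_prev - phi_kk * phi_prev[::-1], phi_kk)
        pacf[k] = phi_kk
        phi_prev = phi
    return pacf

def aicc(log_likelihood, p, q, n_obs):
    """AICc for a zero-mean ARMA(p, q); k counts AR, MA and noise-variance terms."""
    k = p + q + 1
    return -2.0 * log_likelihood + 2.0 * k * n_obs / (n_obs - k - 1)

# Simulated AR(1) series with coefficient 0.6 (illustrative data only).
rng = np.random.default_rng(0)
noise = rng.normal(size=5000)
series = np.zeros(5000)
for t in range(1, 5000):
    series[t] = 0.6 * series[t - 1] + noise[t]

acf = sample_acf(series, max_lag=5)
pacf = sample_pacf(series, max_lag=5)
```

For an AR(1) process the PACF should be clearly non-zero at lag 1 and close to zero afterwards, which suggests p = 1; the candidate (p, q) pairs that survive this visual check can then be ranked by `aicc`, with lower values preferred.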
Support vector regression (SVR) is the regression version of the support vector machine (SVM). The model produced by support vector classification depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function ignores any training point whose target lies within a threshold ε of the model prediction.
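The "ignores points close to the prediction" behaviour comes from the ε-insensitive loss. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube around the prediction, linear outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.00, 1.00, 1.00, 1.00])
y_pred = np.array([1.05, 0.92, 1.30, 0.50])
loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
```

The first two points fall inside the tube and contribute no loss, so they do not influence the fitted model; only the last two, which lie outside it, do.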
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
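The stage-wise idea can be sketched in NumPy for the special case of squared-error loss, where the negative gradient is simply the current residual. Decision stumps on a single feature serve as the weak learners; the function names, learning rate and toy data are assumptions for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_gradient_boosting(x, y, n_stages=50, learning_rate=0.1):
    """Add one stump per stage, each fitted to the current residuals."""
    init = float(np.mean(y))
    prediction = np.full(len(y), init)
    stumps = []
    for _ in range(n_stages):
        residual = y - prediction  # negative gradient of 0.5 * (y - F)^2
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return {"init": init, "learning_rate": learning_rate, "stumps": stumps}

def predict_gradient_boosting(model, x):
    prediction = np.full(len(x), model["init"])
    for stump in model["stumps"]:
        prediction = prediction + model["learning_rate"] * predict_stump(stump, x)
    return prediction

def training_mse(x, y, n_stages):
    model = fit_gradient_boosting(x, y, n_stages=n_stages)
    return float(np.mean((y - predict_gradient_boosting(model, x)) ** 2))

x = np.linspace(0.0, 6.0, 80)
y = np.sin(x)
mse_5 = training_mse(x, y, n_stages=5)
mse_50 = training_mse(x, y, n_stages=50)
```

Swapping in another differentiable loss only changes the line that computes `residual`, which is the generalization the paragraph above refers to.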
The data selected are the daily index data of 000300.SH, representing large-cap stocks, and 000905.SH, representing mid- and small-cap stocks, with the following fields:
Open: daily opening price
High: daily high price
Low: daily low price
Close: daily closing price
Volume: trading volume
AMT: trading amount
Time range: 2010-01-01 to 2018-03-30
Use the previous 4 days' close prices to predict the next day's close price, using a 150-day rolling window. Finally, try to predict 30 days of close prices.
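One way to read this setup is sketched below: build a feature matrix from the 4 previous closes, then refit on the most recent 150 samples before each one-step prediction. That reading of the rolling window is an assumption, the least-squares model is only a placeholder for the SVR/GBR regressors, and the price series is synthetic.

```python
import numpy as np

def make_lagged(close, n_lags=4):
    """Rows hold n_lags consecutive closes; the target is the following close."""
    close = np.asarray(close, dtype=float)
    n_rows = len(close) - n_lags
    X = np.column_stack([close[i:i + n_rows] for i in range(n_lags)])
    y = close[n_lags:]
    return X, y

def fit_least_squares(X, y):
    """Placeholder regressor: ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_least_squares(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def rolling_forecast(close, n_lags=4, window=150):
    """Refit on the latest `window` samples, then predict the next close."""
    X, y = make_lagged(close, n_lags)
    preds, targets = [], []
    for t in range(window, len(y)):
        coef = fit_least_squares(X[t - window:t], y[t - window:t])
        preds.append(predict_least_squares(coef, X[t:t + 1])[0])
        targets.append(y[t])
    return np.array(preds), np.array(targets)

# Synthetic random-walk "index" for illustration only.
rng = np.random.default_rng(0)
close = 3000.0 + np.cumsum(rng.normal(0.0, 20.0, size=400))
preds, targets = rolling_forecast(close, n_lags=4, window=150)
```

Each training slice ends at sample `t - 1`, so the model never sees the close it is asked to predict.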
Use common regression metrics (explained_variance, mean_absolute_error, mean_squared_error, r2_score) to evaluate the results.
Model | ev | mae | mse | r2 |
---|---|---|---|---|
GBR_Model | 0.084507 | 30.393337 | 1426.833774 | 0.046767 |
SVR_Model | -0.246318 | 51.662584 | 4424.650770 | -0.658574 |
GBR_SVR_Model | -0.272351 | 31.929540 | 1403.899401 | -0.441158 |
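For reference, the four columns of the table follow the standard definitions below, written out in NumPy so that the formulas are explicit. The helper name and the toy inputs are illustrative only.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return ev, mae, mse and r2 using their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = y_true - y_pred
    ev = 1.0 - np.var(error) / np.var(y_true)  # ignores a constant bias
    mae = np.mean(np.abs(error))
    mse = np.mean(error ** 2)
    ss_res = np.sum(error ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot  # penalizes a constant bias as well
    return {"ev": ev, "mae": mae, "mse": mse, "r2": r2}

report = regression_report([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```

An r2 of 0 corresponds to always predicting the mean of the test targets, so a negative r2 means the model does worse than that baseline. ev and r2 coincide when the errors average to zero and diverge when the predictions are biased.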
Stock indices, as time series, have inspired a great deal of forecasting research in both academia and the financial industry. Generally speaking, the main prediction methods are time-series analysis and machine learning models. Some research reports and papers present good ideas for predicting stock indices with combined models, such as time-series (TS) and ML models. Some also apply data processing methods such as the wavelet transform to make the data better suited to the different prediction models. All the reference papers and research reports have been uploaded to the reference folder.
Unfortunately, none of the models shows good predictive power: the ev and r2 scores are close to zero or even negative, which means the models do little better, and sometimes worse, than simply predicting the mean. This suggests that stock prices cannot be predicted precisely with these methods. However, the noise-reduction techniques, the time-series model and the nonlinear machine learning regression models can still serve as useful tools for further research in other fields.