Airpino / HistDAWass

An R package for histogram data analysis

Prepare time series data for analysis (how to construct TdistributionH and HTS class)? #3

Open MislavSag opened 2 years ago

MislavSag commented 2 years ago

I can't figure out from the CRAN package docs how to prepare data for the (cluster) analysis.

I have time series data with intraday frequency. I would like to identify 3 clusters.

If I understand it right, I need to construct TdistributionH objects from my time series vector. But I am not sure how to transform my POSIXct object to a timestamp and how to add the timestamp to a distributionH object.

Here is my sample data:

library(highfrequency)
library(data.table)
library(HistDAWass)

# data
DT <- highfrequency::sampleOneMinuteData
DT[, ret := STOCK / shift(STOCK) - 1]
DT <- DT[, .(DT, ret)]  

# without time dimension (the first return is NA, so drop it)
x <- data2hist(na.omit(DT$ret))

# construct TdistributionH-class object from DT ??
# CODE HERE

# then how to construct HTS class ??
# CODE HERE
Airpino commented 2 years ago

Dear Mislav, the implementation of TdistributionH and HTS is still in progress, in the sense that the constructors are very basic. I suppose you need a distribution for each day to build the HTS (that is, a series of histograms). Anyway, the code for building a TdistributionH is

# construct a TdistributionH-class object from DT
My_new_Tdistr <- new("TdistributionH",
                     # fix the starting and ending time points here
                     period = list(start = min(DT$DT), end = max(DT$DT)),
                     x = x@x, p = x@p, m = x@m, s = x@s)

Now, I will show you how to construct an HTS for each day

library(highfrequency)
library(data.table)
library(tidyverse)
library(HistDAWass)

# data
DT <- highfrequency::sampleOneMinuteData
DT[, ret := STOCK / shift(STOCK) - 1]
DT <- DT[, .(DT, ret)]
# drop the NA return and add a day label for grouping
DT <- DT %>% na.omit() %>% mutate(day = format(DT, format = "%Y-%m-%d"))
# list of row indices, one element per day
tmp <- DT %>% group_by(day) %>% group_rows()

# create an empty list
list_of_t <- list()
for (i in seq_along(tmp)) {
  # create a TdistributionH for day i
  tmpx <- data2hist(DT$ret[tmp[[i]]] %>% na.omit())
  mint <- min(DT$DT[tmp[[i]]])
  maxt <- max(DT$DT[tmp[[i]]])
  My_new_Tdistr <- new("TdistributionH",
                       tstamp = i, # take care: only numeric values are admitted here
                       period = list(start = mint, end = maxt),
                       x = tmpx@x, p = tmpx@p, m = tmpx@m, s = tmpx@s)
  list_of_t[[i]] <- My_new_Tdistr
}
new_HTS <- new("HTS", epocs = length(tmp),
               ListOfTimedElements = list_of_t)

plot(new_HTS) # see it
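
Since tstamp admits only numeric values, one possible way to carry real time information is to convert a POSIXct value to seconds since the Unix epoch. This is a minimal base R sketch, not something the package requires; the variable names and the example date are made up for illustration.

```r
# A hypothetical intraday time point (UTC is chosen only to avoid time zone surprises)
t0 <- as.POSIXct("2020-01-01 09:30:00", tz = "UTC")

# Numeric time stamp: seconds since 1970-01-01 00:00:00 UTC
ts_num <- as.numeric(t0)

# Convert back to POSIXct when a readable date is needed
t_back <- as.POSIXct(ts_num, origin = "1970-01-01", tz = "UTC")
```

In the loop above, something like tstamp = as.numeric(mint) could then replace tstamp = i, assuming the slot accepts any numeric value.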

Anyway, the clustering methods work only with MatH instances. This means that, if you need to cluster the daily histograms via k-means (for example), you have to build a MatH object, as in the following code:

# create an empty list
list_of_t <- list()
for (i in seq_along(tmp)) {
  # create a distributionH (no time stamp needed for MatH)
  tmpx <- data2hist(DT$ret[tmp[[i]]] %>% na.omit())
  list_of_t[[i]] <- tmpx
}

new_mat<-MatH(x=list_of_t, nrows=length(tmp),
              ncols = 1,
              rownames = unique(DT$day),
              varnames = "returns")

plot(new_mat, type="DENS") # to see the data
res<-WH_kmeans(new_mat,k=3) #to perform k-means
MislavSag commented 2 years ago

@Airpino ,

Thanks a lot for the sample code.

BACKGROUND

You are right, my plan was to aggregate intraday data to daily data by constructing histograms.

My second plan is to use daily or hourly data for multiple stocks (say, S&P 500 stocks) and build a histogram from the cross section of returns.

Actually, my first motivation to inspect your package was this new paper: https://arxiv.org/pdf/2110.11848.pdf I was trying to find a package in R/Python that implements something similar.

I want to play around with time series clustering methods to see if it is possible to predict market regimes.

CODE

I understand how to construct the objects now. What is the actual difference between an HTS object and a MatH object? As I understand it, the only difference is the timestamp in HTS. I will use MatH in the end, since most functions require this object.

Do you have any recommendations for applying the models from the package to predicting market regimes? Is it, in your opinion, a reasonable approach?

I will open a new issue if I have additional questions. Thanks.

Airpino commented 2 years ago

Dear Mislav, as I told you, HTS is just a very basic prototype for which few analysis methods are implemented. This is because I have not yet worked on the analysis of HTS. The main difference between MatH and HTS is that MatH can contain several columns (it is a generalization of a classical data table where each cell holds a 1d histogram), while HTS is a list holding a single time series of histograms equipped with time stamps. I am not an expert in financial data analysis, but your approach seems reasonable. There are very few methods implemented in the package for HTS, but there is room for extending classical forecasting techniques to histogram time series. If you have any proposal (autoregressive techniques, for example, can be implemented using the two.component.regression model for histogram data; moving averages can be implemented too, ...), you can write to me and I will try to give you some hints.

MislavSag commented 2 years ago

@Airpino ,

I am playing around with the package. I have tried 3 different approaches for now:

  1. Construct daily histograms from minute data.
  2. Construct hourly "cross section histograms" from 500 assets.
  3. Construct approximately weekly histograms from hourly data with 10 periods between samples (those are the h1 and h2 parameters in the above paper, if I get it right).

For now, the first approach has the best in-sample results, judging from visual inspection of the market regimes.

In prediction, we are mostly interested in out-of-sample predictions. Is it possible to predict clusters (where k can be 1 and 2) for n periods in your package?

I am mostly interested in rolling predictions because this is how it is mostly done in real investing.

Should I use rolling window or expanding window of histograms (distributions)?
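
The thread does not settle this question, but the two schemes can be compared as sets of indices over the sequence of daily histograms. This is a minimal base R sketch; window_indices and its arguments are hypothetical names for illustration, not part of HistDAWass.

```r
# Build the index sets used for fitting at each step.
# n     : number of histograms in the series
# width : size of the first (and, for "rolling", every) window
window_indices <- function(n, width, type = c("rolling", "expanding")) {
  type <- match.arg(type)
  stopifnot(width >= 1, width <= n)
  ends <- width:n
  lapply(ends, function(e) {
    # rolling: fixed-size window; expanding: always start from the first histogram
    start <- if (type == "rolling") e - width + 1 else 1
    start:e
  })
}

rolling   <- window_indices(n = 6, width = 3, type = "rolling")
expanding <- window_indices(n = 6, width = 3, type = "expanding")
```

A rolling window keeps the sample size fixed and gradually drops old observations, while an expanding window keeps the whole history. Each index set could select the rows of the MatH used to refit the clusters, with the next histogram treated as the out-of-sample one.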

jchang183 commented 2 years ago

@Airpino , I would like to ask a follow-up question to this insightful discussion. I am also working on the same paper that @MislavSag mentioned above, and the in-sample results from the HistDAWass package look very convincing. I would like to dive deeper and predict out-of-sample data. Is there a prediction function for WH_kmeans? Many thanks,
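
The thread does not confirm a prediction function for WH_kmeans. One common way to label a new histogram out of sample is to assign it to the nearest cluster prototype in L2 Wasserstein distance. The sketch below does this in base R on raw return vectors rather than on package objects; wass2_sq, assign_cluster, and the simulated data are hypothetical and only illustrate the idea.

```r
# Squared L2 Wasserstein distance between two 1-D samples,
# approximated by comparing empirical quantiles on a common probability grid.
wass2_sq <- function(a, b, n_grid = 200) {
  p  <- (seq_len(n_grid) - 0.5) / n_grid
  qa <- quantile(a, probs = p, names = FALSE)
  qb <- quantile(b, probs = p, names = FALSE)
  mean((qa - qb)^2)
}

# Index of the prototype closest to the new sample (hypothetical helper).
assign_cluster <- function(new_sample, prototypes) {
  d <- vapply(prototypes, function(pr) wass2_sq(new_sample, pr), numeric(1))
  which.min(d)
}

# Simulated stand-ins for two cluster prototypes and one new day of returns
set.seed(1)
prototypes <- list(
  calm     = rnorm(1000, mean = 0, sd = 0.001),
  volatile = rnorm(1000, mean = 0, sd = 0.01)
)
new_day <- rnorm(390, mean = 0, sd = 0.009)

cluster_id <- assign_cluster(new_day, prototypes)
names(prototypes)[cluster_id]
```

With real data, the prototypes would come from the fitted clustering and the distance would be computed between histogram objects; how to extract them from the WH_kmeans result is left to the package documentation.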