Closed by billray0259 4 years ago
To group the tickers, I first take the absolute value of the correlation matrix. Then I use scipy's kmeans algorithm to cluster the columns of that absolute-value matrix. Each column corresponds to a ticker, and each resulting cluster is a group of tickers. The codebook generated by kmeans is not stored anywhere, but it likely carries some useful information about the relationship between the symbols.
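A minimal sketch of the grouping step described above. It assumes that "scipy's kmeans" means `scipy.cluster.vq.kmeans` together with `vq`; the function name `group_tickers`, the `n_groups` parameter, and the synthetic tickers are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq


def group_tickers(corr_mat, n_groups):
    """Cluster the columns of |corr_mat| with k-means (hypothetical helper)."""
    abs_corr = corr_mat.abs()
    # Treat each column as one observation: transpose so rows are observations.
    obs = np.ascontiguousarray(abs_corr.to_numpy(dtype=float).T)
    # kmeans returns the codebook (centroids) and the mean distortion.
    # It may return fewer than n_groups centroids if a cluster ends up empty.
    codebook, _ = kmeans(obs, n_groups)
    # vq assigns every observation to its nearest centroid.
    labels, _ = vq(obs, codebook)
    by_label = {}
    for ticker, label in zip(abs_corr.columns, labels):
        by_label.setdefault(int(label), []).append(ticker)
    return list(by_label.values()), codebook


# Synthetic returns driven by two independent factors (illustrative data only).
rng = np.random.default_rng(0)
factors = rng.normal(size=(250, 2))
returns = pd.DataFrame({
    "AAA": factors[:, 0] + 0.1 * rng.normal(size=250),
    "BBB": factors[:, 0] + 0.1 * rng.normal(size=250),
    "CCC": factors[:, 1] + 0.1 * rng.normal(size=250),
    "DDD": factors[:, 1] + 0.1 * rng.normal(size=250),
})
corr_mat = returns.corr()

groups, codebook = group_tickers(corr_mat, n_groups=2)
print(groups)
```

Returning the codebook alongside the groups is one way to keep the centroid information that is currently discarded.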
Edit: The tickers used in the following results included ETFs. It is possible that these ETFs tracked similar things and allowed the algorithm to achieve higher levels of correlation than if these ETFs were not included.
I looked at the average correlation coefficient between grouped tickers, expressed in units of the standard deviation of all coefficients in the matrix. The groups averaged 1.7 standard deviations, with some groups as high as 4 standard deviations. When the rows of the coefficient matrix were randomly shuffled, the average correlation coefficient between grouped tickers was only 0.7 standard deviations, with a maximum of ~1.5 standard deviations. The grouping process therefore raised the average within-group correlation to roughly 2.4 times the shuffled baseline (1.7 / 0.7 ≈ 2.4), an increase of about 140%.
The code used to test the grouping algorithm:
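```python
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

corr_mat_file_name = "data/all/correlation.h5"
groups_file_name = "data/all/groups.pkl"

# Load the saved groups; each group is a list of tickers.
with open(groups_file_name, "rb") as groups_file:
    groups = pickle.load(groups_file)

corr_mat = pd.read_hdf(corr_mat_file_name)

# Baseline: shuffle the rows, then relabel them so .loc[group, group] still works.
shuff = corr_mat.sample(frac=1)
shuff.index = shuff.columns
# Uncomment to score the groups against the shuffled matrix instead.
# corr_mat = shuff

# Standard deviation of every coefficient in the matrix.
std = corr_mat.values.std(ddof=1)

means = []
for group in groups:
    group_corr = corr_mat.loc[group, group]
    mean_corr = group_corr.mean().mean()
    # Express the group's mean correlation in units of the overall std.
    means.append(mean_corr / std)

print(np.mean(means))
plt.hist(means, bins=20)
plt.show()
```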
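One thing the comparison may be sensitive to: `corr_mat.loc[group, group]` includes the diagonal, where every ticker correlates with itself at 1.0, while the shuffled matrix scatters those 1.0 entries elsewhere. The sketch below shows one way to average only the off-diagonal entries. The helper name `offdiag_mean` and the toy matrix are hypothetical, and this is a possible variation, not what the test above does.

```python
import numpy as np
import pandas as pd


def offdiag_mean(corr_mat, group):
    """Mean pairwise correlation within a group, ignoring self-correlations."""
    block = corr_mat.loc[group, group].to_numpy(dtype=float)
    n = block.shape[0]
    if n < 2:
        # A single ticker has no pairwise correlations to average.
        return float("nan")
    mask = ~np.eye(n, dtype=bool)  # True everywhere except the diagonal
    return float(block[mask].mean())


# Toy correlation matrix (illustrative values only).
tickers = ["AAA", "BBB", "CCC"]
toy_corr = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.1],
     [0.2, 0.1, 1.0]],
    index=tickers,
    columns=tickers,
)

pair = ["AAA", "BBB"]
with_diag = float(toy_corr.loc[pair, pair].mean().mean())
without_diag = offdiag_mean(toy_corr, pair)
print(with_diag, without_diag)
```

For small groups the difference can be large, since the diagonal makes up a bigger share of the block.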
After testing, I changed the get_groupings() method to save the groupings as a dictionary where the keys are floats and the values are the lists of tickers. Each key is the average correlation coefficient of that group divided by the standard deviation of all of the correlation coefficients. Now the testing code above needs to be modified to reflect this change.
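A rough sketch of what the dictionary format could look like and how testing code might consume it. The real get_groupings() is not shown in this thread, so `score_groups` and the toy matrix below are hypothetical stand-ins that only mirror the description: float score as key, list of tickers as value.

```python
import pandas as pd


def score_groups(corr_mat, groups):
    """Map each group's normalized mean correlation to its tickers (hypothetical)."""
    std = corr_mat.values.std(ddof=1)
    scored = {}
    for group in groups:
        block = corr_mat.loc[group, group]
        score = float(block.mean().mean() / std)
        # Caveat: two groups with exactly the same score would collide here.
        scored[score] = list(group)
    return scored


# Toy correlation matrix (illustrative values only).
tickers = ["AAA", "BBB", "CCC", "DDD"]
corr_mat = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.1],
     [0.8, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.6],
     [0.1, 0.1, 0.6, 1.0]],
    index=tickers,
    columns=tickers,
)

scored = score_groups(corr_mat, [["AAA", "BBB"], ["CCC", "DDD"]])

# The scores are already the dictionary keys, so testing code can read them
# directly instead of recomputing them per group.
means = list(scored.keys())
for score, group in sorted(scored.items(), reverse=True):
    print(f"{score:.2f}", group)
```

With this format the histogram in the test could be drawn from the keys alone, though using floats as keys means that groups with identical scores would overwrite each other.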
Using the correlation matrix, find some group(s) of tickers that are strongly related to each other. I have not determined an algorithm to do this yet.