Group tickers into correlated (or inversely correlated) groups

billray0259 commented 4 years ago

Using the correlation matrix, find some group(s) of tickers that are strongly related to each other. I have not determined an algorithm to do this yet.

billray0259 commented 4 years ago

To group the tickers, I first take the absolute value of the correlation matrix, so that inversely correlated tickers count as related. Then I use scipy's kmeans algorithm to cluster the columns of the absolute-value correlation matrix. Each column corresponds to a ticker, so each cluster is a group of tickers. The codebook generated by kmeans is not stored anywhere, but it likely carries some useful information about the relationship between the symbols.
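A minimal sketch of the grouping step described above, not the project's actual get_groupings() code. It assumes `scipy.cluster.vq.kmeans` followed by `vq` for the cluster assignment; the function name `group_tickers`, the group count, and the synthetic tickers are all made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq


def group_tickers(corr_mat, n_groups):
    """Cluster the columns of |corr_mat| (hypothetical helper)."""
    abs_corr = corr_mat.abs()
    # One observation per column; each observation is that ticker's
    # vector of absolute correlations with every other ticker.
    obs = np.ascontiguousarray(abs_corr.to_numpy().T, dtype=float)
    # kmeans returns the codebook (centroids) and the mean distortion.
    codebook, _ = kmeans(obs, n_groups)
    # vq assigns each observation to its nearest centroid.
    labels, _ = vq(obs, codebook)
    groups = {}
    for ticker, label in zip(abs_corr.columns, labels):
        groups.setdefault(int(label), []).append(ticker)
    return list(groups.values()), codebook


# Synthetic returns driven by two independent factors (illustrative data only).
rng = np.random.default_rng(0)
f1 = rng.normal(size=500)
f2 = rng.normal(size=500)
returns = pd.DataFrame({
    "AAA": f1 + 0.3 * rng.normal(size=500),
    "BBB": f1 + 0.3 * rng.normal(size=500),
    "CCC": -f1 + 0.3 * rng.normal(size=500),  # inversely correlated with AAA
    "DDD": f2 + 0.3 * rng.normal(size=500),
    "EEE": f2 + 0.3 * rng.normal(size=500),
    "FFF": -f2 + 0.3 * rng.normal(size=500),
})
corr = returns.corr()

groups, codebook = group_tickers(corr, 2)
print(groups)
```

Returning the codebook alongside the groups is one way to keep the centroid information that is currently discarded. Note that kmeans is randomly initialised, so the order of the groups can differ between runs.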

Edit: The tickers used in the following results included ETFs. It is possible that these ETFs tracked similar things and allowed the algorithm to achieve higher levels of correlation than if these ETFs were not included.

I looked at the average correlation coefficient between grouped tickers, expressed in units of the standard deviation of all coefficients in the correlation matrix. The groups averaged 1.7 standard deviations, with some groups as high as 4. When the rows of the correlation matrix were randomly shuffled, the average was only 0.7 standard deviations, with a maximum of ~1.5. The grouping process therefore raised the average to roughly 240% of the shuffled baseline (1.7 / 0.7 ≈ 2.4), an increase of about 140%.

The code used to test the grouping algorithm:

import pickle
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

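corr_mat_file_name = "data/all/correlation.h5"
groups_file_name = "data/all/groups.pkl"

# Load the list of ticker groups produced by the grouping step
with open(groups_file_name, "rb") as groups_file:
    groups = pickle.load(groups_file)

corr_mat = pd.read_hdf(corr_mat_file_name)

# Baseline: shuffle the rows, then relabel them so the row labels
# no longer match the data they sit next to
shuff = corr_mat.sample(frac=1)
shuff.index = shuff.columns

# Uncomment to run the test against the shuffled baseline
# corr_mat = shuff

# Standard deviation of all correlation coefficients in the matrix
std = corr_mat.values.std(ddof=1)

means = []

for group in groups:
    # Sub-matrix of correlations between the tickers in this group
    group_corr = corr_mat.loc[group, group]
    # For the unshuffled matrix this mean includes the diagonal
    # (self-correlations of 1)
    mean_corr = group_corr.mean().mean()
    means.append(mean_corr / std)

print(np.mean(means))

plt.hist(means, bins=20)
plt.show()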

billray0259 commented 4 years ago

After testing, I changed the get_groupings() method to save the groupings as a dictionary whose keys are floats and whose values are lists of tickers. Each key is the average correlation coefficient of its group divided by the standard deviation of all of the correlation coefficients. The testing code above now needs to be modified to reflect this change.
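A rough sketch of what the dictionary format and the adapted test loop could look like. The helper name `groups_by_score` and the small hand-made correlation matrix are assumptions for illustration; the real get_groupings() may build the dictionary differently.

```python
import numpy as np
import pandas as pd


def groups_by_score(corr_mat, groups):
    """Map normalised mean correlation -> list of tickers (hypothetical helper)."""
    # Standard deviation of every coefficient in the matrix.
    std = corr_mat.values.std(ddof=1)
    scored = {}
    for group in groups:
        # Mean of the group's sub-matrix, diagonal included.
        mean_corr = corr_mat.loc[group, group].values.mean()
        scored[float(mean_corr / std)] = list(group)
    return scored


# Small hand-made correlation matrix (illustrative values only).
tickers = ["AAA", "BBB", "CCC", "DDD"]
corr = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.0],
     [0.9, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.0, 0.1, 0.8, 1.0]],
    index=tickers, columns=tickers,
)

scored = groups_by_score(corr, [["AAA", "BBB"], ["CCC", "DDD"]])

# Adapted test loop: the keys already hold the normalised means,
# so the loop only needs to iterate over the dictionary items.
std = corr.values.std(ddof=1)
means = []
for score, group in scored.items():
    recomputed = corr.loc[group, group].values.mean() / std
    means.append(recomputed)

print(np.mean(means))
```

One caveat with float keys: two groups that happen to have exactly the same score would overwrite each other, so a list of (score, tickers) pairs may be a safer container.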

billray0259 / stockbot_2020
