Accenture / AmpliGraph

Python library for Representation Learning on Knowledge Graphs https://docs.ampligraph.org
Apache License 2.0

Embedding size and effectiveness of training #142

Closed c0ntradicti0n closed 5 years ago

c0ntradicti0n commented 5 years ago

Description

I checked out this tutorial: https://github.com/Accenture/AmpliGraph/blob/master/docs/tutorials/ClusteringAndClassificationWithEmbeddings.md, changed the embedding_size to different values (k = 1, 2, 50), and also reduced the number of training epochs from 50 down to a minimal 1.

Actual Behavior

The classifier that is trained on top of the embeddings never achieves an F1 score below 0.53, whatever it classifies, despite these poor training parameters (only with k=1 and epochs=1 do I get down to 0.525). I also looked at the shape of the produced embeddings: it is always double the given embedding_size.

Expected Behavior

Why is the embedding_size parameter k doubled? And why is the classifier so effective on such poor embeddings?

Steps to Reproduce

import numpy as np
import pandas as pd
import ampligraph

print(ampligraph.__version__)

import requests
url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/football_graph.csv'
open('football_results.csv', 'wb').write(requests.get(url).content)

df = pd.read_csv("football_results.csv").sort_values("date")

print (df.isna().sum())

df = df.dropna()
df["train"] = df.date < "2014-01-01"
print (df.train.value_counts())

# Entities naming
df["match_id"] = df.index.values.astype(str)
df["match_id"] =  "Match" + df.match_id
df["city_id"] = "City" + df.city.str.title().str.replace(" ", "")
df["country_id"] = "Country" + df.country.str.title().str.replace(" ", "")
df["home_team_id"] = "Team" + df.home_team.str.title().str.replace(" ", "")
df["away_team_id"] = "Team" + df.away_team.str.title().str.replace(" ", "")
df["tournament_id"] = "Tournament" + df.tournament.str.title().str.replace(" ", "")
df["neutral"] = df.neutral.astype(str)

triples = []
for _, row in df[df["train"]].iterrows():
    # Home and away information
    home_team = (row["home_team_id"], "isHomeTeamIn", row["match_id"])
    away_team = (row["away_team_id"], "isAwayTeamIn", row["match_id"])

    # Match results
    if row["home_score"] > row["away_score"]:
        score_home = (row["home_team_id"], "winnerOf", row["match_id"])
        score_away = (row["away_team_id"], "loserOf", row["match_id"])
    elif row["home_score"] < row["away_score"]:
        score_away = (row["away_team_id"], "winnerOf", row["match_id"])
        score_home = (row["home_team_id"], "loserOf", row["match_id"])
    else:
        score_home = (row["home_team_id"], "draws", row["match_id"])
        score_away = (row["away_team_id"], "draws", row["match_id"])
    home_score = (row["match_id"], "homeScores", np.clip(int(row["home_score"]), 0, 5))
    away_score = (row["match_id"], "awayScores", np.clip(int(row["away_score"]), 0, 5))

    # Match characteristics
    tournament = (row["match_id"], "inTournament", row["tournament_id"])
    city = (row["match_id"], "inCity", row["city_id"])
    country = (row["match_id"], "inCountry", row["country_id"])
    neutral = (row["match_id"], "isNeutral", row["neutral"])
    year = (row["match_id"], "atYear", row["date"][:4])

    triples.extend((home_team, away_team, score_home, score_away,
                    tournament, city, country, neutral, year, home_score, away_score))

triples_df = pd.DataFrame(triples, columns=["subject", "predicate", "object"])
triples_df[(triples_df.subject=="Match3129") | (triples_df.object=="Match3129")]

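# Hold out 10,000 triples for evaluation without introducing unseen entities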
from ampligraph.evaluation import train_test_split_no_unseen

X_train, X_valid = train_test_split_no_unseen(np.array(triples), test_size=10000)

print('Train set size: ', X_train.shape)
print('Test set size: ', X_valid.shape)

from ampligraph.latent_features import ComplEx

import os
from ampligraph.utils import save_model, restore_model

import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

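# Train a ComplEx model with deliberately minimal k and epochs, or restore a cached one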
ke_model_path = "./football_ke.amplimodel"
if not os.path.isfile(ke_model_path):
    model = ComplEx(batches_count=50,
                    epochs=1,
                    k=1,
                    eta=20,
                    optimizer='adam',
                    optimizer_params={'lr':1e-3},
                    loss='multiclass_nll',
                    regularizer='LP',
                    regularizer_params={'p':3, 'lambda':1e-5},
                    seed=0,
                    verbose=True)

    print ("Training...")
    model.fit(X_train)
    save_model(model, model_name_path=ke_model_path)
else:
    model = restore_model(model_name_path=ke_model_path)

# All known triples (used as filter_triples when evaluating ranking metrics)
filter_triples = np.concatenate((X_train, X_valid))

from sklearn.decomposition import PCA
from incf.countryutils import transformations

print("Extracting Embeddings..")

id_to_name_map = {**dict(zip(df.home_team_id, df.home_team)), **dict(zip(df.away_team_id, df.away_team))}

teams = pd.concat((df.home_team_id[df["train"]], df.away_team_id[df["train"]])).unique()
team_embeddings = dict(zip(teams, model.get_embeddings(teams)))

embeddings_2d = PCA(n_components=2).fit_transform(np.array([i for i in team_embeddings.values()]))

print (embeddings_2d)
first_embeddings = list(team_embeddings.values())[0]
print (first_embeddings)
print (first_embeddings.shape)
print (embeddings_2d.shape)
from ampligraph.discovery import find_clusters
from sklearn.cluster import KMeans

print("Clustering..")

clustering_algorithm = KMeans(n_clusters=6, n_init=50, max_iter=500, random_state=0)
clusters = find_clusters(teams, model, clustering_algorithm, mode='entity')

def cn_to_ctn(country):
    try:
        return transformations.cn_to_ctn(id_to_name_map[country])
    except KeyError:
        return "unk"

plot_df = pd.DataFrame({"teams": teams,
                        "embedding1": embeddings_2d[:, 0],
                        "embedding2": embeddings_2d[:, 1],
                        "continent": pd.Series(teams).apply(cn_to_ctn),
                        "cluster": "cluster" + pd.Series(clusters).astype(str)})

top20teams = ["TeamBelgium", "TeamFrance", "TeamBrazil", "TeamEngland", "TeamPortugal", "TeamCroatia", "TeamSpain",
              "TeamUruguay", "TeamSwitzerland", "TeamDenmark", "TeamArgentina", "TeamGermany", "TeamColombia",
              "TeamItaly", "TeamNetherlands", "TeamChile", "TeamSweden", "TeamMexico", "TeamPoland", "TeamIran"]

from sklearn import metrics
metrics.adjusted_rand_score(plot_df.continent, plot_df.cluster)

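# Encode the match outcome: 0 = home win, 1 = draw, 2 = away win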
df["results"] = (df.home_score > df.away_score).astype(int) + \
                (df.home_score == df.away_score).astype(int)*2 + \
                (df.home_score < df.away_score).astype(int)*3 - 1

df.results.value_counts(normalize=True)

def get_features_target(mask):
    def get_embeddings(team):
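        # Teams unseen during KG training fall back to a NaN feature vector (XGBoost handles missing values natively)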
        return team_embeddings.get(team, np.full(list(team_embeddings.values())[0].shape[0], np.nan))

    X = np.hstack((np.vstack(df[mask].home_team_id.apply(get_embeddings).values),
                   np.vstack(df[mask].away_team_id.apply(get_embeddings).values)))
    y = df[mask].results.values
    return X, y

clf_X_train, y_train = get_features_target((df["train"]))
clf_X_test, y_test = get_features_target((~df["train"]))

clf_X_train.shape, clf_X_test.shape

np.isnan(clf_X_test).sum()/clf_X_test.shape[1]

from xgboost import XGBClassifier

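# Multi-class classifier on the concatenated home/away team embeddings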
clf_model = XGBClassifier(n_estimators=550, max_depth=5, objective="multi:softmax")

clf_model.fit(clf_X_train, y_train)

print (df[~df["train"]].results.value_counts(normalize=True))

print (metrics.accuracy_score(y_test, clf_model.predict(clf_X_test)))
tabacof commented 5 years ago

Hi @c0ntradicti0n:

  1. Some models such as ComplEx and HolE use complex-valued embeddings, so there is a real part and an imaginary part. Therefore, for these models, the actual embedding dimensionality is twice what is defined by k (see the sketch after this list).

  2. About the performance issue with varying k, I will look into it. Note that the accuracy score is not the same as the F1 score, but your point is definitely valid either way.
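
A minimal sketch of point 1, assuming the ComplEx model and the teams array from your snippet above (and that, as I read the implementation, get_embeddings returns the real and imaginary halves concatenated):

import numpy as np

k = 1                                  # the value passed to ComplEx(...) above
emb = model.get_embeddings(teams)      # ComplEx returns 2*k-dimensional vectors
print(emb.shape)                       # (n_teams, 2 * k)

# The first k columns hold the real part, the last k the imaginary part,
# so the vector can be split back into two k-dimensional halves:
real_part, imag_part = np.split(emb, 2, axis=1)
print(real_part.shape, imag_part.shape)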

tabacof commented 5 years ago

Regarding the second point, I have tried ComplEx and TransE with k equal to 100 and to 1 (which results in embedding sizes of 200 and 2 for ComplEx, and 100 and 1 for TransE, for the reasons explained in my previous comment). I also applied Logistic Regression, to see the impact of the classifier itself.
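
A sketch of the Logistic Regression comparison, reusing clf_X_train / y_train and clf_X_test / y_test from your snippet (my actual setup may have differed; note that scikit-learn, unlike XGBoost, does not accept NaN features, so test rows for unseen teams are dropped here):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Drop test rows for teams that were unseen during KG training (their features are NaN)
seen = ~np.isnan(clf_X_test).any(axis=1)

lr = LogisticRegression(max_iter=1000)
lr.fit(clf_X_train, y_train)
print(metrics.accuracy_score(y_test[seen], lr.predict(clf_X_test[seen])))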

First of all, I managed to reproduce your results: k=1 seems almost as good as k=100 for the downstream classifier. The MRR and the clustering, however, suffered greatly in this setting.

I actually got the best results with TransE with k=100 and Logistic Regression, where the accuracy reached 0.559. TransE also had a better MRR and adjusted Rand score (a clustering metric).
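
For completeness, a sketch of how the ranking and clustering metrics can be computed with the AmpliGraph 1.x evaluation utilities, assuming model, X_valid, filter_triples and plot_df from the code above (the exact protocol I ran may have differed):

from ampligraph.evaluation import evaluate_performance, mrr_score
from sklearn import metrics

# Rank the held-out triples against filtered corruptions of subject and object
ranks = evaluate_performance(X_valid,
                             model=model,
                             filter_triples=filter_triples,
                             use_default_protocol=True,
                             verbose=True)
print("MRR:", mrr_score(ranks))

# Clustering quality with respect to continents, as in the tutorial
print("ARI:", metrics.adjusted_rand_score(plot_df.continent, plot_df.cluster))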

I tried applying SHAP to the k=1 case, but the results were not particularly illuminating.
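
In case anyone wants to reproduce that, a minimal SHAP sketch for the XGBoost classifier trained above, assuming clf_model and clf_X_test from your snippet (shap is not a dependency of the tutorial):

import shap

# Per-feature attributions for the trained XGBoost model
explainer = shap.TreeExplainer(clf_model)
shap_values = explainer.shap_values(clf_X_test)

# With k=1 each team embedding has only 2 dimensions, so there are just 4 features
# per match, which is part of why the picture is not very informative
shap.summary_plot(shap_values, clf_X_test)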

I cannot explain why the accuracy is still strong with such a small embedding size. Somehow the embeddings are able to capture some team-quality information in this case (though no geographical information, as the clustering becomes poor). I have double-checked the baseline and the target, and they both seem appropriate.
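
As a sanity check, the classifier can be compared against simply predicting the most frequent outcome on the test split (a sketch, reusing df, y_test, clf_model and clf_X_test from your snippet above):

from sklearn import metrics

# Majority-class baseline vs. the embedding-based classifier
majority_acc = df[~df["train"]].results.value_counts(normalize=True).max()
clf_acc = metrics.accuracy_score(y_test, clf_model.predict(clf_X_test))
print("majority-class accuracy:", majority_acc)
print("classifier accuracy:", clf_acc)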

I will close this issue, as there doesn't seem to be anything that actually needs fixing in the tutorial notebook.