Accenture / AmpliGraph

Python library for Representation Learning on Knowledge Graphs https://docs.ampligraph.org
Apache License 2.0

Reported performance on TransE differs significantly (correct hyperparameters used) #222

Closed luffycodes closed 3 years ago

luffycodes commented 3 years ago

Description

Reported performance on TransE differs from the results obtained with the parameters given at https://docs.ampligraph.org/en/latest/experiments.html on the FB15K-237 dataset.

Actual Behavior

```
mr_score(ranks0)                232.22932772286916
mrr_score(ranks0)               0.23103557722066143
hits_at_n_score(ranks0, n=1)    0.10348370682062824
hits_at_n_score(ranks0, n=3)    0.29459829728936293
hits_at_n_score(ranks0, n=10)   0.4654320383599178
```

Expected Behavior

The expected results, together with the hyperparameters used, are posted at https://docs.ampligraph.org/en/latest/experiments.html.

Steps to Reproduce

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss
from ampligraph.datasets import load_fb15k_237
from ampligraph.latent_features.models import TransE
from ampligraph.utils import save_model
from ampligraph.evaluation import hits_at_n_score, mr_score, evaluate_performance, mrr_score

X = load_fb15k_237()
model = TransE(batches_count=64, seed=0, epochs=4000, k=400, eta=30,
               optimizer='adam', optimizer_params={'lr': 0.0001},
               loss='multiclass_nll',
               regularizer='LP', regularizer_params={'lambda': 0.0001, 'p': 2})
model.fit(X['train'])
save_model(model, model_name_path='transe_seed_0.pkl')

filter = np.concatenate((X['train'], X['valid'], X['test']))
ranks0 = evaluate_performance(X['test'], model, filter, verbose=False)

mr = mr_score(ranks0)
mrr = mrr_score(ranks0)
hits_1 = hits_at_n_score(ranks0, n=1)
hits_3 = hits_at_n_score(ranks0, n=3)
hits_10 = hits_at_n_score(ranks0, n=10)
```

sumitpai commented 3 years ago

Hi @luffycodes. I think the drop in performance is due to overfitting: you have not used early stopping in the fit function. The page below describes the experiments performed on Freebase and the other datasets.

https://docs.ampligraph.org/en/1.3.2/experiments.html

At the bottom of that page (in note 6) you have the details of the early stopping params that we used during our experiments.

Please try again with those params and you will notice a significant increase in the performance.

luffycodes commented 3 years ago

In the note, x_filter is set to train + validation + test? Is it correct to include the test set in it? Also, can you explain how to declare these variables? x_valid is set to validation, but what is the validation variable set to?

luffycodes commented 3 years ago

I think x_filter should not include the test set. Please find the results below.

Results with test added in x_filter:

```
mr       212.50229963792935
mrr      0.30669034959185953
hits_1   0.21579900185928172
hits_3   0.3409824836089637
hits_10  0.48769449065466286
```

Results without test in x_filter:

```
mr       218.165573930913
mrr      0.27473425369241977
hits_1   0.17978765045503473
hits_3   0.3084695175653195
hits_10  0.46518739602700854
```

Code to replicate the results:


```python
import numpy as np  # needed for np.concatenate below
from sklearn.metrics import brier_score_loss, log_loss
from ampligraph.datasets import load_fb15k_237
from ampligraph.latent_features.models import TransE
from ampligraph.utils import save_model
from ampligraph.evaluation import hits_at_n_score, mr_score, evaluate_performance, mrr_score

X = load_fb15k_237()
model = TransE(batches_count=64, seed=0, epochs=4000, k=400, eta=30,
               optimizer='adam', optimizer_params={'lr': 0.0001},
               loss='multiclass_nll',
               regularizer='LP', regularizer_params={'lambda': 0.0001, 'p': 2})

filter = np.concatenate((X['train'], X['valid'], X['test']))
train = X['train']
validation = X['valid']
test = X['test']

model.fit(X['train'],
          early_stopping=True,
          early_stopping_params={
              'x_valid': validation[::2],
              'criteria': 'mrr',
              'burn_in': 0,
              'check_interval': 50,
              'stop_interval': 4,
              'x_filter': filter,
          })

save_model(model, model_name_path='transe_seed_0.pkl')
ranks0 = evaluate_performance(X['test'], model, filter, verbose=False)
mr = mr_score(ranks0)
mrr = mrr_score(ranks0)
hits_1 = hits_at_n_score(ranks0, n=1)
hits_3 = hits_at_n_score(ranks0, n=3)
hits_10 = hits_at_n_score(ranks0, n=10)
```

sumitpai commented 3 years ago

> In the note, x_filter is set to train + validation + test? Is it correct to include the test set in it?

It depends on your use case. If x_test is a set of made-up hypotheses, which may or may not be facts, then x_filter shouldn't contain x_test.

But if x_test is made up of known facts, then we must include it in the filter. This is what is commonly done in the KG community, and is the standard evaluation protocol described in Bordes et al.

> Also, can you explain how to declare these variables? x_valid is set to validation, but what is the validation variable set to?

What you have done above is correct: `'x_valid': X['valid'][::2]`.

You can also set it to X['valid'], but we didn't see much increase/decrease in performance. Each early stopping test takes a lot of time, so we reduce the validation set size just for speed.
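
For reference, a quick illustrative sketch (nothing AmpliGraph-specific) of what the `[::2]` subsampling does to the validation split:

```python
# Illustrative only: [::2] keeps every second validation triple,
# roughly halving the number of triples scored at each early-stopping check.
x_valid_full = X['valid']
x_valid_half = X['valid'][::2]
print(x_valid_full.shape, x_valid_half.shape)
```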

We include X['test'] in filter for the standard datasets as X['test'] triples are known facts.
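
For completeness, here is a small sketch of the two filtering choices discussed above, reusing the functions already shown in this thread (it assumes `X` and a trained `model` are in memory):

```python
import numpy as np
from ampligraph.evaluation import evaluate_performance, mrr_score

# Standard filtered protocol (Bordes et al.): all known facts go into the filter.
filter_with_test = np.concatenate((X['train'], X['valid'], X['test']))

# Alternative when the test set holds unverified hypotheses rather than known facts.
filter_without_test = np.concatenate((X['train'], X['valid']))

for name, filt in [('with test in filter', filter_with_test),
                   ('without test in filter', filter_without_test)]:
    ranks = evaluate_performance(X['test'], model, filt, verbose=False)
    print(name, 'MRR:', mrr_score(ranks))
```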

luffycodes commented 3 years ago

Got it! Thanks so much for helping me out with such a detailed reply, and thanks a ton for the code!

luffycodes commented 3 years ago

Hey, I read the suggested paper. Can you please explain why the early stopping algorithm should have access to the test dataset?

I understand that the final filtered metrics can use the test dataset (as done in the paper), but the paper does not mention stopping early based on performance measured with the test dataset (one could use the validation dataset to get an estimate of the filtered metric).

sumitpai commented 3 years ago

We are not evaluating the early stopping performance on the test set. The MRR computed during early stopping is only on the validation set X['valid'].

We only use the test set to filter known facts out of the negatives generated for each validation triple. If the test set is a list of known facts, rather than unverified hypotheses, we filter out test-set triples while generating negatives both during validation and during testing.

Consider this example: <Alec_Guinness, acted_in, star_wars> and several other similar triples (people who acted in Star Wars) are in the train set.

<Harrison_Ford, acted_in, star_wars> is in the validation set.

And assume that the following are in the test set:

<Carrie_Fisher, acted_in, star_wars>
<Natalie_Portman, acted_in, star_wars>
<Mark_Hamill, acted_in, star_wars>
and 97 more such facts.

In other words, we have 100 facts about actors who acted in Star Wars in our test set.

During early stopping, say we check subject-side corruptions only, i.e. for each triple in the validation set we replace the subject with ALL the unique entities present in the graph. Then we filter out all the known facts. Finally, we score and rank them.

In the example above, if we had not filtered ALL known facts, i.e. if we had not used the test set in the filter, then when we generate corruptions for <Harrison_Ford, acted_in, star_wars> those 100 test-set triples would also be among the corruptions. Now, if our model ranks <Harrison_Ford, acted_in, star_wars> at 101 (say the 100 test-set triples are ranked better than it), would you call it a bad model? To get the true performance of the model on the validation set, we must concatenate the test set into the filter.
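
To make the numbers concrete, here is a toy NumPy sketch with made-up scores (not the AmpliGraph API) showing how the rank of the validation triple changes once the 100 known test-set facts are filtered out of its corruptions:

```python
import numpy as np

# Made-up plausibility scores (higher = more plausible), mirroring the example above.
valid_score = 8.0                                  # score of <Harrison_Ford, acted_in, star_wars>
test_fact_scores = np.full(100, 9.0)               # the 100 Star Wars facts sitting in the test set
other_scores = np.random.uniform(0.0, 5.0, 900)    # the remaining subject-side corruptions
corruption_scores = np.concatenate([test_fact_scores, other_scores])

# Raw rank: the 100 known test-set facts outrank the validation triple -> rank 101.
raw_rank = 1 + np.sum(corruption_scores > valid_score)

# Filtered rank: drop corruptions that are known facts (train/valid/test) before ranking -> rank 1.
filtered_scores = corruption_scores[100:]
filtered_rank = 1 + np.sum(filtered_scores > valid_score)

print(raw_rank, filtered_rank)   # 101 1
```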

Just to summarize, we do not perform early stopping on the test set. We just use it to filter out known facts from the corruptions of validation triples in order to get the true performance of the model on the validation set.

Does that answer your question?

I guess @lukostaz can give a clearer explanation of this (tagging him in this thread).