Hi @luffycodes. I think the decrease in performance is due to overfitting: you have not used early stopping in the fit function. The page below describes the experiments performed on Freebase and the other datasets.
https://docs.ampligraph.org/en/1.3.2/experiments.html
At the bottom of that page (note 6) you will find the details of the early stopping parameters we used in our experiments.
Please try again with those parameters and you should notice a significant increase in performance.
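For concreteness, here is a minimal sketch of how early stopping is passed to `fit` in ampligraph (the hyperparameter and early-stopping values mirror the ones used further down in this thread; confirm against note 6 before relying on them):

```python
# Minimal sketch, assuming FB15K-237 and the TransE hyperparameters from the
# experiments page; verify the early-stopping values against note 6.
import numpy as np
from ampligraph.datasets import load_fb15k_237
from ampligraph.latent_features.models import TransE

X = load_fb15k_237()
model = TransE(batches_count=64, seed=0, epochs=4000, k=400, eta=30,
               optimizer='adam', optimizer_params={'lr': 0.0001},
               loss='multiclass_nll',
               regularizer='LP', regularizer_params={'lambda': 0.0001, 'p': 2})

model.fit(X['train'],
          early_stopping=True,
          early_stopping_params={
              'x_valid': X['valid'][::2],   # validation triples monitored during training
              'criteria': 'mrr',            # metric used to decide when to stop
              'burn_in': 0,                 # epochs before the first check
              'check_interval': 50,         # check every 50 epochs
              'stop_interval': 4,           # stop after 4 checks without improvement
              'x_filter': np.concatenate((X['train'], X['valid'], X['test'])),
          })
```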
In the note, x_filter is set to train + validation + test? Is it correct to include the test set in it? Also, can you explain how to declare these variables? x_valid is set to validation, but what is the validation variable set to?
I think x_filter should not include the test set. Please find the results below.
Results with test included in x_filter:
MR 212.5023, MRR 0.3067, Hits@1 0.2158, Hits@3 0.3410, Hits@10 0.4877
Results without test in x_filter:
MR 218.1656, MRR 0.2747, Hits@1 0.1798, Hits@3 0.3085, Hits@10 0.4652
Code to replicate the results:
```python
import numpy as np

from ampligraph.datasets import load_fb15k_237
from ampligraph.latent_features.models import TransE
from ampligraph.utils import save_model
from ampligraph.evaluation import hits_at_n_score, mr_score, evaluate_performance, mrr_score

# Load FB15K-237 and build the filter from the known facts (train + valid + test).
X = load_fb15k_237()
filter = np.concatenate((X['train'], X['valid'], X['test']))
validation = X['valid']

# Hyperparameters from https://docs.ampligraph.org/en/1.3.2/experiments.html
model = TransE(batches_count=64, seed=0, epochs=4000, k=400, eta=30,
               optimizer='adam', optimizer_params={'lr': 0.0001},
               loss='multiclass_nll',
               regularizer='LP', regularizer_params={'lambda': 0.0001, 'p': 2})

# Train with early stopping on a subsampled validation set (note 6 of the docs).
model.fit(X['train'],
          early_stopping=True,
          early_stopping_params={
              'x_valid': validation[::2],
              'criteria': 'mrr',
              'burn_in': 0,
              'check_interval': 50,
              'stop_interval': 4,
              'x_filter': filter,
          })

save_model(model, model_name_path='transe_seed_0.pkl')

# Filtered evaluation on the test set.
ranks0 = evaluate_performance(X['test'], model, filter, verbose=False)
mr = mr_score(ranks0)
mrr = mrr_score(ranks0)
hits_1 = hits_at_n_score(ranks0, n=1)
hits_3 = hits_at_n_score(ranks0, n=3)
hits_10 = hits_at_n_score(ranks0, n=10)
print(mr, mrr, hits_1, hits_3, hits_10)
```
> In the note, x_filter is set to train + validation + test? Is it correct to include the test set in it?
It depends on your use case. If x_test is a set of made-up hypotheses, which may or may not be facts, then x_filter shouldn't contain x_test.
But if x_test is made up of known facts, then we must include it in the filter. This is what is commonly done in the KG community, and is the standard evaluation protocol described in Bordes et al.
> Also, can you explain how to declare these variables? x_valid is set to validation, but what is the validation variable set to?
What you have done above is correct: 'x_valid': X['valid'][::2].
You can also set it to the full X['valid'], but we didn't see much change in performance either way. Each early stopping check takes a lot of time, so we reduce the validation set size purely for speed.
We include X['test'] in the filter for the standard datasets because the X['test'] triples are known facts.
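To make the two settings compared above explicit, here is a small sketch of how each filter array, and the subsampled validation set, would be built (nothing here beyond what is already used elsewhere in this thread):

```python
import numpy as np
from ampligraph.datasets import load_fb15k_237

X = load_fb15k_237()

# Standard protocol (test triples are known facts): filter on train + valid + test.
filter_with_test = np.concatenate((X['train'], X['valid'], X['test']))

# If the test triples were unverified hypotheses, the filter would exclude them.
filter_without_test = np.concatenate((X['train'], X['valid']))

# Subsampled validation set used for early stopping, purely for speed.
x_valid_subsampled = X['valid'][::2]
```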
Got it ! Thanks so much for helping me out with such a detailed reply, and thanks a ton for the code !
Hey, I read the suggested paper. Can you please explain why the early stopping algorithm should have access to the test dataset?
I understand that the final filtered metrics can have access to the test dataset (as done in the paper), but the paper does not mention stopping early based on performance measured with the test dataset (one can use the validation dataset to get an estimate of the filtered metric).
We are not evaluating the early stopping performance on the test set. The MRR computed during early stopping is computed only on the validation set X['valid'].
We only use the test set to filter the known facts out of the negatives generated for each validation triple. If the test set is a list of known facts, rather than unsure hypotheses, we filter out test set triples while generating negatives both during validation and during testing.
Consider this example: <Alec Guinness, acted_in, star_wars>, together with several other similar triples (people who acted in Star Wars), is in the train set;
<Harrison_Ford, acted_in, star_wars> is in the validation set;
and assume that the following are in the test set:
<Carrie_Fisher, acted_in, star_wars>, <Natalie_Portman, acted_in, star_wars>, <Mark_Hamill, acted_in, star_wars>, and 97 more such facts.
In other words, we have 100 facts about actors who acted in Star Wars in our test set.
During early stopping, say we check subject-side corruptions only, i.e. for each triple in the validation set we replace the subject with ALL the unique entities present in the graph. Then we filter out all the known facts, and finally we score and rank the remaining candidates.
In our example above, if we had not filtered ALL known facts, i.e. if we had not used the test set in the filter, then the corruptions generated for <Harrison_Ford, acted_in, star_wars> would also include those 100 test set triples. Now, if our model ranks <Harrison_Ford, acted_in, star_wars> at 101 (say the 100 test set triples are ranked better than it), would you call it a bad model? To get the true performance of the model on the validation set, we must concatenate the test set into the filter.
Just to summarize, we do not perform early stopping on the test set. We just use it to filter out known facts from the corruptions of validation triples in order to get the true performance of the model on the validation set.
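To make the ranking argument concrete, below is a toy, pure-Python sketch (not ampligraph internals; all entities and scores are invented) showing how including the test facts in the filter changes the rank of the validation triple in the Star Wars example:

```python
# Toy illustration of filtered ranking; entities and scores are made up.
train_facts = {("Alec_Guinness", "acted_in", "Star_Wars")}
test_facts = {("Carrie_Fisher", "acted_in", "Star_Wars"),
              ("Mark_Hamill", "acted_in", "Star_Wars")}  # stand-ins for the 100 test facts

valid_triple = ("Harrison_Ford", "acted_in", "Star_Wars")

# Subject-side corruption pool: every unique entity in the graph.
entities = ["Alec_Guinness", "Carrie_Fisher", "Mark_Hamill",
            "Harrison_Ford", "Random_Person_1", "Random_Person_2"]

# Hypothetical model scores: the model (correctly) scores real actors highly.
score = {"Alec_Guinness": 0.95, "Carrie_Fisher": 0.94, "Mark_Hamill": 0.93,
         "Harrison_Ford": 0.90, "Random_Person_1": 0.10, "Random_Person_2": 0.05}

def filtered_rank(triple, filter_facts):
    """Rank of `triple` among its subject-side corruptions, after removing
    any corruption that appears in `filter_facts` (the known facts)."""
    s, p, o = triple
    corruptions = [(e, p, o) for e in entities if e != s]
    corruptions = [c for c in corruptions if c not in filter_facts]
    ranked = sorted(corruptions + [triple], key=lambda t: score[t[0]], reverse=True)
    return ranked.index(triple) + 1

# Filter contains only the train facts: the test facts count against the model,
# so the validation triple is pushed down the ranking.
print(filtered_rank(valid_triple, train_facts))               # rank 3

# Filter contains train + test facts: only genuinely unknown corruptions remain.
print(filtered_rank(valid_triple, train_facts | test_facts))  # rank 1
```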
Does that answer your question?
I guess @lukostaz can give a clearer explanation for this. (tagging him to this thread)
Description
The performance obtained with TransE using the parameters given at https://docs.ampligraph.org/en/latest/experiments.html differs from the reported results on the FB15K-237 dataset.
Actual Behavior
Expected Behavior
The expected results, along with the hyperparameters used, are posted at https://docs.ampligraph.org/en/latest/experiments.html.
Steps to Reproduce
```python
import numpy as np

from ampligraph.datasets import load_fb15k_237
from ampligraph.latent_features.models import TransE
from ampligraph.utils import save_model
from ampligraph.evaluation import hits_at_n_score, mr_score, evaluate_performance, mrr_score

X = load_fb15k_237()
model = TransE(batches_count=64, seed=0, epochs=4000, k=400, eta=30,
               optimizer='adam', optimizer_params={'lr': 0.0001},
               loss='multiclass_nll',
               regularizer='LP', regularizer_params={'lambda': 0.0001, 'p': 2})

# No early stopping: plain training for 4000 epochs.
model.fit(X['train'])
save_model(model, model_name_path='transe_seed_0.pkl')

# Filtered evaluation on the test set.
filter = np.concatenate((X['train'], X['valid'], X['test']))
ranks0 = evaluate_performance(X['test'], model, filter, verbose=False)
mr = mr_score(ranks0)
mrr = mrr_score(ranks0)
hits_1 = hits_at_n_score(ranks0, n=1)
hits_3 = hits_at_n_score(ranks0, n=3)
hits_10 = hits_at_n_score(ranks0, n=10)
```