Accenture / AmpliGraph

Python library for Representation Learning on Knowledge Graphs https://docs.ampligraph.org
Apache License 2.0

Reopening issue 222 #223

Open luffycodes opened 3 years ago

luffycodes commented 3 years ago

Hey, I read the suggested paper. Can you please explain why the early stopping algorithm should have access to the test dataset?

I understand that the final filtered metrics can use the test dataset (as done in the paper, to report the actual performance), but the paper does not mention stopping early based on performance measured on the test dataset (one can use the validation dataset to get an estimate of the filtered metrics).

Citing the paper by Bordes verbatim: "selected the best model by early stopping using the mean rank on the validation sets (with a total of at most 1,000 epochs over the training data)"

Originally posted by @luffycodes in https://github.com/Accenture/AmpliGraph/issues/222#issuecomment-771786089

sumitpai commented 3 years ago

We are not evaluating the early stopping performance on the test set. The MRR computed during early stopping is computed only on the validation set X['valid'].

We only use the test set to filter known facts out of the negatives generated for each validation triple. Since the test set is a list of known facts, rather than unverified hypotheses, we filter out test set triples while generating negatives both during validation and during testing.
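For concreteness, here is a minimal sketch of what that looks like with the 1.x API (the hyperparameter values are arbitrary and the exact set of supported `early_stopping_params` keys may differ between versions, so treat this as illustrative rather than definitive): the validation triples go in `x_valid` and are the only triples being ranked, while `x_filter` concatenates train, validation and test so that known facts are removed from the corruptions.

```python
import numpy as np

from ampligraph.datasets import load_fb15k_237
from ampligraph.latent_features import ComplEx

X = load_fb15k_237()  # dict with 'train', 'valid', 'test' splits

model = ComplEx(batches_count=50, epochs=1000, k=200, eta=10, seed=0)

# The positives filter: every known fact, including the test set.
# It is only used to discard known triples from the generated corruptions.
filter_triples = np.concatenate((X['train'], X['valid'], X['test']))

model.fit(
    X['train'],
    early_stopping=True,
    early_stopping_params={
        'x_valid': X['valid'],       # triples that are actually ranked (validation only)
        'criteria': 'mrr',           # early stopping monitors validation MRR
        'x_filter': filter_triples,  # the test set appears here only as a filter
        'burn_in': 100,
        'check_interval': 50,
        'stop_interval': 4,
    },
)
```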

Consider this example: <Alec_Guinness, acted_in, star_wars> and several other similar triples (people who acted in Star Wars) are in the train set,

<Harrison_Ford, acted_in, star_wars> is in the validation set,

and assume that the following are in the test set:

<Carrie_Fisher, acted_in, star_wars>, <Natalie_Portman, acted_in, star_wars>, <Mark_Hamill, acted_in, star_wars>, and 97 more such facts.

In other words, we have 100 facts about actors who acted in Star Wars in our test set.

During early stopping, say we check for subject-side corruption only, i.e. for each triple in the validation set, we replace the subject with ALL the unique entities present in the graph. Then we filter out all the known facts. Finally, we score the remaining corruptions and rank the true triple against them.

In the example above, if we had not filtered ALL known facts, i.e. if we had not included the test set in the filter, then the corruptions generated for <Harrison_Ford, acted_in, star_wars> would also contain those 100 test set triples. Now, if our model ranks <Harrison_Ford, acted_in, star_wars> at 101 (say the 100 test set triples are ranked better than it), would you call it a bad model? To get the true performance of the model on the validation set, we must concatenate the test set into the filter.
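The effect of the filter can be reproduced with a small, library-agnostic sketch. Here `score` is a hypothetical stand-in for whatever the trained model computes, and the entity and triple sets are the toy Star Wars example above:

```python
def filtered_rank(true_triple, all_entities, known_triples, score):
    """Rank a triple against its subject-side corruptions, filtering known facts."""
    s, p, o = true_triple
    # Replace the subject with every other entity in the graph ...
    corruptions = [(e, p, o) for e in all_entities if e != s]
    # ... and drop corruptions that are known facts (train/valid/test).
    corruptions = [t for t in corruptions if t not in known_triples]
    # Rank the true triple against the surviving corruptions.
    true_score = score(true_triple)
    better = sum(1 for t in corruptions if score(t) > true_score)
    return better + 1

# With the test set in `known_triples`, the 100 "acted_in star_wars" facts are
# removed from the corruptions of <Harrison_Ford, acted_in, star_wars>, so a model
# that scores all Star Wars actors highly is no longer penalised with a rank of 101.
```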

Just to summarize: we do not perform early stopping on the test set. We only use it to filter known facts out of the corruptions of validation triples, in order to get the true performance of the model on the validation set.
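The same validation metric can also be reproduced after training with `evaluate_performance`: rank only the validation triples, but filter with every known fact. A sketch, assuming the 1.x evaluation API and reusing `X` and `model` from the earlier snippet:

```python
import numpy as np
from ampligraph.evaluation import evaluate_performance, mrr_score

# X and model are the dataset dict and fitted model from the earlier sketch.
# Rank only the validation triples; the filter contains all known facts,
# including the test set, so they never appear among the corruptions.
filter_triples = np.concatenate((X['train'], X['valid'], X['test']))
ranks = evaluate_performance(X['valid'], model=model, filter_triples=filter_triples)
print('Validation MRR (filtered):', mrr_score(ranks))
```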

Does that answer your question?

I guess @lukostaz can give a clearer explanation of this (tagging him in this thread).