Accenture / AmpliGraph

Python library for Representation Learning on Knowledge Graphs https://docs.ampligraph.org
Apache License 2.0

Specifying false triplets rather than corruption #72

Open AlexisPister opened 5 years ago

AlexisPister commented 5 years ago

Background and Context

Hi, it seems that all the models can generate false triplets by inverting the subjects and objects of existing ones. However, I am trying to generate embeddings from a graph where each triplet has a label, 'True' or 'False'. So I would like to specify the false triplets explicitly for training rather than generate new ones. Is this possible in the current version?

sumitpai commented 5 years ago

Currently it is not supported, but this is a good suggestion. We discussed it and agreed it would be a good feature to add right away.

This is what we propose and plan to implement:

For training: the `fit` function gets an additional input called `corruption_strategy`:

```python
def fit(X, ..., corruption_strategy={'type': 'default', 'x_neg': None})
```

`type` takes its value from the list `['default', 'external', 'mix']`.

For evaluation: same as above, but we support only two types, `['default', 'external']`. We would not have `'mix'`, as it wouldn't make sense to randomly generate corruptions and rank against them.

Each triple in the test set would either be ranked against the corresponding corruptions present in `x_neg`, or ranked against all generated corruptions (i.e. the current strategy).
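For concreteness, here is a minimal sketch of how the proposed (not yet implemented) `corruption_strategy` argument might be used; `TransE` is just one of the library's models, and the exact signature is still a proposal:

```python
import numpy as np
from ampligraph.latent_features import TransE

# Positive ('True') triples and externally supplied negative ('False') triples.
X_train = np.array([['a', 'likes', 'b'],
                    ['b', 'likes', 'c']])
X_neg = np.array([['a', 'likes', 'c']])

model = TransE()
# Proposed API (an assumption, not yet in the library): train against the
# externally supplied negatives instead of synthetic corruptions.
model.fit(X_train, corruption_strategy={'type': 'external', 'x_neg': X_neg})
```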

What is your suggestion? Would the above plan suffice for this feature?

AlexisPister commented 5 years ago

I don't understand why it is necessary to have at least the same number of negative examples as positive examples. Can't we just have fewer negatives?

As for the evaluation, I don't get the point of giving external false statements as input at all. The embedding model is not modified at that point, right? Is it to classify them as true or false?

Apart from these thoughts, it seems good to me. I am looking forward to this feature!

sumitpai commented 5 years ago

If I understand correctly, you have labelled triples (i.e. positive and negative triples) and you would like to train/evaluate on a binary classification task (i.e. compute precision, recall, F1, accuracy, etc.). This is possible, and something we can add to the library as a new feature.

We can convert the scores returned by `model.predict()` into probability estimates using a logistic sigmoid. Using the validation set, we can assess the operating range of the classifier by computing the ROC curve, select a decision threshold for classifying a triple as positive or negative, and use this threshold during testing.
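A minimal sketch of that calibration step, assuming `model.predict()` returns raw scores, and picking the threshold via Youden's J statistic (one reasonable choice among many):

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid
from sklearn.metrics import roc_curve

# Placeholder validation data: raw scores from model.predict()
# and binary labels (1 = true triple, 0 = false triple).
scores_val = np.array([2.1, -0.3, 1.7, -1.2])
y_val = np.array([1, 0, 1, 0])

# Turn scores into probability estimates.
probs_val = expit(scores_val)

# Compute the ROC curve and pick the threshold that maximises
# TPR - FPR (Youden's J statistic).
fpr, tpr, thresholds = roc_curve(y_val, probs_val)
best_threshold = thresholds[np.argmax(tpr - fpr)]

# At test time, classify a triple as positive if its probability
# estimate exceeds the chosen threshold.
scores_test = np.array([0.9, -2.0])
y_pred = (expit(scores_test) >= best_threshold).astype(int)
```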

Regarding the other point, AmpliGraph currently follows the negatives generation protocol described in the literature: corruptions based on the local closed-world assumption (LCWA), as described in Bordes2013. Under this assumption, triples that are not present in the graph are not False; they are just unseen (they may be positive or negative). The same protocol requires generating at least eta corruptions for each positive triple. We then score the positives and the corruptions, and try to maximise the scores of the positives. Because of this behaviour, we need at least eta negatives for each positive. If that is not the case, i.e. there are fewer than eta "external" negatives for a given triple, we need to complement them with synthetic corruptions. As an alternative, we could think about a flag to enable/disable such integration. When disabled, though, a positive without enough negatives should be discarded from the training set, so a pre-processing helper function would probably be required (i.e. you need to make sure each positive in your training set has at least the desired eta negatives); see the sketch after this paragraph.
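A hypothetical sketch of such a pre-processing helper (the function name and the matching rule are assumptions, not part of the library): a negative counts towards a positive only if it differs in exactly the subject or exactly the object, as the LCWA requires.

```python
import numpy as np
from collections import defaultdict

def filter_positives_with_enough_negatives(X_pos, X_neg, eta):
    # Index external negatives by the part of the triple they keep
    # fixed: under the LCWA, a negative is usable for a positive only
    # if it differs in exactly the subject or exactly the object.
    neg_subjects = defaultdict(set)  # negatives sharing (predicate, object)
    neg_objects = defaultdict(set)   # negatives sharing (subject, predicate)
    for s, p, o in X_neg:
        neg_subjects[(p, o)].add(s)
        neg_objects[(s, p)].add(o)

    keep = []
    for s, p, o in X_pos:
        # Count the subject-side and object-side corruptions available.
        n_neg = (len(neg_subjects.get((p, o), ()))
                 + len(neg_objects.get((s, p), ())))
        if n_neg >= eta:
            keep.append([s, p, o])
    return np.array(keep)
```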

The above discussion on negatives during training holds if we want to preserve the local closed-world assumption on which our training loop relies. For example:

If you consider the pairwise loss function, it necessarily requires negatives that differ only in the subject or the object. Otherwise the intuition behind it falls apart.
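For reference, a standard formulation of that pairwise (margin-based) loss, as in Bordes2013, with f the scoring function, γ the margin, G the positive triples, and C(t) the LCWA corruptions of t (AmpliGraph's exact implementation may differ in details):

```latex
\mathcal{L} = \sum_{t \in \mathcal{G}} \; \sum_{t^{-} \in \mathcal{C}(t)} \max\bigl(0,\; \gamma + f(t^{-}) - f(t)\bigr)
```

Minimising this pushes the score of each positive above the scores of its local corruptions by at least the margin γ; the comparison is only meaningful when t⁻ is a corruption of t, which is why arbitrary external negatives don't fit the loss.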

You process the training set triple by triple, and for each triple you use negative(s) that differ only in the subject or the object. That is because you want to train a model to distinguish positives from negatives, so you need meaningful negatives at each step. Using negatives picked at random from an external list, with no similarity to the triple currently being processed, would result in a poorly trained model, I believe. This is why, for each positive, we want to make sure there are enough external negatives that differ only in the subject or the object (i.e. complying with the LCWA, the local closed-world assumption).

When generating negatives we rely on the LCWA exactly because we want meaningful negatives. The LCWA says that corrupting a triple "locally" (e.g. only on one side) yields a corruption with better chances of being an actual negative (there is a paper that more or less proves this). A minimal sketch of this corruption strategy is shown below.
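A minimal sketch of LCWA-style corruption generation, assuming uniform entity sampling (the function name and sampling details are illustrative; the library's internal implementation may differ):

```python
import numpy as np

def generate_lcwa_corruptions(triple, entities, eta, seed=0):
    # Corrupt either the subject or the object, keeping the rest of
    # the triple fixed: the corruption stays "local" to the positive.
    rng = np.random.default_rng(seed)
    s, p, o = triple
    corruptions = []
    for _ in range(eta):
        e = entities[rng.integers(len(entities))]
        if rng.random() < 0.5:
            corruptions.append((e, p, o))  # subject-side corruption
        else:
            corruptions.append((s, p, e))  # object-side corruption
    return corruptions

# Example: 5 corruptions of a single positive triple.
print(generate_lcwa_corruptions(('a', 'likes', 'b'),
                                ['a', 'b', 'c', 'd'], eta=5))
```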

AlexisPister commented 5 years ago

Yes exactly, thank you for the explanations!