Test data leakage?

I've had a look at your recent ICLR20 paper; the results for FB15k-237 are outright amazing! I browsed the source code in this repository to better understand what you do. I stumbled across the following lines in dataset.py:

        for fact in query_ls:
            self.test_fact_ls.append((fact.val, fact.pred_name, tuple(fact.const_ls)))
            self.test_fact_dict[fact.pred_name].add((fact.val, tuple(fact.const_ls)))
            add_ht(fact.pred_name, fact.const_ls, self.ht_dict)

Here, query_ls contains the test set facts, and add_ht registers each fact in self.ht_dict.

If I interpret this correctly, the MLN is constructed as follows. First, a variable is added for each fact r(e1,e2) in the training, validation, and test data. Then, for each such fact, additional variables are (conceptually) added by perturbing e1 or e2, i.e., variables for all facts of the form r(e1,?) and r(?,e2) are added as well.

Each of the so-obtained variables is marked as observed (if it appears in the training data) or latent (otherwise).
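
To make my reading concrete, here is a minimal sketch of the construction as I understand it. It is not taken from the repository: the function build_variables, the triple layout (r, e1, e2), and the toy data are all hypothetical and only illustrate the description above.

```python
from itertools import chain


def build_variables(train, valid, test, entities):
    """Hypothetical sketch of the construction described above.

    Facts are (relation, head, tail) triples. This is an illustration
    of my reading, not the repository's actual code.
    """
    observed = set(train)
    variables = set()
    # Facts from ALL three splits seed the set of variables.
    for r, e1, e2 in chain(train, valid, test):
        variables.add((r, e1, e2))
        for e in entities:
            variables.add((r, e1, e))  # perturb the tail: r(e1, ?)
            variables.add((r, e, e2))  # perturb the head: r(?, e2)
    # Only training facts are observed; everything else is latent.
    return {v: ("observed" if v in observed else "latent") for v in variables}


# Toy data (made up for illustration).
train = [("married_to", "Alice", "Bob")]
valid = []
test = [("married_to", "JohnDoe", "JaneDoe")]
entities = ["Alice", "Bob", "JohnDoe", "JaneDoe"]

status = build_variables(train, valid, test, entities)
```

In this toy run, married_to(JohnDoe, Alice) becomes a latent variable only because JohnDoe is the head of a test fact, whereas married_to(Bob, Alice) is never created at all.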

Is this understanding correct?

I am asking because such an approach seems to leak validation and test data into training. Why? It's true that the truth values of the validation and test facts are not used during training. But the choice of variables in the MLN already tells the model that r(e1,?) and r(?,e2) are sensible queries, and consequently provides information about e1 and e2. That's fine for facts from the training data. For validation and test facts, however, it's problematic.

For example, consider a test set fact married_to(JohnDoe, JaneDoe). The mere existence of the variables married_to(JohnDoe, ?) informs the (tuneable) embedding of JohnDoe: it must be a person. Likewise for married_to(?, JaneDoe). That's the first reason for potential leakage. Another reason is that, without any inference or learning, one may "look" at the set of created variables and reduce the set of potential wives for JohnDoe to the set of persons that have been seen as wives in the validation or test data. (All facts from the training data are observed, so the corresponding wives are ruled out.) If so, this would significantly simplify the task.
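
The second concern can be illustrated with a toy sketch. Everything here is hypothetical (the function leaked_tail_candidates, the triple layout, and the data), and subtracting the tails seen in training is my simplification of "ruled out". The point is only that tails registered from held-out facts form a much smaller candidate set than the full entity vocabulary.

```python
from collections import defaultdict


def leaked_tail_candidates(train, heldout):
    """Per relation, tails that appear in held-out facts but not as
    tails of observed training facts. Hypothetical illustration only."""
    train_tails = defaultdict(set)
    heldout_tails = defaultdict(set)
    for r, _, tail in train:
        train_tails[r].add(tail)
    for r, _, tail in heldout:
        heldout_tails[r].add(tail)
    # No learning or inference involved: just set arithmetic on the
    # entities that were registered when the variables were created.
    return {r: tails - train_tails[r] for r, tails in heldout_tails.items()}


# Toy data (made up for illustration).
entities = {"Adam", "Alice", "JohnDoe", "JaneDoe", "Rick", "Mary"}
train = [("married_to", "Adam", "Alice")]
heldout = [
    ("married_to", "JohnDoe", "JaneDoe"),
    ("married_to", "Rick", "Mary"),
]

candidates = leaked_tail_candidates(train, heldout)
```

Here the candidates for married_to(JohnDoe, ?) shrink from all six entities to just JaneDoe and Mary before any model has been trained.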

I'd appreciate it if you could clarify whether the above description is accurate and, in particular, where I misunderstood the approach.

expressGNN / ExpressGNN

Test data leakage? #1