deepakn97 / relationPrediction

ACL 2019: Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs

error in results due to negative sampling data leakage #28

Open hshahbazi opened 4 years ago

hshahbazi commented 4 years ago

Hi, I've noticed an issue in this work, as follows. In the module create_batch.py, negative samples are not allowed to coincide with test triples: valid_triples_dict is defined over all splits,

```python
self.valid_triples_dict = {j: i for i, j in enumerate(
    self.train_triples + self.validation_triples + self.test_triples)}
```

and is later used to reject corrupted triples:

```python
while (random_entities[current_index],
       self.batch_indices[last_index + current_index, 1],
       self.batch_indices[last_index + current_index, 2]) in self.valid_triples_dict.keys():
```

So during training, the model is effectively taught that the test triples are valid. We are not allowed to use test data during training; this leakage makes the performance climb higher and higher with more epochs for no good reason.
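A minimal sketch of a possible fix, restricting the filter to training triples only (the variable names follow the repository's code, but this is just an illustration, not a tested patch):

```python
# Filter negatives against training triples only, so that negative
# sampling never reveals anything about validation or test triples.
self.train_triples_dict = {j: i for i, j in enumerate(self.train_triples)}

while (random_entities[current_index],
       self.batch_indices[last_index + current_index, 1],
       self.batch_indices[last_index + current_index, 2]) in self.train_triples_dict:
    # re-sample the corrupted entity ...
    pass
```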

378egg commented 3 years ago

I noticed this too. So are the results in the paper in error?

hshahbazi commented 3 years ago

Yes.

378egg commented 3 years ago

Sorry, have you tried running more epochs in your experiments?

hshahbazi commented 3 years ago

I experimented with more epochs and with other models such as TuckER, and saw that the learning approach is not right.

wbaz4321 commented 3 years ago

> I experimented with more epochs and with other models such as TuckER, and saw that the learning approach is not right.

Sorry, I noticed this too, and I found that another paper, 'A Re-evaluation of Knowledge Graph Completion Methods', also mentions this issue. But I suppose the model structure, i.e. the attention mechanism and the encoder-decoder architecture, is still valid, and the problem lies only in the training and evaluation protocol, right? I ask because I found the idea in this paper interesting and plan to modify and apply it in my own work.

hshahbazi commented 3 years ago

Not sure what you mean by valid. Experimentally, the model gives Hits@10 = 33% on FB15k-237, which is very low for this task. So, experimentally, we cannot reach any conclusion about the validity of any of its parts.

wbaz4321 commented 3 years ago

> Not sure what you mean by valid. Experimentally, the model gives Hits@10 = 33% on FB15k-237, which is very low for this task. So, experimentally, we cannot reach any conclusion about the validity of any of its parts.

Many thanks for your reply. Could you please tell me how you ran that experiment? In this code:

```python
while (random_entities[current_index],
       self.batch_indices[last_index + current_index, 1],
       self.batch_indices[last_index + current_index, 2]) in self.valid_triples_dict.keys():
```

did you change valid_triples_dict so that it only includes training data?

chauhanjatin10 commented 3 years ago

Hi @wbaz4321 and @hshahbazi, thanks for showing interest in the work. As you have pointed out, there are issues regarding leakage in the negative sampling, and we have been running experiments to rectify this. After extensive experiments, we have empirically concluded that the issue was the use of ConvKB as the decoder. When we replace ConvKB with ConvE, Hits@10 on Freebase reaches up to 53% under the correct training protocol proposed in "A Re-evaluation of Knowledge Graph Completion Methods", using the ConvE code provided here: https://github.com/svjan5/kg-reeval. Thus, we have evidence supporting the claims about our encoder model, which will be updated soon by my co-author @deepakn97.

Also, as shown in https://arxiv.org/pdf/2009.08694.pdf and https://www.aclweb.org/anthology/2020.acl-main.526.pdf, KBGAT's proposed encoder model indeed works well, provided that the decoder is strong enough.

I hope this clarifies some of your doubts, @hshahbazi and @wbaz4321.

Thanks and regards,
Jatin

jinjinjin9 commented 3 years ago

So, just to be sure, can we conclude the following by now?

  1. with test leakage, the full system achieves very good scores
  2. without test leakage (i.e. after removing the test triples from valid_triples_dict), the full system achieves much worse scores

I don't see how this is "rectified" by not using ConvKB as the decoder. Is there a bug in ConvKB, then? Can you describe the experiments that "rectify the results" more closely? The two points above suggest that the bug was in fact in the training, caused by the data leakage.

I am sorry for the inconvenience, but I am working with your code and as of now I am very confused.

deepakn97 commented 3 years ago

As you can see in this paper (https://www.aclweb.org/anthology/2020.acl-main.489.pdf), we are dealing with two different issues here:

  1. Test data leakage
  2. Wrong evaluation strategy for the decoder part.

The above-mentioned paper removes the test leakage and uses the random evaluation strategy to obtain updated results for our paper. However, the authors do not tune the hyperparameters. After re-evaluating our model with updated hyperparameters, we were able to get ~44 Hits@10 using ConvKB (these results are after removing the test data leakage and using the correct evaluation protocol). ConvKB itself gives significantly worse results under the correct evaluation strategy: only 42.1 Hits@10 compared to the reported 51.7 Hits@10.

The paper also shows that ConvE does not suffer from this problem and that ConvE's results are independent of the evaluation protocol. More specifically, many of the neurons in ConvKB become zero after the ReLU activation, so ConvKB predicts the same score for numerous triples, which causes the stark difference in results between evaluation protocols. Following this reasoning, we decided to experiment with ConvE instead of ConvKB.
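To see why tied scores make the evaluation protocol matter so much, here is a small self-contained illustration (my own sketch, not the evaluation code from either repository):

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the decoder assigns identical scores to the correct entity and
# to many corrupted ones (e.g. because ReLU zeroed out the features).
scores = np.zeros(1000)   # 1 correct candidate + 999 negatives, all tied
correct = 0

# "Top" protocol: the correct triple is ranked ahead of all tied negatives.
rank_top = int(np.sum(scores > scores[correct])) + 1           # rank 1

# "Random" protocol: ties are broken uniformly at random.
n_ties = int(np.sum(scores == scores[correct])) - 1
rank_random = int(np.sum(scores > scores[correct])) + 1 + rng.integers(0, n_ties + 1)

print(rank_top, rank_random)  # 1 vs. roughly 500 on average
```

With a decoder that rarely produces tied scores, both protocols yield essentially the same ranks.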

I hope this is helpful to you.

jinjinjin9 commented 3 years ago

Thanks for the quick response!

But I wonder... if it's now 53 with the random evaluation strategy, after you freshly tuned the hyperparameters, it would still fall behind TuckER (53.6) and be on par with RotatE (53.0), in contrast to what's reported in the original results table in https://www.aclweb.org/anthology/2020.acl-main.489.pdf.

Can you also report new results under the original task formulation (no randomized evaluation strategy), where the paper reported 62.2 Hits@10?

deepakn97 commented 3 years ago

The reported results (with ConvE as the decoder) were obtained with the randomized evaluation strategy, and they are independent of the evaluation strategy. The task formulation is still the same, but the evaluation strategy previously used was wrong, so we don't think the results should be compared against those obtained with the wrong strategy. We also understand your concern that, in the current scenario, this is not a big leap in the results. However, we still feel that the main contribution of the paper is the attention model used to learn graph embeddings. As research progresses, different decoders can be used on top of KBGAT. We will try to provide a way to use different decoders (TuckER, RotatE, etc.) with our attention model. For now, we would appreciate it if you could save the embeddings generated by the attention model and use them to train the decoder models.
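Something along these lines should work for that last step (the attribute names below are illustrative; please check them against the code):

```python
import torch

# Save the embeddings learned by the attention-based encoder.
# `model_gat` is the trained KBGAT encoder; attribute names are illustrative.
torch.save(model_gat.final_entity_embeddings.detach().cpu(), 'entity_emb.pt')
torch.save(model_gat.final_relation_embeddings.detach().cpu(), 'relation_emb.pt')

# Later, initialize a decoder of your choice (ConvE, TuckER, RotatE, ...)
# from the saved embeddings before training it.
entity_emb = torch.load('entity_emb.pt')
relation_emb = torch.load('relation_emb.pt')
decoder.entity_embeddings.weight.data.copy_(entity_emb)
decoder.relation_embeddings.weight.data.copy_(relation_emb)
```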

wbaz4321 commented 3 years ago

Thanks for your efforts to adjust the model. I also personally think the encoding process is useful, and I am currently trying to modify it for my own work; thanks very much for providing a good reference. I have another question, about the classes SpecialSpmmFinal() and SpecialSpmmFunctionFinal(): could you please tell me what these functions do? In the paper, once we have the exponentiated attention values, we can simply sum them up to get e_rowsum, so why do we feed the values into this special function? Any suggestions are very appreciated.
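To make my question concrete: my current understanding is that the special spmm just computes that per-node sum of the exponentiated edge attention values efficiently over the sparse graph, i.e. something equivalent to this (my own sketch, not the repository's code):

```python
import torch

num_nodes = 5
# edge_index[0] is the source node of each edge; edge_exp holds the
# exponentiated attention value of each edge.
edge_index = torch.tensor([[0, 0, 1, 2, 4],
                           [1, 2, 0, 3, 3]])
edge_exp = torch.rand(edge_index.size(1))

# Sum the edge values per source node: the softmax denominator e_rowsum.
e_rowsum = torch.zeros(num_nodes).scatter_add_(0, edge_index[0], edge_exp)
```

Is that right, or do the special functions do something more?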

Also, I found that someone else has mentioned this issue as well, so if you can help us understand it, we would be very grateful.