HarryShomer / KG-Mixup

Implementation of the WWW'23 paper "Toward Degree Bias in Embedding-Based Knowledge Graph Completion"
https://arxiv.org/abs/2302.05044

Question about the fairness of the data augmentation #2

Closed — dbbice closed this 7 months ago

dbbice commented 7 months ago

Thank you for your work. I read the paper carefully; it analyzes the bias in the KGC task from a new perspective and uses KG-Mixup for data augmentation.

I have a question about the mixing criterion: candidate triples for mixing are selected from the triples that share the tail entity to be predicted. This means the tail entity is known in advance. Does this cause data leakage, or make the evaluation unfair?
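To be concrete, the selection step I am referring to looks roughly like this (a hypothetical sketch, not your actual code; `train_triples` is assumed to be a list of (head, relation, tail) id tuples):

```python
def candidate_triples(train_triples, tail):
    # Keep only the training triples whose tail matches the entity
    # being predicted -- this is the criterion I am asking about.
    return [(h, r, t) for (h, r, t) in train_triples if t == tail]
```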

Looking forward to your reply.

HarryShomer commented 7 months ago

Hi,

Because the augmentation is only done on the training data, there is no leakage: we already know all of the positive training samples beforehand. Also, since we mix the tail entity's embedding with another entity's embedding, the result can be very different from anything the model has seen before.
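To make the mixing step concrete, here is a minimal sketch, assuming the embeddings are PyTorch tensors and $\lambda$ is drawn from a Beta distribution as in standard mixup (the names `mix_tail`, `e_tail`, `e_cand`, and `alpha` are illustrative, not the exact ones in this repo):

```python
import torch

def mix_tail(e_tail: torch.Tensor, e_cand: torch.Tensor,
             alpha: float = 1.0) -> torch.Tensor:
    # Sample the interpolation coefficient, as in standard mixup.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    # Convex combination of the original tail embedding and the
    # candidate entity's embedding; the mixed embedding is generally
    # unlike any single entity seen during training.
    return lam * e_tail + (1.0 - lam) * e_cand
```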

Furthermore, it may be helpful to imagine an extreme scenario where the random value $\lambda$ is always equal to 1. The mixed entity is then identical to the original entity, so each augmented sample is just the original sample. Our method then degenerates to simple over-sampling of triples with a low tail-relation degree. In such a situation there is obviously no leakage, as we are just training on some samples more often.
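For example, setting $\lambda = 1$ in the sketch above reproduces the original embedding exactly, so the "augmented" triple is the original triple:

```python
import torch

e_tail = torch.randn(200)  # original tail embedding (illustrative size)
e_cand = torch.randn(200)  # candidate entity embedding

lam = 1.0  # the extreme case discussed above
mixed = lam * e_tail + (1.0 - lam) * e_cand
assert torch.equal(mixed, e_tail)  # degenerates to plain over-sampling
```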

Regards, Harry

dbbice commented 7 months ago

Thank you for your reply. I now understand the paper.