How to do negative sampling with type constraints?

asaluja commented 4 years ago

Hi, thanks for putting this library together. I will put a feature request together in a similar format to the dgl repo:

🚀 Feature

Negative sampling with type constraints in dgl.contrib.sampling.EdgeSampler (via dataloader.sampler.TrainDataset).

Motivation

When using EdgeSampler to sample negative edges in knowledge graph link prediction, it would be useful to incorporate domain-specific type constraints. For example, edges (relations) in a KG are often typed (only specific entity types can slot into the head or tail entities), so an EdgeSampler that only samples negative edges by selecting head/tail nodes from a subset of all possible entities would greatly help.

Alternatives

One idea I had was to create different EdgeSampler objects for relations and then batch the graph based on relations. That way when sampling a mini-batch we are guaranteed that all facts in the batch have the same relation type and can apply the same EdgeSampler object to get negative samples. But it seems doing this requires diving into the C++ sampler code.

Another alternative is a two-step sampling procedure in training where I first a) sample positive edges only without replacement and then b) based on the relation types in the positive edges, sample negative edges from the specific EdgeSampler with replacement. This seems to be cleaner but also somewhat inefficient. Are there other disadvantages to this?
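The two-step procedure described above can be sketched in plain Python, independent of DGL. This is only an illustration under assumptions: the triples are `(head, relation, tail)` tuples, `tail_candidates` is a hypothetical user-supplied mapping from relation to allowed tail entities, and the function names are made up here and are not part of DGL-KE or `dgl.contrib.sampling`.

```python
import random

def sample_positive_batch(triples, batch_size, rng):
    """Step (a): draw positive triples without replacement."""
    return rng.sample(triples, batch_size)

def sample_typed_negatives(batch, tail_candidates, num_neg, rng):
    """Step (b): for each positive (h, r, t), corrupt the tail with entities
    drawn with replacement from the pool allowed for relation r.
    No filtering is done, so a negative may coincide with a true triple."""
    negatives = []
    for h, r, _t in batch:
        pool = tail_candidates[r]
        for _ in range(num_neg):
            negatives.append((h, r, rng.choice(pool)))
    return negatives

# Toy data (hypothetical): two relations with disjoint tail types.
triples = [
    (0, "born_in", 10), (1, "born_in", 11),
    (2, "works_for", 20), (3, "works_for", 21),
]
tail_candidates = {"born_in": [10, 11, 12], "works_for": [20, 21, 22]}

rng = random.Random(0)
batch = sample_positive_batch(triples, batch_size=2, rng=rng)
negatives = sample_typed_negatives(batch, tail_candidates, num_neg=3, rng=rng)
```

The per-edge Python loop is where the inefficiency mentioned above would likely show up, since every positive edge gets its own independent draws.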

Any guidance and tips on how best to implement this would be great. I'd be happy to contribute it back to the repo.

Pitch

Similar functionality to how type constraints work in OpenKE.

classicsong commented 4 years ago

Thank you for using DGL-KE. For negative sampling with type constraints, we can set the seed edges of the EdgeSampler to only the edges of the constrained edge types. The sampled positive edges will then contain only those edge types. On the negative sampling side, we (the C++ sampler) just corrupt the positive edges (head/tail pairs) and combine them with negative heads/tails. The edge types are not changed at this point.

It would be great if you could contribute this feature!

zheng-da commented 4 years ago

Thanks for the feature request. This is definitely something we should support. DGL-KE does joint negative sampling for efficiency. That is, instead of creating negative edges for each positive edge independently, we corrupt the head/tail node of a group of edges altogether and replace them with a new set of nodes randomly sampled from the graph. We need to extend joint negative sampling to the type constraint setting. We need to maintain the head/tail entities for each relation type. Potentially, we need to control the number of relations in a batch to achieve good efficiency.
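One way to picture joint negative sampling under type constraints is the sketch below. It is not the DGL-KE implementation: it assumes positives are `(head, relation, tail)` tuples and that `tail_candidates` (a hypothetical mapping) lists the allowed tails per relation. Positives are grouped by relation, and each group shares a single set of sampled negative tails.

```python
import random
from collections import defaultdict

def joint_typed_negatives(batch, tail_candidates, num_neg, rng):
    """Group positives by relation and draw ONE shared set of negative
    tails per group, taken from that relation's allowed tail pool."""
    by_rel = defaultdict(list)
    for h, r, t in batch:
        by_rel[r].append((h, r, t))

    groups = {}
    for r, positives in by_rel.items():
        shared_tails = [rng.choice(tail_candidates[r]) for _ in range(num_neg)]
        groups[r] = (positives, shared_tails)
    return groups

# Toy batch (hypothetical): three edges of one relation, one of another.
batch = [
    (0, "born_in", 10), (1, "born_in", 11), (4, "born_in", 12),
    (2, "works_for", 20),
]
tail_candidates = {"born_in": [10, 11, 12, 13], "works_for": [20, 21, 22]}

groups = joint_typed_negatives(batch, tail_candidates, num_neg=2, rng=random.Random(0))

# Every positive in a group is scored against the same shared tails.
for r, (positives, shared_tails) in groups.items():
    for h, _r, _t in positives:
        negative_triples = [(h, r, s) for s in shared_tails]
```

In this sketch the number of independent draws grows with the number of distinct relations in the batch, which is one reading of why the relation count per batch matters for efficiency.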

asaluja commented 4 years ago

@classicsong @zheng-da thanks for the quick response! Yes, I agree that joint negative sampling is more efficient, so ideally doing joint negative sampling with type constraints would be best. There are probably other ways to do it: batching relations together and applying a special sampler for every relation type (only one sampler per batch) is one way.

I imagine it will take some time for this to be added to the repo - meanwhile on my end, do you think the two-stage procedure suggested above (sampling positive edges first, then based on sampled relation types sample negative edges) is a good way or is there something easier? I spent some time familiarizing myself with your codebase and it seemed this was the easiest way to do it.

Thanks again for the great work.

zheng-da commented 4 years ago

@asaluja I agree that the two-stage procedure will work and it's something I have in mind as well. The main thing we need to take care of is how to combine this with joint negative sampling. We might need to control the number of relations in a batch so that joint negative sampling can be effective. Our experience is that if we reduce the number of relations in a batch, the performance of the trained embeddings drops. I think we need some experiments to balance computation efficiency and training speed. It'll be great if you can contribute this functionality. Please let us know if you have any questions about the current code base.

vardaan123 commented 3 years ago

Hi @zheng-da @asaluja I have the same use-case i.e. to sample negative samples with constraints on type of head/tail entity. As suggested by you, I set the seed edges to be the edges that belong to a particular edge-type/relation. However, the EvalSampler (or dgl.contrib.sampling.EdgeSampler) corrupts the edges by randomly sampling a node for head or tail position from the set of all entities (which includes both heads and tails). I want the head to be amongst all possible heads in the seed edges (and similarly for tail corruption). Any suggestions how this can be achieved? Thanks in advance.
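The constraint asked for here (corrupt heads only with entities seen as heads in the seed edges, and likewise for tails) can be expressed outside the sampler as a minimal sketch. The helper names below are hypothetical and this does not use or modify `dgl.contrib.sampling.EdgeSampler`; it only shows how the per-relation pools could be derived from the seed triples themselves.

```python
import random
from collections import defaultdict

def build_candidate_pools(seed_triples):
    """Collect, per relation, the entities observed as heads and as tails."""
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in seed_triples:
        heads[r].add(h)
        tails[r].add(t)
    # Sorted lists make random.choice usable and the pools reproducible.
    return ({r: sorted(s) for r, s in heads.items()},
            {r: sorted(s) for r, s in tails.items()})

def corrupt(triple, head_pool, tail_pool, corrupt_head, rng):
    """Replace the head (or tail) with an entity from the matching pool."""
    h, r, t = triple
    if corrupt_head:
        return (rng.choice(head_pool[r]), r, t)
    return (h, r, rng.choice(tail_pool[r]))

# Toy seed edges (hypothetical) for a single relation.
seed_triples = [(0, "born_in", 10), (1, "born_in", 11), (2, "born_in", 10)]
head_pool, tail_pool = build_candidate_pools(seed_triples)

rng = random.Random(0)
neg_head = corrupt(seed_triples[0], head_pool, tail_pool, corrupt_head=True, rng=rng)
neg_tail = corrupt(seed_triples[0], head_pool, tail_pool, corrupt_head=False, rng=rng)
```

Because the pools come only from observed edges, a corrupted triple can still be a true triple; filtering those out would be a separate step.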

YijianLiu commented 2 years ago

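Hello @zheng-da @asaluja, I have the same use case, i.e. sampling negative samples with constraints on the type of the head/tail entity. As you suggested, I set the seed edges to be the edges that belong to a particular edge type/relation. However, the EvalSampler (or dgl.contrib.sampling.EdgeSampler) corrupts the edges by randomly sampling a node for the head or tail position from the set of all entities (which includes both heads and tails). I want the head to be one of the possible heads in the seed edges (and similarly for tail corruption). Any suggestions on how this can be achieved? Thanks in advance.

Hello, I think you have studied the code in detail, so I want to ask you a question. When sampling, I see that pos_g has 1024 edges and neg_g also has 1024 edges. So it seems to corrupt every triplet once, not k times as described in the paper. Is that right?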

awslabs / dgl-ke