dchang56 closed this issue 4 years ago
Negative edges are created by corrupting the head or tail vertices of the positive edges, so we are actually sampling negative vertices.
For genChunkedNegEdgeSubgraph, there is an nge_mode indicating whether we are corrupting head vertices or tail vertices.
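Roughly, the chunked corruption works like this (a minimal Python sketch of the idea only, not the actual C++ code in sampler.cc; the function name and signature here are made up):

import numpy as np

def make_chunked_negatives(heads, tails, chunk_size, neg_sample_size,
                           num_nodes, corrupt_head):
    # Simplified sketch: every chunk of chunk_size positive edges shares the
    # same neg_sample_size randomly drawn vertices, and each positive edge in
    # the chunk is corrupted with each of those vertices.
    num_chunks = len(heads) // chunk_size
    neg_heads, neg_tails = [], []
    for c in range(num_chunks):
        lo, hi = c * chunk_size, (c + 1) * chunk_size
        neg_vids = np.random.randint(0, num_nodes, size=neg_sample_size)
        for i in range(lo, hi):
            for v in neg_vids:
                if corrupt_head:                # 'chunk-head': replace the head
                    neg_heads.append(v)
                    neg_tails.append(tails[i])
                else:                           # 'chunk-tail': replace the tail
                    neg_heads.append(heads[i])
                    neg_tails.append(v)
    return np.array(neg_heads), np.array(neg_tails)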
I understand that for the link prediction evaluation task we sample all possible nodes (global_neg_vids.size()) for each positive triple, but I'm not sure that explains why num_chunks * neg_sample_size must equal global_neg_vids.size() in this case. What about cases like triple classification, where you only want to sample 1 negative edge for each positive triple? That's what I'm trying to do. I think I circumvented the problem by using "head" or "tail" instead of "chunk-head" or "chunk-tail" as the mode, though I'm not exactly sure why that works; the setup I used is sketched below.
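For reference, the non-chunked 1:1 setup I'm describing looks roughly like this (a sketch assuming DGL 0.4.x's dgl.contrib.sampling.EdgeSampler; the toy graph, exact keyword arguments, and iteration behavior are approximate and only for illustration):

import dgl
from dgl.contrib.sampling import EdgeSampler

# Toy positive graph standing in for the validation triples (made-up data).
g = dgl.DGLGraph()
g.add_nodes(10)
g.add_edges([0, 1, 2, 3], [4, 5, 6, 7])
g.readonly()  # the old sampler expects an immutable graph, if I recall correctly

# Non-chunked 'tail' mode: each positive edge gets its own neg_sample_size
# corrupted tails, so neg_sample_size=1 gives 1:1 positive/negative sampling.
sampler = EdgeSampler(g,
                      batch_size=2,            # positive edges per batch
                      negative_mode='tail',    # or 'head' to corrupt heads
                      neg_sample_size=1,
                      exclude_positive=False,
                      shuffle=False)

for pos_g, neg_g in sampler:
    # In this 1:1 setup the two subgraphs should have the same number of edges.
    print(pos_g.number_of_edges(), neg_g.number_of_edges())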
Am I making any sense, or is there something I'm missing?
This sampler is designed for dgl-ke, a specific knowledge graph embedding toolkit, to sample negative edges for positive edges. I don't think it can work with triple classification.
In line 1315 of sampler.cc, there's a check as follows:
CHECK_EQ(num_chunks * neg_sample_size, global_neg_vids.size());
I'm wondering why you'd want num_chunks * neg_sample_size == global_neg_vids.size(), and not num_chunks * neg_sample_size == num_neg_edges.
If neg_sample_size and num_chunks determine the total number of negative edges you want to sample, shouldn't it be the latter?
I'd appreciate any insight on this.
What's weirder is that I used the sampler for 1:1 sampling (neg_sample_size=1, chunk_size=1) on the WN18 validation set to do triple classification, and it worked. However, when I tried doing the same thing on my custom dataset, it failed the CHECK_EQ line I mentioned above.
The WN18 validation set has 5000 triples, so with neg_sample_size=1 and chunk_size=1 it should produce a pos_g of size 5000 and a neg_g of size 5000. It should also have failed the CHECK_EQ line, because num_chunks would be num_pos_edges/chunk_size = 5000, and 5000*1 != global_neg_vids.size(), which is 40943; yet it does not fail and works just fine.
In contrast, my custom dataset's validation set has 293713 triples, again with neg_sample_size=1 and chunk_size=1. So num_chunks is 293713, and 293713*1 != 97238 (the number of nodes), so it fails.
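To make the arithmetic concrete, here is the check restated with the numbers quoted above (check_passes is just an illustrative helper, not anything in DGL):

# Restating the CHECK_EQ arithmetic with the numbers from this thread.
def check_passes(num_pos_edges, chunk_size, neg_sample_size, num_nodes):
    num_chunks = num_pos_edges // chunk_size
    return num_chunks * neg_sample_size == num_nodes   # global_neg_vids.size()

print(check_passes(5000, 1, 1, 40943))     # WN18 valid set: False, yet sampling ran fine
print(check_passes(293713, 1, 1, 97238))   # custom valid set: False, and CHECK_EQ fires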
The exact error message is: Check failed: num_chunks * neg_sample_size == global_neg_vids.size() (293713 vs. 97238)
Is this a bug? I'd appreciate any insights!