dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

Confusing line in src/graph/sampler.cc involving CHECK_EQ for num_chunks, neg_sample_size, and nodes. #1932

Closed: dchang56 closed this issue 4 years ago

dchang56 commented 4 years ago

In line 1315 of sampler.cc, there's a check as follows:

CHECK_EQ(num_chunks * neg_sample_size, global_neg_vids.size());

I'm wondering why you'd want num_chunks * neg_sample_size == global_neg_vids.size(), and not num_chunks * neg_sample_size == num_neg_edges.

If neg_sample_size and num_chunks determine the total number of negative edges to sample, then shouldn't it be the latter?

I'd appreciate any insight on this.

What's weirder is that I used the sampler to do 1:1 sampling (neg_sample_size=1, chunk_size=1) on the WN18 validation set to do triple classification, and it worked. However, when I tried doing the same thing for my custom dataset, it failed the CHECK_EQ line I mentioned above.

The WN18 validation set has 5000 triples, so with neg_sample_size=1 and chunk_size=1 it should produce a pos_g of size 5000 and a neg_g of size 5000. By my reading it should also have failed the CHECK_EQ line, because num_chunks = num_pos_edges / chunk_size = 5000 and 5000 * 1 != global_neg_vids.size(), which is 40943; but it does not fail and works just fine.

In contrast, my custom dataset's validation set has 293713 triples, with neg_sample_size=1 and chunk_size=1. So num_chunks is 293713, and 293713 * 1 != 97238 (the number of nodes), so it fails.

The exact error message is: Check failed: num_chunks * neg_sample_size == global_neg_vids.size() (293713 vs. 97238)
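To make the arithmetic concrete, here is a plain-Python sketch of what I believe the check is doing. The assumptions that num_chunks = num_pos_edges / chunk_size and that global_neg_vids holds every node are mine; the dataset sizes are the ones quoted above.

# Sketch of the failing check, not DGL's actual code.
def mimic_check_eq(num_pos_edges, num_nodes, chunk_size=1, neg_sample_size=1):
    """Mimics CHECK_EQ(num_chunks * neg_sample_size, global_neg_vids.size())."""
    num_chunks = num_pos_edges // chunk_size
    lhs = num_chunks * neg_sample_size
    rhs = num_nodes  # assumes global_neg_vids contains every candidate vertex
    print("ok" if lhs == rhs else f"Check failed ({lhs} vs. {rhs})")

mimic_check_eq(num_pos_edges=5000, num_nodes=40943)    # WN18: Check failed (5000 vs. 40943)
mimic_check_eq(num_pos_edges=293713, num_nodes=97238)  # custom: Check failed (293713 vs. 97238)

By this arithmetic both datasets should trip the check, which is exactly why the WN18 run passing confuses me.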

Is this a bug? I'd appreciate any insights!

classicsong commented 4 years ago

Negative edges are created by corrupting the head or tail vertices of positive edges, so we are actually sampling negative vertices.

In genChunkedNegEdgeSubgraph there is a neg_mode argument indicating whether we are corrupting head vertices or tail vertices.
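Roughly, in pseudocode (a sketch of the idea, not the actual sampler; the triple layout and uniform sampling are just for illustration):

import random

def corrupt_edges(pos_edges, num_nodes, neg_sample_size, neg_mode="tail"):
    """Build negative edges by swapping a sampled vertex into each positive."""
    neg_edges = []
    for h, r, t in pos_edges:  # (head, relation, tail) triples
        for _ in range(neg_sample_size):
            v = random.randrange(num_nodes)  # the sampled negative vertex
            neg_edges.append((v, r, t) if neg_mode == "head" else (h, r, v))
    return neg_edges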

dchang56 commented 4 years ago

I understand that for the link prediction evaluation task we sample all possible nodes (global_neg_vids.size()) for each positive triple, but I'm not sure that explains why num_chunks * neg_sample_size must equal global_neg_vids.size() in this case. What about tasks like triple classification, where you only want one negative edge per positive triple? That's what I'm trying to do. I think I circumvented the problem by using "head" or "tail" instead of "chunk-head" or "chunk-tail" as the mode, though I'm not exactly sure why that works.
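Here is how I currently picture the difference between the two modes; this is my own sketch, not DGL's implementation, and the function and its arguments are made up for illustration:

import random

def pick_neg_vertices(num_pos_edges, num_nodes, neg_sample_size,
                      chunk_size=1, chunked=False):
    if chunked:
        # "chunk-head"/"chunk-tail": one shared set of negatives per chunk,
        # so the total pool is num_chunks * neg_sample_size vertices.
        num_chunks = num_pos_edges // chunk_size
        return [[random.randrange(num_nodes) for _ in range(neg_sample_size)]
                for _ in range(num_chunks)]
    # "head"/"tail": each positive edge draws its own negatives independently.
    return [[random.randrange(num_nodes) for _ in range(neg_sample_size)]
            for _ in range(num_pos_edges)]

If that picture is right, the chunked path ties the negative pool to a global vertex set while the plain path does not, which would explain why switching modes sidesteps the CHECK_EQ.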

Am I making any sense, or is there something I'm missing?

classicsong commented 4 years ago

This sampler is designed for dgl-ke, a specific knowledge graph embedding toolkit, to sample negative edges for positive edges. I don't think it can work for triple classification.