sidazhou opened this issue 2 years ago
Narrowed it down to an output_nodes.shape mismatch with mfgs[1].dstnodes().shape, 100 vs 99. Why is this? Surely it's a bug?
So the issue seems to occur when seed_nodes contains duplicated IDs. Is this a bug or a feature?
It seems that the to_block call during sampling removes duplicated nodes, which causes an inconsistency between the number of destination nodes and the size of the destination node features. @BarclayII I guess we should check for possible duplicates in seed_nodes before sampling. What do you think?
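A minimal sketch of such a pre-sampling check (check_unique_seeds is a hypothetical helper, not an existing DGL function):

import torch

def check_unique_seeds(seed_nodes):
    # Hypothetical helper: fail fast if seed_nodes contains duplicates,
    # instead of letting to_block silently merge them during sampling.
    n_dup = seed_nodes.numel() - torch.unique(seed_nodes).numel()
    if n_dup > 0:
        raise ValueError(
            f"seed_nodes contains {n_dup} duplicated IDs; "
            "deduplicate them before sampling.")

check_unique_seeds(torch.LongTensor([1, 2, 3]))   # passes
# check_unique_seeds(torch.LongTensor([8, 8]))    # would raise ValueError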
Surely it's a bug, right? Because the dataloader is yielding mfgs that cannot be used as input for model().
Hi, I am also facing this problem. The seed_nodes I input contain some duplicated IDs, but I need the embeddings for each of these duplicates. Is there any solution for now?
I tried to use dgl.dataloading.MultiLayerFullNeighborSampler to sample blocks for a set of seed_nodes that contains duplicated items. If I sample on CPU, the returned MFG is inconsistent; however, if I sample on GPU, the duplicated seed nodes are not removed. I think sampling results on different devices should be the same.
import torch
import dgl
src = torch.LongTensor(
[0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10,
1, 2, 3, 3, 3, 4, 5, 5, 6, 5, 8, 6, 8, 9, 8, 11, 11, 10, 11])
dst = torch.LongTensor(
[1, 2, 3, 3, 3, 4, 5, 5, 6, 5, 8, 6, 8, 9, 8, 11, 11, 10, 11,
0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10])
g = dgl.graph((src, dst))
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
# Sample in CPU
idx = torch.LongTensor([8,8])
src_nodes, dst_nodes, mfgs = sampler.sample_blocks(g, idx)
print(dst_nodes) # tensor([8, 8])
print(mfgs[-1].num_dst_nodes()) # 1
print(mfgs[-1].dstdata) # {'_ID': tensor([8, 8])}
# Inconsistent: dst_nodes has 2 entries but the block kept only 1 destination node
# Sample in GPU
device = torch.device('cuda:0')
src_nodes, dst_nodes, mfgs = sampler.sample_blocks(g.to(device), idx.to(device))
print(dst_nodes) # tensor([8, 8], device='cuda:0')
print(mfgs[-1].num_dst_nodes()) # 2
print(mfgs[-1].dstdata) # {'_ID': tensor([8, 8], device='cuda:0')}
# Consistent: the block keeps both (duplicated) destination nodes
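For reference, the CPU-side mismatch is exactly what later surfaces as the row-count error: a feature tensor with one row per seed cannot be attached to the block, because the block kept only one destination node. A small sketch continuing from the code above (the exact message may vary across DGL versions):

# Re-sample on CPU: cpu_dst has 2 entries, but the block has 1 destination node.
_, cpu_dst, cpu_blocks = sampler.sample_blocks(g, idx)
try:
    cpu_blocks[-1].dstdata['h'] = torch.ones(cpu_dst.numel(), 16)
except dgl.DGLError as e:
    print(e)  # e.g. "Expected data to have 1 rows, got 2."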
BTW, it seems that sampling on GPU still can't solve the problem of duplicated nodes in a heterograph:
import torch
import dgl
import dgl.nn.pytorch as dglnn
src = torch.LongTensor(
[0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10,
1, 2, 3, 3, 3, 4, 5, 5, 6, 5, 8, 6, 8, 9, 8, 11, 11, 10, 11])
dst = torch.LongTensor(
[1, 2, 3, 3, 3, 4, 5, 5, 6, 5, 8, 6, 8, 9, 8, 11, 11, 10, 11,
0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10])
graph_data = {
('user', 'plays', 'game') : (src, dst),
('user', 'follows', 'user'): (torch.LongTensor([0, 1, 2, 3]), torch.LongTensor([5, 6, 7, 8]))
}
g = dgl.heterograph(graph_data)
g.nodes['user'].data['h'] = torch.ones(g.num_nodes('user'), 16)
g.nodes['game'].data['h'] = torch.ones(g.num_nodes('game'), 16)
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
uid = torch.LongTensor([0, 0, 2, 2, 4, 4])
device = torch.device('cuda:0')
src_nodes, dst_nodes, mfgs = sampler.sample_blocks(g.to(device), {'user': uid.to(device)})
print(dst_nodes) # {'user': tensor([0, 0, 2, 2, 4, 4], device='cuda:0')}
print(mfgs[-1].num_dst_nodes()) # 6
conv1 = dglnn.HeteroGraphConv({
'plays': dglnn.SAGEConv(16, 32, 'gcn'),
'follows': dglnn.SAGEConv(16, 32, 'gcn')
}, 'sum').to(device)
conv2 = dglnn.HeteroGraphConv({
'plays': dglnn.SAGEConv(32, 32, 'gcn'),
'follows': dglnn.SAGEConv(32, 32, 'gcn')
}, 'sum').to(device)
out = mfgs[0].srcdata['h']
print(mfgs[0].num_dst_nodes()) # 3
print(len(out['game'])) # 0
print(len(out['user'])) # 3
out = conv1(mfgs[0], out)
print(mfgs[1].num_dst_nodes()) # 6
print(len(out['game'])) # 0
print(len(out['user'])) # 3
out = conv2(mfgs[1], out) # Error: out['user'] has 3 rows but mfgs[1] has 6 destination nodes
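A possible workaround for the heterogeneous case, continuing from the snippet above (g, sampler, uid, device as defined there); this is only a sketch, not an official DGL API: deduplicate the seed dict per node type and keep the inverse index so each duplicated seed can still be given its embedding afterwards.

seeds = {'user': uid.to(device)}
unique_seeds, inverse = {}, {}
for ntype, ids in seeds.items():
    # torch.unique also returns, for every original entry, its position in the
    # deduplicated tensor, which lets us restore the duplicated order later.
    unique_seeds[ntype], inverse[ntype] = torch.unique(ids, return_inverse=True)
src_nodes, dst_nodes, mfgs = sampler.sample_blocks(g.to(device), unique_seeds)
print(dst_nodes['user'].numel() == mfgs[-1].num_dst_nodes())  # expected: True
# After the model produces h['user'] with one row per unique user seed,
# h['user'][inverse['user']] gives one row per original (duplicated) seed.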
Sorry, we currently don't support duplicate values in the seed nodes for the sampler. We've added it to our backlog to be prioritized against other feature requests on our roadmap.
I also ran into this error on the paper100M dataset. Has this bug been fixed yet? Are there any other potential solutions?
This hasn't been solved yet. We suggest that users explicitly deduplicate seed nodes.
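For anyone who also needs the embeddings of the duplicated seeds (as asked above), a sketch of that workaround; embed_with_duplicates, model, and the 'feat' feature name are placeholders, not DGL APIs:

import torch

def embed_with_duplicates(g, sampler, seeds, model):
    # Hypothetical helper: sample and compute on unique seeds only, then use
    # the inverse index from torch.unique to copy each embedding back to
    # every duplicated position, so the output has one row per input seed.
    unique_seeds, inverse = torch.unique(seeds, return_inverse=True)
    input_nodes, output_nodes, blocks = sampler.sample_blocks(g, unique_seeds)
    x = blocks[0].srcdata['feat']   # assumes node features are stored under 'feat'
    h = model(blocks, x)            # one embedding per unique seed, in seed order
    return h[inverse]               # one embedding per original (duplicated) seed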
🐛 Bug
DGLError('Expected data to have %d rows, got %d.') occurs at a large batch_size and does not occur at a smaller batch_size. The larger the batch_size, the larger the difference in rows. It feels like a rounding error somewhere.
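One way to confirm that duplicated seed IDs (rather than rounding) are the cause is to count them per batch; larger batches presumably contain more repeats, which would explain why the row difference grows with batch_size. A sketch with a placeholder index tensor:

import torch

# Placeholder index with repeated seed IDs, standing in for the real input.
index = torch.LongTensor([0, 1, 2, 3, 2, 4, 5, 4, 6, 4])
for batch_size in (2, 5, 10):
    for i in range(0, index.numel(), batch_size):
        batch = index[i:i + batch_size]
        n_dup = batch.numel() - torch.unique(batch).numel()
        if n_dup > 0:
            # This gap should match the one in "Expected data to have N rows, got M."
            print(f"batch_size={batch_size}: {n_dup} duplicated seed IDs in a batch")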
To Reproduce
Expected behavior
Shouldn't raise a DGLError.
Environment
Additional context
model:
Error stack: