Closed: Jaykim148 closed this issue 2 months ago
I was able to run your code with the latest DGL and encountered the same issue after enabling negative sampling.
https://colab.research.google.com/drive/1ydHac4edafYsOl4dqIN05uirRnArc8I1?usp=sharing
sample_uniform_negative requires the item sampler to return an N*2-shaped tensor, with each row holding two items, src and dst. In your case, however:
seeds={'user:like:item': tensor([[890374, 950250, 577883,  ..., 173522, 504136, 545573],
                                 [ 13905, 977182,  56186,  ..., 395520, 218738, 729982]])}
i.e. the seeds are 2 * N.
We did not fix the length of each row to 2 because there are use cases for hyperedges, which may contain more than two nodes; that is why this can cause confusion.
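For clarity, here is a minimal sketch (values and variable names are placeholders, not taken from the report) contrasting the two layouts; only the row-wise N*2 tensor is accepted by negative sampling:

import torch
import dgl.graphbolt as gb

src = torch.tensor([0, 1, 2, 3])      # placeholder user IDs
dst = torch.tensor([10, 11, 12, 13])  # placeholder item IDs

seeds_ok = torch.stack([src, dst], dim=1)   # shape (N, 2): one (src, dst) pair per row
seeds_bad = torch.stack([src, dst], dim=0)  # shape (2, N): the layout shown in the report above

# Wrapping the row-wise tensor gives an item set that works with sample_uniform_negative.
item_set = gb.ItemSet((seeds_ok,), names=('seeds',))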
The seeds come from OnDiskDataset.task.train_set. I believe that for link prediction, the itemset from OnDiskDataset should generate results that work with the later pipeline stages. I also reported two other issues:
These problems might be related to the data shape coming out of OnDiskDataset.task.train_set (2*N instead of N*2).
I just confirmed that all the issues I mentioned in this report came from the input data shape. I converted the data shape from OnDiskDataset.task.train_set using the following code, and everything works without any of the issues mentioned above.
item_set = gb.ItemSetDict(
    {key: gb.ItemSet((val._items[0].T,), names=('seeds',))  # transpose each (2, N) tensor to (N, 2)
     for key, val in dataset.tasks[0].train_set._itemsets.items()}
)
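For context, here is a sketch (not part of the original report; it assumes dataset is the loaded OnDiskDataset, and the batch size and negative ratio are illustrative) of how the converted item_set would then feed a typical GraphBolt link-prediction pipeline:

import dgl.graphbolt as gb

graph = dataset.graph  # the sampling graph loaded by OnDiskDataset

datapipe = gb.ItemSampler(item_set, batch_size=4096, shuffle=True)
# With row-wise (N, 2) seeds, this step no longer raises the
# AssertionError about torch.Size([2, 4000000]).
datapipe = datapipe.sample_uniform_negative(graph, 5)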
A better way to resolve this could be to save the edges that go into the itemset separately, in transposed form.
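For illustration, a hedged sketch of that approach at data-preparation time (file names are hypothetical; the sizes match the report, and the split between (2, N) graph edges and (N, 2) train-set seeds follows the resolution later in this thread):

import numpy as np

num_users, num_items, num_edges = 10**6, 10**3, 4_000_000
src = np.random.randint(0, num_users, size=num_edges)
dst = np.random.randint(0, num_items, size=num_edges)

# Edges that define the graph structure keep the (2, N) layout.
np.save("user_like_item_edges.npy", np.stack([src, dst], axis=0))

# The same edges saved separately, transposed to (N, 2), to be referenced
# by the link-prediction train_set in metadata.yaml.
np.save("user_like_item_train_seeds.npy", np.stack([src, dst], axis=1))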
There is no way to change the numpy array shape from the beginning. The numpy array should have shape (2, N) according to this doc: https://docs.dgl.ai/en/2.1.x/stochastic_training/ondisk_dataset_heterograph.html
See the updated notebook: https://colab.research.google.com/drive/1ydHac4edafYsOl4dqIN05uirRnArc8I1?usp=sharing
Oh, I got it. I thought that all edges should have the (2, N) shape, including the training set, but the (2, N) shape is required only for defining the graph edges. Thank you for debugging.
🐛 Bug
I observed several bugs related to ItemSampler with OnDiskDataset.
To Reproduce
Steps to reproduce the behavior:
code sample:
Error message with sample_uniform_negative
AssertionError: Only tensor with shape N*2 is supported for negative sampling, but got torch.Size([2, 4000000]).
Error from different numbers of nodes per node type (user_num_nodes = 10^6, item_num_nodes = 10^3)
Generated minibatch result
Expected behavior
Environment
Additional context