dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
13.43k stars, 3k forks

How to create own Dataset like builtin? What is train_mask for? #5001

Closed eric2213 closed 1 year ago

eric2213 commented 1 year ago

Hello, I'm new to machine learning. I'm trying to train RGCN on my own dataset, but I don't know how to create 'train_mask'.

In RGCN/link.py, when I print g.edata for FB15k237Dataset, the output includes these boolean tensors as edge data:

'train_mask': tensor([ True,  True,  True,  ..., False, False, False]), 
'test_mask': tensor([False, False, False,  ...,  True,  True,  True]), 
'val_mask': tensor([False, False, False,  ..., False, False, False]),
'train_edge_mask': tensor([ True,  True,  True,  ..., False, False, False]), 
'valid_edge_mask': tensor([False, False, False,  ..., False, False, False]), 
'test_edge_mask': tensor([False, False, False,  ...,  True,  True,  True]), 

I'm wondering what these masks are for. Do they represent true facts and false facts (triplets and negative triplets)? Or are they created randomly, with code like the following: g.edata['train_mask'] = torch.zeros(1000, dtype=torch.bool).bernoulli(0.6)

image

I picked some triplets from FB15K-237 as examples. Here is how I create a heterogeneous graph; is this a proper way?

import dgl
import torch

# Each key is a (source node type, relation, destination node type) triple;
# each value is a (source node IDs, destination node IDs) pair of tensors.
data_dict = {
    ('entity', '/travel/travel_destination/climate./travel/travel_destination_monthly_climate/month', 'entity'): (torch.tensor([0]), torch.tensor([1])),
    ('entity', '/music/performance_role/regular_performances./music/group_membership/group', 'entity'): (torch.tensor([2]), torch.tensor([3])),
    ('entity', '/location/location/contains', 'entity'): (torch.tensor([4, 4]), torch.tensor([5, 6]))
}

g = dgl.heterograph(data_dict)

With this heterograph, how do I create masks and split the data into train/valid/test sets like the built-in dataset?

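I'm not sure whether I can ask in Chinese; my English is poor... I don't understand the purpose of the masks. In other tutorials I saw that they can be used to split the data into training, validation, and test sets, but there the tensor was generated randomly: g.edata['train_mask'] = torch.zeros(1000, dtype=torch.bool).bernoulli(0.6)

Taking the original FB15K237 data as an example (image), how should I create the masks so that my dataset looks like the built-in FB15k237Dataset and can be passed directly to link.py in RGCN?

I'm new to programming, so my questions may be very basic; please bear with me. Thanks in advance for any replies!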

czkkkkkk commented 1 year ago

Hi @eric2213 .

peizhou001 commented 1 year ago

Hi @eric2213, each train/test/val mask is a boolean tensor that indicates whether the item at the same index is chosen for training, testing, or validation. For the built-in FB15K237, you can see the tutorial for FB15k237Dataset.
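To make the role of a mask concrete, here is a minimal sketch using plain PyTorch tensors. The edge lists and mask values are made up for illustration (they reuse the node IDs from the question) and are not taken from the built-in dataset. The point is that a mask marks which split an edge belongs to; by itself it does not say whether a triplet is a true or a negative fact.

```python
import torch

# Hypothetical edge list: edge i goes from src[i] to dst[i].
src = torch.tensor([0, 2, 4, 4])
dst = torch.tensor([1, 3, 5, 6])

# Hypothetical mask: True means "edge i belongs to the training split".
train_mask = torch.tensor([True, True, False, True])

# Boolean indexing keeps only the entries where the mask is True.
train_src = src[train_mask]
train_dst = dst[train_mask]

# The same mask can be converted to the IDs of the training edges.
train_eids = torch.nonzero(train_mask, as_tuple=False).squeeze(1)
```

Here `train_eids` holds the positions of the `True` entries, which is the usual way a mask is turned into the set of edges used for one split.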

I'm not sure what your test/train/valid text files contain. If they hold the masks or indices, you can read and use them directly.
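If the files instead hold pre-split triplets (an assumption; the thread does not confirm their contents), one possible approach is to concatenate the splits and mark each edge by its position. The sketch below uses hypothetical integer triplets in place of data read from files, and it assumes the graph is then built so that edge IDs follow the concatenated order.

```python
import torch

# Hypothetical pre-split triplets as (head ID, relation ID, tail ID).
train_triplets = [(0, 0, 1), (2, 1, 3)]
valid_triplets = [(4, 2, 5)]
test_triplets = [(4, 2, 6)]

# Concatenate the splits so that edge i corresponds to row i of `triplets`.
triplets = torch.tensor(train_triplets + valid_triplets + test_triplets)
n_train, n_valid = len(train_triplets), len(valid_triplets)
num_edges = triplets.shape[0]

# Start with all-False masks, then mark each split's slice as True.
train_mask = torch.zeros(num_edges, dtype=torch.bool)
val_mask = torch.zeros(num_edges, dtype=torch.bool)
test_mask = torch.zeros(num_edges, dtype=torch.bool)
train_mask[:n_train] = True
val_mask[n_train:n_train + n_valid] = True
test_mask[n_train + n_valid:] = True
```

This ordering produces masks with the same shape as the printed output in the question: `train_mask` is `True` at the start and `test_mask` is `True` at the end.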

Or, if you want to create the masks yourself, a simple random split of the original nodes/edges will do.
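A random split can be sketched as follows. The helper name `random_split_masks` and the 80/10/10 fractions are illustrative assumptions, not part of DGL. Shuffling edge IDs once and slicing the permutation keeps the three masks disjoint, which independent `bernoulli` draws per mask would not guarantee.

```python
import torch

def random_split_masks(num_edges, train_frac=0.8, val_frac=0.1, seed=0):
    """Return disjoint boolean train/val/test masks over `num_edges` edges."""
    gen = torch.Generator().manual_seed(seed)
    # Shuffle the edge IDs once, then cut the permutation into three parts.
    perm = torch.randperm(num_edges, generator=gen)
    n_train = int(train_frac * num_edges)
    n_val = int(val_frac * num_edges)

    train_mask = torch.zeros(num_edges, dtype=torch.bool)
    val_mask = torch.zeros(num_edges, dtype=torch.bool)
    test_mask = torch.zeros(num_edges, dtype=torch.bool)
    train_mask[perm[:n_train]] = True
    val_mask[perm[n_train:n_train + n_val]] = True
    # Whatever is left after train and val goes to test.
    test_mask[perm[n_train + n_val:]] = True
    return train_mask, val_mask, test_mask

train_mask, val_mask, test_mask = random_split_masks(1000)
```

For a graph with a single edge type, the masks could then be stored with `g.edata['train_mask'] = train_mask` and so on. A heterograph with several edge types, like the one in the question, would likely need one set of masks per edge type.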

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

czkkkkkk commented 1 year ago

Hi @eric2213 , I am closing this issue assuming you are happy about our response. Feel free to follow up and reopen the issue if you have more questions with regard to our response.