The Cora dataset may be wrong

chencsgit commented 3 years ago

Hi authors, I have read your paper, which is quite interesting. Thank you for your great work.

I have a question about the split of the Cora dataset.

I counted the number of nodes in train_mask, val_mask, and test_mask at https://github.com/chennnM/GCNII/blob/ca91f5686c4cd09cc1c6f98431a5d5b7e36acc92/process.py#L157 and got 1192, 796, and 497, respectively. These add up to 2,485, not the 2,708 nodes reported in your paper.

You can reproduce this with the following code:

    print('train_mask is %s' % train_mask.numpy().sum())
    print('val_mask is %s' % val_mask.numpy().sum())
    print('test_mask is %s' % test_mask.numpy().sum())

I don't understand why this happens. Could you please explain? I look forward to your response. Thanks!
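A slightly fuller check can also report how many nodes fall into no split at all, or into more than one. The sketch below is only an illustration: `summarize_split` is a hypothetical helper that is not part of the GCNII repository, and the masks are synthetic ones sized to match the counts reported above, not the real Cora split.

```python
import numpy as np

def summarize_split(train_mask, val_mask, test_mask):
    """Count nodes per split, plus nodes in several splits or in none."""
    masks = np.stack([np.asarray(m, dtype=bool)
                      for m in (train_mask, val_mask, test_mask)])
    per_node = masks.sum(axis=0)  # number of splits each node belongs to
    return {
        "train": int(masks[0].sum()),
        "val": int(masks[1].sum()),
        "test": int(masks[2].sum()),
        "overlapping": int((per_node > 1).sum()),
        "unused": int((per_node == 0).sum()),
        "total": int(masks.shape[1]),
    }

# Synthetic, disjoint masks with the reported sizes (assumption: NOT real data).
num_nodes = 2708
rng = np.random.default_rng(0)
perm = rng.permutation(num_nodes)

train_mask = np.zeros(num_nodes, dtype=bool)
val_mask = np.zeros(num_nodes, dtype=bool)
test_mask = np.zeros(num_nodes, dtype=bool)
train_mask[perm[:1192]] = True        # 1192 training nodes
val_mask[perm[1192:1988]] = True      # 796 validation nodes
test_mask[perm[1988:2485]] = True     # 497 test nodes

summary = summarize_split(train_mask, val_mask, test_mask)
print(summary)
```

With the actual masks from process.py, passing `train_mask.numpy()` and the other two masks converted the same way should work, assuming they are CPU tensors as in the snippet above.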

chencsgit commented 3 years ago

The Citeseer and Pubmed datasets are correct; only the Cora dataset has this problem.

chennnM commented 3 years ago

The dataset split we used comes from GEOM-GCN. In that split, some nodes are not used for training, validation, or testing, which is a mistake. Thank you for pointing this out. I didn't check the data split beforehand, but I did ensure that all baselines used the same data split.

chencsgit commented 3 years ago

Thank you for your reply.

chennnM / GCNII, issue #10