wyh-Dreamer opened this issue 3 months ago
The code uses self.class_dict and data.retrieve_class to match labels to node indices; it does not depend on the enumeration order.
Thank you for your reply! I understand your explanation: self.class_dict and data.retrieve_class randomly sample features from the large real graph according to the labels. Is that right?
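For the record, the sampling I have in mind looks roughly like this (a standalone sketch, not the repository's actual code; the boolean-mask lookup is my assumption about what class_dict caches):

import numpy as np

def retrieve_class(labels_train, c, num):
    """Sketch: randomly pick `num` training nodes whose label equals `c`."""
    idx = np.arange(len(labels_train))
    class_mask = labels_train == c   # what class_dict plausibly caches per class
    return np.random.permutation(idx[class_mask])[:num]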
However, what I want to ask about is the feature initialization process that uses label information on the synthetic graph. The function generate_labels_syn generates the synthetic labels by enumerating sorted_counter, whereas the function get_sub_adj_feat generates the synthetic features by enumerating data.nclass. Isn't there an index mismatch between the initial features and the initial labels here?
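To make the order question concrete, here is a minimal standalone sketch (the toy labels are made up, not Cora's real ones) of how the two enumeration orders diverge:

from collections import Counter

# Hypothetical training labels; class 2 is rarest, class 0 most frequent.
labels_train = [0, 0, 0, 1, 1, 2]

counter = Counter(labels_train)
sorted_counter = sorted(counter.items(), key=lambda x: x[1])

# Order used by generate_labels_syn (least frequent class first):
print([c for c, num in sorted_counter])   # [2, 1, 0]

# Order used by the original get_sub_adj_feat:
print(list(range(len(counter))))          # [0, 1, 2]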
I'm sorry, but your answer doesn't seem to solve this problem. T_T
Take the Cora dataset as an example. Enumerating data.nclass visits the classes in the order [0, 1, 2, 3, 4, 5, 6], while enumerating sorted_counter visits them in the order [6, 1, 5, 0, 4, 2, 3]. As a result, feat_syn[0] is a feature sampled from class 0, but its corresponding label on the synthetic graph is class 6. Is that right?
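Here is a standalone sketch of that pairing (the per-class counts are made up so that the sorted order reproduces [6, 1, 5, 0, 4, 2, 3]; they are not Cora's real counts):

from collections import Counter

# Hypothetical synthetic per-class counts for Cora's 7 classes.
counter = Counter({6: 1, 1: 2, 5: 3, 0: 4, 4: 5, 2: 6, 3: 7})
sorted_counter = sorted(counter.items(), key=lambda x: x[1])

# Labels follow the sorted (least frequent first) order: 6, 1, 5, 0, ...
labels_syn = [c for c, num in sorted_counter for _ in range(num)]

# Features are sampled in plain class-index order: 0, 1, 2, ...
feat_src_class = [c for c in range(7) for _ in range(counter[c])]

print(labels_syn[0], feat_src_class[0])   # 6 0 -> feat_syn[0] is mislabeled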
Sorry for the late response, and thank you for pointing this out! I agree with your observation; the two orders are indeed inconsistent. It would be better to change the get_sub_adj_feat function to:
def get_sub_adj_feat(self, features):
    data = self.data
    args = self.args
    idx_selected = []

    from collections import Counter
    counter = Counter(self.labels_syn.cpu().numpy())

    # Walk the classes in the same least-frequent-first order that
    # generate_labels_syn uses, so features stay aligned with labels_syn.
    sorted_counter = sorted(counter.items(), key=lambda x: x[1])
    for c, num in sorted_counter:
        # for c in range(data.nclass):  # old order, inconsistent with labels_syn
        tmp = data.retrieve_class(c, num=num)
        idx_selected = idx_selected + list(tmp)
    idx_selected = np.array(idx_selected).reshape(-1)
    features = features[self.data.idx_train][idx_selected]

    # adj_knn = torch.zeros((data.nclass*args.nsamples, data.nclass*args.nsamples)).to(self.device)
    # for i in range(data.nclass):
    #     idx = np.arange(i*args.nsamples, i*args.nsamples+args.nsamples)
    #     adj_knn[np.ix_(idx, idx)] = 1

    # from sklearn.metrics.pairwise import cosine_similarity
    # # features[features!=0] = 1
    # k = 2
    # sims = cosine_similarity(features.cpu().numpy())
    # sims[(np.arange(len(sims)), np.arange(len(sims)))] = 0
    # for i in range(len(sims)):
    #     indices_argsort = np.argsort(sims[i])
    #     sims[i, indices_argsort[: -k]] = 0
    # adj_knn = torch.FloatTensor(sims).to(self.device)
    return features, None
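As a quick sanity check (a sketch; it assumes the data object exposes the original training labels as data.labels_train, which may be named differently in your copy), the selected source classes should now match labels_syn position by position:

import numpy as np

# Hypothetical alignment check: run right before the `return` above.
src_classes = np.asarray(data.labels_train)[idx_selected]
labels_syn = self.labels_syn.cpu().numpy()
assert (src_classes == labels_syn).all(), "feat_syn and labels_syn misaligned"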
However, in my test on Cora, it makes no difference. I believe the learning process significantly changes the features, so the initialization is not that important. For example, some methods such as GCSNTK even use Gaussian noise as the initialization.
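For reference, that kind of random initialization is a one-liner (a sketch; the synthetic-node count and feature dimension below are illustrative, not taken from any particular config):

import torch

n_syn, d = 70, 1433                 # e.g. 70 synthetic nodes, Cora-sized features
feat_syn = torch.randn(n_syn, d)    # Gaussian-noise init, GCSNTK-style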
I agree with your viewpoint. I also found that the performance of the two versions was similar when I ran the code.
However, due to time constraints, I only used the Cora dataset when implementing this paper. Its feature matrix is very sparse, which may make the edges much more important than the node features. On datasets with richer feature information, such as ogbn-arxiv, the performance may differ. ^_^
Ultimately, though, I believe the open-source code and the paper's theory must be consistent: other papers, such as HGCond, have identified your label-based feature initialization as a very important innovation and built improvements on top of it.
Thanks again for reminding us of this! I agree that the open-source code and the paper's theory must be consistent. I'll try ogbn-arxiv later and ping you here.
Very good. As a beginner in deep learning, I am particularly glad to contribute to your great work. ^_^
To summarize with the Cora dataset as an example: the synthetic features are generated in the function get_sub_adj_feat, where the class enumeration order is [0, 1, 2, 3, 4, 5, 6], while the synthetic labels are generated in the function generate_labels_syn, where the class enumeration order is [6, 1, 5, 0, 4, 2, 3]. This causes a mismatch between feat_syn and labels_syn.
Is this a bug, or is my analysis wrong?