ChandlerBang / GCond

[ICLR'22] [KDD'22] [IJCAI'24] Implementation of "Graph Condensation for Graph Neural Networks"
https://www.cs.emory.edu/~wjin30/files/GCond.pdf

The Mismatch Problem in Synthetic Features and Labels #17

[Open] wyh-Dreamer opened this issue 3 months ago

wyh-Dreamer commented 3 months ago

Take the Cora dataset as an example. The synthetic features are generated in the function "get_sub_adj_feat", where the classes are enumerated in the order [0, 1, 2, 3, 4, 5, 6]. The synthetic labels, however, are generated in the function "generate_labels_syn", where the classes are enumerated in the order [6, 1, 5, 0, 4, 2, 3]. This causes a mismatch between "feat_syn" and "labels_syn".

Is this a bug, or is there an error in my analysis?

rockcor commented 2 months ago

The code uses self.class_dict and data.retrieve_class to match the labels and node indices; it has nothing to do with the enumeration order.

wyh-Dreamer commented 2 months ago

> The code uses self.class_dict and data.retrieve_class to match the labels and node indices; it has nothing to do with the enumeration order.

Thank you for your reply! I understand your explanation: self.class_dict and data.retrieve_class randomly sample features from the large real graph according to the labels. Is that right?
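
Just to make sure I understand, here is a rough paraphrase of what I think data.retrieve_class does (my own sketch with toy labels, not the verbatim code from the repo):

    import numpy as np

    # Rough paraphrase (not verbatim) of data.retrieve_class: randomly
    # sample `num` training-node indices that belong to class c.
    def retrieve_class(labels_train, c, num=256):
        idx = np.arange(len(labels_train))
        idx = idx[labels_train == c]          # indices of class-c training nodes
        return np.random.permutation(idx)[:num]

    labels_train = np.array([0, 1, 1, 2, 0, 1, 2, 2, 2])  # toy labels
    print(retrieve_class(labels_train, c=2, num=2))       # e.g. [8 3]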

However, what I want to ask about is the feature-initialization process on the synthetic graph, which uses label information. The function generate_labels_syn generates the synthetic labels by enumerating sorted_counter, but the function get_sub_adj_feat generates the synthetic features by enumerating range(data.nclass). Isn't there an index mismatch between the initial features and the initial labels here?

I'm sorry, but your answer doesn't seem to solve this problem. T_T

wyh-Dreamer commented 2 months ago

> The code uses self.class_dict and data.retrieve_class to match the labels and node indices; it has nothing to do with the enumeration order.

Take the Cora dataset as an example. Enumerating range(data.nclass) yields the order [0, 1, 2, 3, 4, 5, 6], while enumerating sorted_counter yields [6, 1, 5, 0, 4, 2, 3]. As a result, feat_syn[0] is a feature sampled from class 0, but its corresponding label on the synthetic graph is class 6. Is that right?
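
Here is a minimal sketch of the two enumeration orders (with hypothetical class counts that I made up to reproduce the [6, 1, 5, 0, 4, 2, 3] order, not Cora's real statistics):

    from collections import Counter
    import numpy as np

    # Hypothetical per-class training counts (NOT Cora's real numbers),
    # chosen so that sorting by frequency reorders the classes.
    labels_train = np.repeat(np.arange(7), [40, 15, 60, 80, 55, 30, 10])

    # generate_labels_syn-style enumeration: ascending class frequency.
    counter = Counter(labels_train.tolist())
    sorted_counter = sorted(counter.items(), key=lambda x: x[1])
    print([c for c, _ in sorted_counter])  # [6, 1, 5, 0, 4, 2, 3]

    # get_sub_adj_feat-style enumeration (original code): class index order.
    print(list(range(7)))                  # [0, 1, 2, 3, 4, 5, 6]

    # So the first block of feat_syn is sampled from class 0, while the
    # first block of labels_syn is class 6.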

rockcor commented 2 months ago

> > The code uses self.class_dict and data.retrieve_class to match the labels and node indices; it has nothing to do with the enumeration order.

> Take the Cora dataset as an example. Enumerating range(data.nclass) yields the order [0, 1, 2, 3, 4, 5, 6], while enumerating sorted_counter yields [6, 1, 5, 0, 4, 2, 3]. As a result, feat_syn[0] is a feature sampled from class 0, but its corresponding label on the synthetic graph is class 6. Is that right?

Sorry for the late response, and thank you for pointing this out! I agree with your observation; the two enumeration orders are inconsistent. It would be better to change the get_sub_adj_feat function to:

    def get_sub_adj_feat(self, features):
        data = self.data
        args = self.args
        idx_selected = []

        from collections import Counter  # np (numpy) is imported at module level

        # Enumerate classes in the SAME frequency-sorted order that
        # generate_labels_syn uses, so the i-th block of sampled features
        # corresponds to the i-th block of labels_syn.
        counter = Counter(self.labels_syn.cpu().numpy())
        sorted_counter = sorted(counter.items(), key=lambda x: x[1])
        for c, num in sorted_counter:
            # previously: for c in range(data.nclass)
            tmp = data.retrieve_class(c, num=num)
            idx_selected = idx_selected + list(tmp)
        idx_selected = np.array(idx_selected).reshape(-1)
        features = features[self.data.idx_train][idx_selected]

        # The kNN-based adj_knn construction below was already commented out
        # in the original function and is left unchanged:
        # adj_knn = torch.zeros((data.nclass*args.nsamples, data.nclass*args.nsamples)).to(self.device)
        # for i in range(data.nclass):
        #     idx = np.arange(i*args.nsamples, i*args.nsamples+args.nsamples)
        #     adj_knn[np.ix_(idx, idx)] = 1

        # from sklearn.metrics.pairwise import cosine_similarity
        # # features[features!=0] = 1
        # k = 2
        # sims = cosine_similarity(features.cpu().numpy())
        # sims[(np.arange(len(sims)), np.arange(len(sims)))] = 0
        # for i in range(len(sims)):
        #     indices_argsort = np.argsort(sims[i])
        #     sims[i, indices_argsort[: -k]] = 0
        # adj_knn = torch.FloatTensor(sims).to(self.device)
        return features, None

However, in my test on Cora, it makes no difference. I believe the learning process changes the features substantially, so the initialization is not that important. For example, some methods such as GCSNTK even use Gaussian noise as the initialization.
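
For reference, a minimal sketch of that kind of random initialization (hypothetical sizes, not the actual GCSNTK code):

    import torch

    # Hypothetical shapes: n_syn synthetic nodes with d-dimensional features.
    n_syn, d = 70, 1433
    # Gaussian-noise initialization; the features are then learned during
    # condensation rather than relying on the initial values.
    feat_syn = torch.nn.Parameter(torch.randn(n_syn, d))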

wyh-Dreamer commented 2 months ago

> However, in my test on Cora, it makes no difference. I believe the learning process changes the features substantially, so the initialization is not that important. For example, some methods such as GCSNTK even use Gaussian noise as the initialization.

I agree with your viewpoint. I also found that the two versions of the code gave similar performance when I ran them.

However, due to time constraints, I only used the Cora dataset when reproducing this paper. Its feature matrix is extremely sparse, which may make the edges far more important than the node features. On datasets with richer feature information, such as ogbn-arxiv, the results may differ. ^_^

Ultimately, though, I believe that open-source code and the paper's theory must be consistent: other papers, such as HGCond, have identified your label-informed feature initialization as an important innovation and have built improvements on top of it.

rockcor commented 2 months ago

Thanks again for reminding us of this! I agree that the open-source code and the paper's theory must be consistent. I'll try ogbn-arxiv later and ping you here.

wyh-Dreamer commented 2 months ago

> Thanks again for reminding us of this! I agree that the open-source code and the paper's theory must be consistent. I'll try ogbn-arxiv later and ping you here.

Very good. As a beginner in deep learning, I am especially glad to contribute to your great work. ^_^