KeepMovingXX / DyGIS


Several important problems #1

Open trysohard-git opened 1 week ago

trysohard-git commented 1 week ago

Sorry, I reviewed the code carefully, and I cannot reproduce the results presented in the paper. Additionally, is it reasonable to use test set data during the informative subgraph acquisition process? Moreover, consider the following code:

```python
if prediction == True:
    z_hidden_t, _ = self.decode(prior_mean_t, edge_idx_list[t])
    all_test_rec_h.append(z_hidden_t)
```

Using all_test_rec_h for the link prediction task means we already know the connection status at the current time. Is this link prediction or a link detection task? Since you use it for the prediction task, I think this setting makes your reported results higher than the real performance.

KeepMovingXX commented 1 week ago

Hi, thank you for your attention.

  1. We have reviewed the code and updated the settings for link prediction, adding a regularization term to the loss to maintain temporal dependence. The reported results can now be reproduced.

  2. Our subgraph learning process can be seen as a masking strategy. For convenience, we input the entire dynamic graph into this masking strategy to obtain the informative subgraph. In the final testing phase, we use the entire dynamic graph to obtain node embeddings without using information from the learned subgraph of the test set. This does not significantly affect or improve the model's performance.

  3. This setup follows VGRNN. For link prediction, we can only use the hidden state obtained from the previous time step as the node embedding, since we need to predict the status at time t using information from time t−1. For link detection, we can use the currently observed data to obtain node embeddings.
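For clarity, here is a minimal sketch of the distinction being described, assuming simple inner-product scoring; the function names are illustrative, not the repository's actual API:

```python
import torch

def inner_product_scores(z, edge_index):
    # Score each candidate edge (i, j) by sigmoid(z_i . z_j).
    src, dst = edge_index
    return torch.sigmoid((z[src] * z[dst]).sum(dim=-1))

def link_prediction_scores(prior, h_prev, candidate_edges_t):
    # Link prediction: the embedding for time t comes only from the
    # hidden state carried over from time t-1; the snapshot at time t
    # is never observed when forming z_t.
    z_t = prior(h_prev)
    return inner_product_scores(z_t, candidate_edges_t)

def link_detection_scores(encoder, snapshot_edges_t, h_prev, candidate_edges_t):
    # Link detection: the current snapshot is observed, so the encoder
    # may use the edges at time t to build the embedding.
    z_t = encoder(snapshot_edges_t, h_prev)
    return inner_product_scores(z_t, candidate_edges_t)
```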

trysohard-git commented 5 days ago

Hi, thank you for your reply.

  1. I’m sorry, but I am unable to reproduce the results in your paper at this time.
  2. Yes, I have run the code, and you are correct about that.
  3. You need to take a closer look at your code. When performing link prediction, are you using edge information from the current time step? I find it curious that the link prediction results on some datasets in your paper are higher than those for link detection. Doesn't that raise any concerns? In your prediction file, pri_means is passed as a parameter to get_roc_score, but in your model DGMAE it is taken from all_test_rec_h. You should be using all_enc_mean as the representation matrix for predictions; please refer to the VGRNN model code for further details. Additionally, in your code snippet:

```python
if prediction == True:
    z_hidden_t, _ = self.decode(prior_mean_t, edge_idx_list[t])
    all_test_rec_h.append(z_hidden_t)
```

you are using edge_idx_list[t], which should not be used in the prediction task.
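For reference, a minimal stand-in for the GAE/VGRNN-style get_roc_score I am referring to (the exact signature in your repository may differ; this is only a sketch):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def get_roc_score(edges_pos, edges_neg, emb):
    # GAE/VGRNN-style evaluation: reconstruct the adjacency matrix as
    # sigmoid(emb @ emb.T), then score positive and negative edge samples.
    adj_rec = 1.0 / (1.0 + np.exp(-np.dot(emb, emb.T)))
    preds = np.array([adj_rec[i, j] for i, j in edges_pos])
    preds_neg = np.array([adj_rec[i, j] for i, j in edges_neg])
    labels = np.concatenate([np.ones(len(preds)), np.zeros(len(preds_neg))])
    scores = np.concatenate([preds, preds_neg])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```

The question above is about which matrix is passed as emb: the pre-decoding embeddings (all_enc_mean) or states produced by a decoder that has already seen the edges at time t (all_test_rec_h).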
KeepMovingXX commented 4 days ago

Hi,

  1. Our difference from VGRNN lies in the decoder: VGRNN uses an inner product as the decoder, while we use a GCN plus an inner product. For convenience, we perform the GCN step first in DGMAE (it is actually the first step of the decoder) and then feed the result into the inner product calculation to get the final scores; see the sketch after the run logs below. During encoding, the model uses past hidden states to obtain the embedding, and that embedding is then input into the GNN-based decoder (GCN plus inner product) to get the final result. That is to say, the model does not observe information from the current time step when extracting node embeddings (which is consistent with VGRNN); it is only because the decoder contains a GNN that the current edges are needed during the decoding step. Additionally, many methods perform better on prediction tasks than on detection tasks, so it is not a specific setup of ours that leads to the improvement. For instance, on the DBLP dataset, most DyGNN methods tend to perform better on prediction than on detection.

  2. For the DBLP dataset, I ran it and obtained the following results:

```
2024-10-18 20:20:54,499 INFO Namespace(tune_description='', dataset='dblp', epochs=100, epochs_task=1000, lr=0.02, lr_task=0.005, weight_decay=0.0001, weight_decay_task=0, tau=0.7, therold=0.1, h_dim=128, z_dim=32, trade_weight=0.5, u=False, run_times=5, conv_type='GCN', decode_type='GCN', device='cpu', lea_feature=False, lea_feature_dim=512, model_name='DyGIS', n_layers=1, eps=1e-10, clip=10, seq_start=0, spilt_len=3, test_after=50, dropout=0.5, patience=150, task='link_prediction')
2024-10-18 20:23:17,450 INFO Total training time... 142.95s
-----Link prediction------
prediction metrics: ROC-AUCs, AP-AUCs: (0.956790216641869, 0.9539986770141186)
new prediction metrics: ROC-AUCs, AP-AUCs: (0.9387613338627655, 0.9292285144409925)

2024-10-18 20:24:19,447 INFO Namespace(tune_description='', dataset='dblp', epochs=100, epochs_task=1000, lr=0.02, lr_task=0.005, weight_decay=0.0001, weight_decay_task=0, tau=0.7, therold=0.1, h_dim=128, z_dim=32, trade_weight=0.5, u=False, run_times=5, conv_type='GCN', decode_type='GCN', device='cpu', lea_feature=False, lea_feature_dim=512, model_name='DyGIS', n_layers=1, eps=1e-10, clip=10, seq_start=0, spilt_len=3, test_after=50, dropout=0.5, patience=150, task='link_prediction')
2024-10-18 20:27:25,741 INFO Total training time... 186.29s
-----Link prediction------
prediction metrics: ROC-AUCs, AP-AUCs: (0.9618380036082166, 0.9591142345997339)
new prediction metrics: ROC-AUCs, AP-AUCs: (0.9339958179115414, 0.9263396552997)

2024-10-18 20:28:02,342 INFO Namespace(tune_description='', dataset='dblp', epochs=100, epochs_task=1000, lr=0.02, lr_task=0.005, weight_decay=0.0001, weight_decay_task=0, tau=0.7, therold=0.1, h_dim=128, z_dim=32, trade_weight=0.5, u=False, run_times=5, conv_type='GCN', decode_type='GCN', device='cpu', lea_feature=False, lea_feature_dim=512, model_name='DyGIS', n_layers=1, eps=1e-10, clip=10, seq_start=0, spilt_len=3, test_after=50, dropout=0.5, patience=150, task='link_prediction')
2024-10-18 20:30:10,072 INFO Total training time... 127.73s
-----Link prediction------
prediction metrics: ROC-AUCs, AP-AUCs: (0.9541155106553392, 0.9494399180759944)
new prediction metrics: ROC-AUCs, AP-AUCs: (0.938575427320251, 0.932650473924539)

2024-10-18 20:30:48,136 INFO Namespace(tune_description='', dataset='dblp', epochs=100, epochs_task=1000, lr=0.02, lr_task=0.005, weight_decay=0.0001, weight_decay_task=0, tau=0.7, therold=0.1, h_dim=128, z_dim=32, trade_weight=0.5, u=False, run_times=5, conv_type='GCN', decode_type='GCN', device='cpu', lea_feature=False, lea_feature_dim=512, model_name='DyGIS', n_layers=1, eps=1e-10, clip=10, seq_start=0, spilt_len=3, test_after=50, dropout=0.5, patience=150, task='link_prediction')
2024-10-18 20:32:24,511 INFO Total training time... 96.38s
-----Link prediction------
prediction metrics: ROC-AUCs, AP-AUCs: (0.9493456082035473, 0.9491184307497663)
new prediction metrics: ROC-AUCs, AP-AUCs: (0.9251254665007161, 0.917366458941857)

2024-10-18 20:33:02,119 INFO Namespace(tune_description='', dataset='dblp', epochs=100, epochs_task=1000, lr=0.02, lr_task=0.005, weight_decay=0.0001, weight_decay_task=0, tau=0.7, therold=0.1, h_dim=128, z_dim=32, trade_weight=0.5, u=False, run_times=5, conv_type='GCN', decode_type='GCN', device='cpu', lea_feature=False, lea_feature_dim=512, model_name='DyGIS', n_layers=1, eps=1e-10, clip=10, seq_start=0, spilt_len=3, test_after=50, dropout=0.5, patience=150, task='link_prediction')
2024-10-18 20:34:42,760 INFO Total training time... 100.64s
-----Link prediction------
prediction metrics: ROC-AUCs, AP-AUCs: (0.9639436292298403, 0.9617095855789165)
new prediction metrics: ROC-AUCs, AP-AUCs: (0.9401165029970663, 0.929883340767638)

-----Run for Link prediction times: 5
prediction metrics muitiple times: ROC-AUCs, AP-AUCs: (0.9572065936677625, 0.0052604346338914814, 0.9546761692037059, 0.0050582832785108535)
new prediction metrics muitiple times: ROC-AUCs, AP-AUCs: (0.935314909718468, 0.005497797171775329, 0.9270936886749453, 0.005261274879171547)
```
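To make the decoder difference concrete, here is a minimal sketch of the two variants, assuming a PyG-style GCN layer; class names are illustrative, not the repository's actual modules:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class InnerProductDecoder(nn.Module):
    # VGRNN-style decoder: edge probability from the dot product of the
    # two endpoint embeddings. No edges are consumed to form the scores.
    def forward(self, z, edge_index):
        src, dst = edge_index
        return torch.sigmoid((z[src] * z[dst]).sum(dim=-1))

class GNNInnerProductDecoder(nn.Module):
    # The variant described above: one GCN pass, then an inner product.
    # The GCN pass consumes edge_index, i.e. the edges of the snapshot
    # being decoded -- the step this thread is debating.
    def __init__(self, dim):
        super().__init__()
        self.gcn = GCNConv(dim, dim)
        self.inner = InnerProductDecoder()

    def forward(self, z, edge_index):
        z = self.gcn(z, edge_index)
        return self.inner(z, edge_index)
```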

trysohard-git commented 4 days ago
  1. As you mentioned, the embedding is subsequently fed into the GNN-based decoder, which uses GCN and inner-product mechanisms to yield the final outcome. However, the following code snippet is what extracts the final embedding:

```python
if prediction == True:
    z_hidden_t, _ = self.decode(prior_mean_t, edge_idx_list[t])
    all_test_rec_h.append(z_hidden_t)
```

Is edge_idx_list[t] intended to represent the edge state at the current time step?

  2. I noticed that link detection (LD) performance is worse than link prediction (LP) performance. Why is that? It is abnormal, but I could not find a detailed analysis explaining the difference.

KeepMovingXX commented 4 days ago

  1. In this code, prior_mean_t represents the node embedding, so we input prior_mean_t into self.decode for decoding. In fact, z_hidden_t represents the state after decoding by the GNN, not the node embedding. I did not make this clear in the code because of my confusing variable names, and I am sorry for the misunderstanding this caused.

  2. In my view, this is partly related to the characteristics of the dataset. For example, it did not occur on the Enron dataset, but it did on DBLP, FB, and Email. Specifically, if the current snapshot contains many redundant edges, using those edges to obtain embeddings may be less effective than using embeddings obtained from the hidden state. In fact, VGRNN also exhibits this behavior: in their paper, VGRNN's link prediction outperforms link detection on the Facebook dataset. This is not an abnormal occurrence.
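Under that naming, the disputed snippet reads roughly as follows (renamed for clarity; a sketch of the intent rather than the repository's exact code):

```python
if prediction == True:
    # prior_mean_t is the node embedding, built from the hidden state of
    # time t-1 without observing the snapshot at time t.
    node_embedding_t = prior_mean_t
    # The GNN half of the decoder then runs message passing over the
    # edges of time t; its output is a decoded state, not an embedding.
    decoded_state_t, _ = self.decode(node_embedding_t, edge_idx_list[t])
    all_test_rec_h.append(decoded_state_t)
```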

trysohard-git commented 4 days ago
  1. You should look at other models: do other models require an additional decoder after obtaining embeddings? It would not matter if you added one, but you also feed the current time step's edge information into the decoder to predict the edges at that same time step. This is a key reason why your model performs significantly better than others.

  2. Do you really think your explanation makes sense? Isn't it possible that it is precisely the inclusion of current-time edge information that makes link prediction outperform link detection? If new link prediction is still higher than link detection and you attribute that to dataset characteristics, please double-check your so-called decoder for issues.

  3. Link prediction uses embeddings from the previous time step to predict the connection status of the next time step. In this process, you cannot know the status information of the next time step. For example, with test time slices [8, 9, 10], you use the embedding from t=7 and self.prior to predict at t=8. In your code, after obtaining the representation at t=7, you add a decoder but feed it edge information from t=8. Isn't that the case?
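In loop form, the setting I am describing looks like this (encoder, prior, decoder, and score are stand-ins, not your exact functions); the marked line is where edges from the slice being predicted enter the computation:

```python
def evaluate_link_prediction(snapshots, test_times, encoder, prior, decoder, score):
    # snapshots[t] holds the edge_index of slice t; test_times is e.g. {8, 9, 10}.
    h = None
    for t, edges_t in enumerate(snapshots):
        if t in test_times:
            z_t = prior(h)                    # built from history < t only: fine
            preds = decoder(z_t, edges_t)     # <-- edges of slice t are used here
            score(preds, edges_t)
        h = encoder(edges_t, h)               # roll the hidden state forward
```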