Open serenalotreck opened 4 months ago
Hello,
Thanks for reporting.
Yes, you are right: src and dst are the two nodes involved in the edge. As for ext_roll, if you look at train.py you will see that it is used to split the data into training, validation, and test sets, where 0 indicates the training set.
Since your nodes are all text, I would suggest maintaining a word-to-integer mapping if you want to use the code without changes. Otherwise, rewriting gen_graph.py is another option.
We use relative time encoding (time differences) in this study, so you can use the years directly.
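For illustration, the word-to-integer mapping suggested above could be sketched like this. The column names follow the example dataset's format; the node strings and values here are invented for the sketch:

```python
import pandas as pd

# Hypothetical edgelist with string node names (columns follow the
# example dataset: src, dst, time, label).
edges = pd.DataFrame({
    "src": ["drought", "stress", "drought"],
    "dst": ["stress", "desiccation", "desiccation"],
    "time": [2021, 2022, 2023],
    "label": [1, 1, 1],
})

# Build a word-to-integer mapping over all node strings.
nodes = pd.unique(edges[["src", "dst"]].values.ravel())
node2id = {name: i for i, name in enumerate(nodes)}

# Replace strings with integer ids; keep node2id around so predictions
# can be decoded back to words later.
edges["src"] = edges["src"].map(node2id)
edges["dst"] = edges["dst"].map(node2id)
```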
Thanks for the quick response! I'll probably just do a mapping rather than change the code.
I looked at the data pre-processing step (gen_graph.py) and it looks like you already have to have the ext_roll column, and that there's no code to perform the train/test split. In the paper, it says "In this study, we use chronological split for training, validation, and test sets (7/1.5/1.5)" -- is it sufficient to just sort my edgelist chronologically and make the splits?
The train/test indicators are set in 'train.py' at lines 65 and 66, while the split action is carried out in 'link_pred_train_utils.py' at lines 20, 27, and 34.
And yes, it's important to make sure the edgelist is sorted. However, the split ratio is a configurable parameter.
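A chronological split with an ext_roll column could be sketched as follows. This is a rough illustration on toy data, not the repository's own split code; 0 marks the training set as stated above, and I am assuming 1 and 2 mark validation and test:

```python
import pandas as pd

# Hypothetical edgelist, already converted to integer node ids.
edges = pd.DataFrame({
    "src": [0, 1, 0, 2, 1, 3, 2, 0, 3, 1],
    "dst": [1, 2, 2, 3, 3, 0, 1, 3, 2, 0],
    "time": [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023],
    "label": [1] * 10,
})

# Sort chronologically, then assign ext_roll: 0 = train, 1 = valid,
# 2 = test.  Integer arithmetic approximates the paper's 7/1.5/1.5.
edges = edges.sort_values("time").reset_index(drop=True)
n = len(edges)
train_end = n * 7 // 10     # first 70% -> train
valid_end = n * 85 // 100   # next 15%  -> validation

edges["ext_roll"] = 0
edges.loc[train_end:valid_end - 1, "ext_roll"] = 1  # .loc is inclusive
edges.loc[valid_end:, "ext_roll"] = 2
```

With only 10 toy rows the ratio can't be hit exactly; on a real edgelist the boundaries land much closer to 7/1.5/1.5.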
Awesome, thanks! I've formatted my data and am going to try running STHN.
One question I have is, are the outputs saved anywhere? I ran the code with the example movies dataset, and all I see is the printed performance values for each epoch. I'm interested in, once I train the model, using it to predict the values in the graph for one year beyond my actual dataset (i.e. test on 2023 values which I have, then predict 2024 values, which I don't have). Is that possible?
No, the outputs are only printed. You can save the results to a file if you want.
Based on my understanding, the 2024 values can be obtained by feeding the 2023 values into the trained model.
Great thank you!
One other question -- it's not clear to me from the paper whether my network needs to be directed or undirected. I currently have an undirected network -- is that okay with the assumptions of the algorithm?
In general, most existing graph neural networks are designed for undirected graphs. The underlying logic is to let nodes update their representations by aggregating information from their neighbors.
For example, if the edge is undirected (A-B), then both A and B receive information from the other end of the edge. If the edge is directed (A->B), then B can receive information from A, but in some cases B's information is blocked from A's representation update.
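As a side note, if someone has a directed edgelist and wants the undirected behavior described above, a common trick is to add the reverse of every edge. A minimal pandas sketch on toy data (not part of STHN):

```python
import pandas as pd

# Hypothetical directed edgelist: 0 -> 1 and 1 -> 2.
edges = pd.DataFrame({"src": [0, 1], "dst": [1, 2], "time": [2022, 2023]})

# Swap src and dst to get the reverse edges, then stack both copies so
# every pair appears in both directions (A-B becomes A->B and B->A).
reverse = edges.rename(columns={"src": "dst", "dst": "src"})
undirected = pd.concat([edges, reverse], ignore_index=True)
```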
Ok perfect, thanks for the clarification!
In terms of applying the trained model on new data -- there are three pickle files in my output directory after training:
test_neg_sample_neg1_bs600_hops1_neighbors50.pickle
train_neg_sample_neg5_bs600_hops1_neighbors50.pickle
valid_neg_sample_neg1_bs600_hops1_neighbors50.pickle
Which one of these should I read in to make new predictions for 2024? Do you have any pointers on the practical implementation of this (i.e., what do I do once I've read in the pickle?)
Thanks!
Those three are pre-processed files. If you want to get predictions for 2024, you may consider saving pred at line 106 of link_pred_train_utils.py during testing:
loss, pred, edge_label = model(inputs, neg_samples, subgraph_node_feats)
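One way to save those outputs might look like the sketch below. The tensors here are illustrative stand-ins for the ones returned at line 106, and the filename is invented:

```python
import torch

# Stand-ins for the tensors returned at line 106 of
# link_pred_train_utils.py; shapes and values are illustrative only.
pred = torch.tensor([[0.9, 0.1], [0.2, 0.8]])
edge_label = torch.tensor([0, 1])

# Detach from the autograd graph and move to CPU before saving.
torch.save({"pred": pred.detach().cpu(),
            "edge_label": edge_label.detach().cpu()},
           "test_predictions.pt")

# Later, reload the tensors for analysis.
saved = torch.load("test_predictions.pt")
```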
Thanks! I ended up saving the model weights at the end of training so that I could interactively make predictions in a Jupyter notebook. My data only goes to 2023, so I don't think I could get the 2024 predictions by saving the predictions for the test set. Thus far, I've been able to read in the model with the following code:
## Printed these from when I trained the model
edge_predictor_configs = {'dim_in_time': 100,
                          'dim_in_node': 0,
                          'predict_class': 4}
mixer_configs = {'per_graph_size': 50,
                 'time_channels': 100,
                 'input_channels': 3,
                 'hidden_channels': 100,
                 'out_channels': 100,
                 'num_layers': 1,
                 'dropout': 0.1,
                 'channel_expansion_factor': 2,
                 'window_size': 5,
                 'use_single_layer': False}
loaded_state_dict = torch.load('../../STHN/DATA/drought_desiccation/trained_dt_model')
sthn_model = STHN_Interface(mixer_configs, edge_predictor_configs)
current_model_dict = sthn_model.state_dict()
## Had to do this to avoid a size mismatch on edge_predictor.out_fc.weight and edge_predictor.out_fc.bias
new_state_dict = {k: v if v.size() == current_model_dict[k].size() else current_model_dict[k]
                  for k, v in zip(current_model_dict.keys(), loaded_state_dict.values())}
sthn_model.load_state_dict(new_state_dict, strict=False)
Now, I am trying to figure out how to make predictions. I'm looking at line 106 like you suggested, but I'm not sure what inputs, neg_samples, and subgraph_node_feats should be.
What I would like to do is provide the whole graph, and then get a list of predictions for the next year (2024). Do you have thoughts on how I would do this?
Also, I took a look at the pred and edge_label outputs from line 106, and they're both tensors. I'll look through the code, but it would save me a lot of time if you could let me know: how do I get to a human-readable node pair and edge label?
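Decoding the tensors back to words requires keeping the integer-to-word mapping from pre-processing alongside the src/dst ids of the batch. A hypothetical sketch, with all names, ids, and values invented for illustration:

```python
import torch

# Hypothetical inverse of the word-to-integer mapping built during
# pre-processing, plus the src/dst ids for one batch of test edges.
id2node = {0: "drought", 1: "stress", 2: "desiccation"}
batch_src = [0, 1]
batch_dst = [1, 2]

# Stand-ins for the pred / edge_label tensors from line 106.
# pred holds per-class scores; argmax over dim=1 gives the predicted
# label id for each edge.
pred = torch.tensor([[0.9, 0.1], [0.2, 0.8]])
edge_label = torch.tensor([0, 1])

predicted = pred.argmax(dim=1).tolist()
readable = [(id2node[s], id2node[d], p, t)
            for s, d, p, t in zip(batch_src, batch_dst,
                                  predicted, edge_label.tolist())]
# Each tuple: (source word, target word, predicted label, true label)
```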
I'd like to use STHN on my own data. I see that I should format it like the example data; however, I'm not sure what all the columns are in the example data:
I assume that src and dst are the two nodes involved in the edge, and time and label are relatively straightforward. However, what is ext_roll?
Also, if my nodes and edges are all strings (words from text), can I pass them as strings, or do I need to cast them all to integers?
If my timestamps are all years, can I pass the years directly, or do I need to start them at 0?