New data formatting - GitHub Issues

serenalotreck commented 4 months ago

I'd like to use STHN on my own data. I see that I should format it like the example data; however, I'm not sure what all the columns are in the example data:

idx	src	dst	time	label	ext_roll

I assume that src and dst are the two nodes involved in the edge, and time and label are relatively straightforward. However, what is ext_roll?

Also, if my nodes and edges are all strings (words from text), can I pass them as strings, or do I need to cast them all to integers?

If my timestamps are all years, can I pass the years directly, or do I need to start them at 0?

celi52 commented 4 months ago

Hello,

Thanks for reporting.

Yes, you are right: src and dst are the two nodes involved in the edge. As for ext_roll, if you look at train.py, you will see that it is used to split the data into training, validation, and test sets, where 0 indicates the training set.

Since all of your nodes are text, I would suggest maintaining a word-to-integer mapping if you want to use the code unchanged. Otherwise, rewriting gen_graph.py is another option.

We use relative time encoding (time differences) in this study, so you can use the years directly.
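A minimal sketch of the word-to-integer mapping suggested above, using only the Python standard library. The sample edges, the `build_node_mapping` helper, and the choice to start IDs at 0 are illustrative assumptions and not part of the STHN code; check gen_graph.py for the ID range it expects.

```python
def build_node_mapping(edges):
    """Assign an integer ID to each distinct node name, in order of first appearance."""
    mapping = {}
    for src, dst, _time, _label in edges:
        for name in (src, dst):
            if name not in mapping:
                mapping[name] = len(mapping)
    return mapping


# Hypothetical edges: (src word, dst word, year, label)
edges = [
    ("drought", "desiccation", 2021, 1),
    ("drought", "tolerance", 2022, 0),
    ("desiccation", "tolerance", 2023, 1),
]

node_to_id = build_node_mapping(edges)
# Keep the inverse mapping so integer IDs can be decoded back to words later
id_to_node = {i: name for name, i in node_to_id.items()}

# Integer-encoded edges; the years are passed through unchanged
encoded = [(node_to_id[s], node_to_id[d], t, lab) for s, d, t, lab in edges]
print(encoded)
```

The same approach would apply to edge labels if they are also strings.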

serenalotreck commented 4 months ago

Thanks for the quick response! I'll probably just do a mapping rather than change the code.

I looked at the data pre-processing step (gen_graph.py) and it looks like you already have to have the ext_roll column, and that there's no code to perform the train/test split. In the paper, it says "In this study, we use chronological split for training, validation, and test sets (7/1.5/1.5)" -- is it sufficient to just sort my edgelist chronologically and make the splits?

celi52 commented 4 months ago

The train/test indicators are set in 'train.py' at lines 65 and 66, while the split action is carried out in 'link_pred_train_utils.py' at lines 20, 27, and 34.

And yes, it's important to make sure the edgelist is sorted. However, the split ratio is a variable parameter.
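A rough sketch of a chronological split that fills in the ext_roll column, in plain Python. Only "0 = training" is confirmed in this thread; using 1 for validation and 2 for test, starting idx at 0, and the `assign_ext_roll` helper itself are assumptions to verify against train.py and link_pred_train_utils.py.

```python
def assign_ext_roll(edges, train_pct=70, val_pct=15):
    """Sort edges by time and tag each with a split indicator.

    0 = training (stated in the thread); 1 = validation and 2 = test are assumed.
    Percentages are integers so the cut points are computed without float rounding.
    """
    ordered = sorted(edges, key=lambda e: e["time"])
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    rows = []
    for idx, edge in enumerate(ordered):
        if idx < train_end:
            roll = 0
        elif idx < val_end:
            roll = 1
        else:
            roll = 2
        rows.append({"idx": idx, **edge, "ext_roll": roll})
    return rows


# Hypothetical integer-encoded edges, deliberately listed newest first
edges = [{"src": i, "dst": i + 1, "time": 2023 - i, "label": 0} for i in range(20)]
rows = assign_ext_roll(edges)
print([r["ext_roll"] for r in rows])
```

Because the cut points are row counts, edges that share a timestamp (for example, the same year) can end up on both sides of a boundary; adjust the cut points if each year should stay in one set.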

serenalotreck commented 4 months ago

Awesome, thanks! I've formatted my data and am going to try running STHN.

One question I have is, are the outputs saved anywhere? I ran the code with the example movies dataset, and all I see is the printed performance values for each epoch. I'm interested in, once I train the model, using it to predict the values in the graph for one year beyond my actual dataset (i.e. test on 2023 values which I have, then predict 2024 values, which I don't have). Is that possible?

celi52 commented 4 months ago

No, the outputs were only printed. You can save the result to a file if you want.
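One way to keep the printed values is to append them to a CSV file each epoch. This is a generic standard-library sketch; the `append_metrics` helper, the metric names, and the numbers are hypothetical placeholders rather than anything defined in STHN.

```python
import csv
import os
import tempfile


def append_metrics(path, epoch, metrics):
    """Append one row of metrics for an epoch, writing a header if the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", *metrics])
        if is_new:
            writer.writeheader()
        writer.writerow({"epoch": epoch, **metrics})


# Hypothetical metric names and values, for illustration only
log_path = os.path.join(tempfile.mkdtemp(), "metrics.csv")
append_metrics(log_path, 1, {"val_auc": 0.81, "val_ap": 0.79})
append_metrics(log_path, 2, {"val_auc": 0.84, "val_ap": 0.82})

with open(log_path, newline="") as f:
    logged = list(csv.DictReader(f))
print(logged)
```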

Based on my understanding, the 2024 values can be obtained by feeding the 2023 values into the trained model.

serenalotreck commented 4 months ago

Great, thank you!

One other question -- it's not clear to me from the paper whether my network needs to be directed or undirected. I currently have an undirected network -- is that okay with the assumptions of the algorithm?

celi52 commented 4 months ago

In general, most existing graph neural networks are designed for undirected graphs. The logic is to let nodes update their representations by aggregating information from their neighbors.

For example, if the edge is undirected (A-B), then both A and B can receive information from the other end of the edge. If the edge is directed (A->B), then B can receive information from A, but in some cases B's information will be blocked from A's representation update.
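The difference can be illustrated with a small neighbor-lookup sketch. This is a generic picture of message passing, not STHN's actual neighbor sampler, and the `incoming_neighbors` helper is hypothetical.

```python
from collections import defaultdict


def incoming_neighbors(edges, directed):
    """Map each node to the set of nodes it can aggregate information from."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[dst].add(src)      # dst always receives from src
        if not directed:
            neighbors[src].add(dst)  # undirected: src also receives from dst
    return dict(neighbors)


edges = [("A", "B")]
undirected_view = incoming_neighbors(edges, directed=False)
directed_view = incoming_neighbors(edges, directed=True)
print(undirected_view)  # {'B': {'A'}, 'A': {'B'}}
print(directed_view)    # {'B': {'A'}}
```

In the directed case A has no entry at all, so nothing from B reaches A's update.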

serenalotreck commented 4 months ago

Ok perfect, thanks for the clarification!

In terms of applying the trained model on new data -- there are three pickle files in my output directory after training:

test_neg_sample_neg1_bs600_hops1_neighbors50.pickle 
train_neg_sample_neg5_bs600_hops1_neighbors50.pickle 
valid_neg_sample_neg1_bs600_hops1_neighbors50.pickle

Which one of these should I read in to make new predictions for 2024? Do you have any pointers on the practical implementation of this (i.e., what do I do once I've read in the pickle?)

Thanks!

celi52 commented 4 months ago

Those three are pre-processed files. If you want the predictions for 2024, consider saving pred at line 106 of link_pred_train_utils.py during testing:

loss, pred, edge_label = model(inputs, neg_samples, subgraph_node_feats)

serenalotreck commented 4 months ago

Thanks! I ended up saving the model weights at the end of training so that I could interactively make predictions in a Jupyter notebook. My data only goes to 2023, so I don't think I could get the 2024 predictions by saving the predictions for the test set. Thus far, I've been able to read in the model with the following code:

import torch  # STHN_Interface must also be imported from the STHN repository code

## Printed these from when I trained the model
edge_predictor_configs = {'dim_in_time': 100,
                          'dim_in_node': 0,
                          'predict_class': 4}
mixer_configs =  {'per_graph_size': 50,
                  'time_channels': 100,
                  'input_channels': 3,
                  'hidden_channels': 100,
                  'out_channels': 100,
                  'num_layers': 1,
                  'dropout': 0.1,
                  'channel_expansion_factor': 2,
                  'window_size': 5,
                  'use_single_layer': False}

loaded_state_dict = torch.load('../../STHN/DATA/drought_desiccation/trained_dt_model')
sthn_model = STHN_Interface(mixer_configs, edge_predictor_configs)
current_model_dict = sthn_model.state_dict()
## Had to do this to avoid a size mismatch on edge_predictor.out_fc.weight and edge_predictor.out_fc.bias
## Note: keys and values are paired by position, and any tensor whose size does not match
## keeps the freshly initialized (untrained) values from the new model
new_state_dict = {
    k: v if v.size() == current_model_dict[k].size() else current_model_dict[k]
    for k, v in zip(current_model_dict.keys(), loaded_state_dict.values())
}
sthn_model.load_state_dict(new_state_dict, strict=False)

Now, I am trying to figure out how to make predictions. I'm looking at line 106 like you suggested, but I'm not sure what inputs, neg_samples, and subgraph_node_feats should be.

What I would like to do is provide the whole graph, and then get a list of predictions for the next year (2024). Do you have thoughts on how I would do this?

serenalotreck commented 4 months ago

Also, I took a look at the pred and edge_label outputs from line 106, and they're both tensors. I'll look through the code, but it would save me a lot of time if you could tell me how to get from those tensors to a human-readable node pair and edge label.
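A possible decoding step, sketched under several assumptions that this thread does not confirm: that pred holds one row of class scores per edge (for example after `pred.tolist()`), that those rows line up with a known list of integer (src, dst) pairs, and that an ID-to-word mapping was kept from preprocessing. The `decode_predictions` helper, the label names, and all values below are hypothetical. How STHN orders positive and negative samples within pred should be checked in link_pred_train_utils.py first.

```python
def decode_predictions(pairs, scores, id_to_node, id_to_label):
    """Turn per-edge class scores into (src name, dst name, predicted label) tuples."""
    decoded = []
    for (src, dst), row in zip(pairs, scores):
        best = max(range(len(row)), key=row.__getitem__)  # index of the highest score
        decoded.append((id_to_node[src], id_to_node[dst], id_to_label[best]))
    return decoded


# Hypothetical values standing in for one batch of model output
id_to_node = {0: "drought", 1: "desiccation", 2: "tolerance"}
id_to_label = {0: "none", 1: "increases", 2: "decreases", 3: "related"}
pairs = [(0, 1), (1, 2)]           # integer node IDs of each scored edge
scores = [
    [0.1, 2.3, -0.5, 0.0],         # one row of class scores per edge
    [1.7, 0.2, 0.1, -1.0],
]

readable = decode_predictions(pairs, scores, id_to_node, id_to_label)
print(readable)
```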

celi52 / STHN

New data formatting #1