amazon-science / tgl

Inquiry on the inference (getting embedding on the test dataset, in a separate edges.csv file) #12

Open HankyuJang opened 2 years ago

HankyuJang commented 2 years ago

Hello,

I'm currently using your code to learn embeddings. If my understanding is correct, I need to provide a single 'edges.csv' file that contains both the training and test data, where the src node indices run from 0 to x-1 and the dst node indices run from x to x'. Here x is the number of unique src node ids and x'-x+1 is the number of unique dst node ids. Could you please confirm that my understanding is correct?

Now here's my follow-up question. Do you have an implementation that does training and inference separately? For instance, an edges.csv file that contains only training and validation data, and another file, say edges_test.csv, that contains only test data for inference? If I'd like to proceed this way, would the following be sufficient? (i) Prepare the edges.csv file with the node indices as described above, and (ii) prepare edges_test.csv such that if a node appears in edges.csv, it keeps that index; otherwise, new node indices are assigned incrementally starting from x'+1.
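To make step (ii) concrete, here is a minimal sketch of the remapping I have in mind. It is plain Python and does not use anything from TGL; the function name `extend_node_index` and the raw ids (`u1`, `i1`, ...) are hypothetical, and I'm assuming the training mapping from raw id to integer index is already available as a dict.

```python
def extend_node_index(node_index, test_edges):
    """Reuse training indices; give unseen nodes the next free integers.

    node_index: dict mapping raw node id -> integer index used in edges.csv
    test_edges: iterable of (raw_src, raw_dst) pairs destined for edges_test.csv
    """
    extended = dict(node_index)  # copy so the training mapping is untouched
    next_idx = max(extended.values(), default=-1) + 1
    remapped = []
    for src, dst in test_edges:
        for node in (src, dst):
            if node not in extended:
                extended[node] = next_idx  # new node: append after the last index
                next_idx += 1
        remapped.append((extended[src], extended[dst]))
    return extended, remapped


# Hypothetical example: users (src) were indexed first, then items (dst).
train_index = {"u1": 0, "u2": 1, "i1": 2, "i2": 3}
test_edges = [("u1", "i2"), ("u3", "i1"), ("u2", "i3")]
extended, remapped = extend_node_index(train_index, test_edges)
```

In this example the two nodes that only appear in the test edges (`u3` and `i3`) receive indices 4 and 5, while every node already seen in training keeps its original index.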

tedzhouhk commented 2 years ago

Hi, in 'edges.csv', the src and dst nodes do not need to be indexed 0 to x-1 and x to x'. If you want a separate 'edges_test.csv', you can set eval_df to the contents of the new csv file. For node and edge features, the node and edge idx must stay consistent across 'edges.csv' and 'edges_test.csv' so that they point to the same feature rows; alternatively, provide separate node_feats and edge_feats to the prepare_input function.
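As a rough illustration of what "keeping a constant idx" could look like, the sketch below continues the edge idx of the test file after the last training edge and pads a feature table so that every node idx has a row. This is not TGL code: the helper names `assign_edge_idx` and `pad_features` are hypothetical, features are plain nested lists rather than tensors, and filling unseen nodes with zeros is only an assumption about what a reasonable default might be.

```python
def assign_edge_idx(num_train_edges, num_test_edges):
    """Edge idx for the test file, continuing after the training edges."""
    return list(range(num_train_edges, num_train_edges + num_test_edges))


def pad_features(feats, num_rows):
    """Return a copy of feats with zero rows appended up to num_rows rows.

    Existing rows keep their position, so an idx used in edges.csv still
    points at the same feature row when edges_test.csv is evaluated.
    """
    if not feats:
        raise ValueError("need at least one existing row to infer the width")
    width = len(feats[0])
    missing = max(0, num_rows - len(feats))
    return [list(row) for row in feats] + [[0.0] * width for _ in range(missing)]


# Hypothetical sizes: 4 nodes and 3 edges in training, 2 new nodes and 2 edges in test.
train_node_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
all_node_feats = pad_features(train_node_feats, num_rows=6)
test_edge_idx = assign_edge_idx(num_train_edges=3, num_test_edges=2)
```

With a shared table like this, a single set of features can serve both files; otherwise separate node_feats and edge_feats would have to be passed for the test data, as described above.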

HankyuJang commented 2 years ago

Thank you so much for the quick response! I'll proceed by providing a separate edges_test.csv and following the steps you suggested.