Hi, you can find the code that generates the SBERT embeddings in the `get_sbert_embedding()` function in `utils/data_process.py`.
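For reference, here is a minimal sketch of how one might generate SBERT node embeddings and save them as an `sbert_x.pt` file. The model name, batch size, and output path are assumptions for illustration; check `get_sbert_embedding()` in `utils/data_process.py` for the exact model and preprocessing the repo actually uses.

```python
import torch
from sentence_transformers import SentenceTransformer  # assumed dependency

def build_sbert_embeddings(raw_texts, out_path):
    # Model choice is an assumption; use whatever get_sbert_embedding() loads.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    emb = model.encode(
        raw_texts,                 # one text string per node
        batch_size=256,
        show_progress_bar=True,
        convert_to_numpy=True,
    )
    x = torch.from_numpy(emb).float()   # shape: [num_nodes, hidden_dim]
    torch.save(x, out_path)             # e.g. the dataset's sbert_x.pt
    return x
```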
Thank you for your reply, but what is the input text template for the embedding? I think this is different from the paradigm used for Q&A. I have tried a few templates, but they don't work well.
Hi, may I ask how one would obtain the `.jsonl` files, as well as the `processed_data.pt` file, for a new dataset? Thank you for your time.
Hi, due to variations in the raw data formats across different datasets, we don't have a single unified function for generating `processed_data.pt`. To create `processed_data.pt` for a new dataset, you only need to build a `Data` instance in PyG format, making sure that `edge_index` is included in the instance. Also make sure `data.label_texts` contains all label names, and `data.raw_texts` contains the node text features if you want to train on the node description task. In general, we follow the guidelines from this repo to generate the `edge_index`.
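A minimal sketch of assembling such a `Data` instance is shown below. Only `edge_index`, `label_texts`, and `raw_texts` are named above; the `y` field and the exact save location are assumptions, so adapt them to how the repo loads `processed_data.pt`.

```python
import torch
from torch_geometric.data import Data

def build_processed_data(edge_index, raw_texts, labels, label_names, out_path):
    # edge_index: LongTensor of shape [2, num_edges] in PyG convention.
    data = Data(edge_index=edge_index)
    data.num_nodes = len(raw_texts)
    # Integer class labels per node (assumed field name for node classification).
    data.y = torch.as_tensor(labels, dtype=torch.long)
    data.label_texts = label_names   # all label names as strings
    data.raw_texts = raw_texts       # per-node text, needed for the node description task
    torch.save(data, out_path)       # saved as processed_data.pt for the new dataset
    return data
```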
To create the `*.jsonl` files, the main task is to generate the node sequence that represents the structure surrounding each node, following the template used in the existing `*.jsonl` files. You can use our `get_fix_shape_subgraph_sequence_fast` function in `utils/data_process.py` to generate the node sequence; a rough sketch of the surrounding loop is given after this paragraph.
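The sketch below only illustrates the overall loop for writing one split. The record fields (`"id"`, `"graph"`, `"conversations"`) and the arguments passed to `get_fix_shape_subgraph_sequence_fast` are assumptions; compare against `utils/data_process.py` and an existing `sampled_2_10_*.jsonl` file for the exact format.

```python
import json

def write_split(node_ids, edge_list, out_path, k_hop=2, sample_size=10):
    with open(out_path, "w") as f:
        for nid in node_ids:
            # Hypothetical call; match the real signature in utils/data_process.py.
            seq = get_fix_shape_subgraph_sequence_fast(edge_list, nid, k_hop, sample_size)
            record = {
                "id": nid,
                "graph": seq,           # fixed-shape node sequence around the center node
                "conversations": [],    # fill in using the task template from *.jsonl
            }
            f.write(json.dumps(record) + "\n")
```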
After obtaining the `processed_data.pt`, `sampled_2_10_test.jsonl`, `sampled_2_10_train.jsonl`, and `sampled_2_10_val.jsonl` files for a new dataset, could you please let me know what additional steps I should take to run experiments in the "single focus" setting for the node classification task on the new dataset? And which specific model on Hugging Face does the term "simteg" correspond to? Thank you for your time!
If I want to train the model on another dataset, how can I get the model parameter files such as `sbert_x.pt` for that dataset? It seems there is no code for this.