VITA-Group / LLaGA

[ICML2024] "LLaGA: Large Language and Graph Assistant", Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, Zhangyang Wang
Apache License 2.0

How to get 'sbert_x.pt' and other files for new dataset ? #19

Open naskk1 opened 1 month ago

naskk1 commented 1 month ago

If I want to train the model on another dataset, how can I get files such as 'sbert_x.pt' for that dataset? There seems to be no code for this.

ChenRunjin commented 1 month ago

Hi, you can find the code for generating SBERT embeddings in the get_sbert_embedding() function in utils/data_process.py.
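For readers who want a starting point, the following is a minimal sketch of how node texts could be turned into an embedding tensor and saved under a name like 'sbert_x.pt'. It is not the repository's get_sbert_embedding() function. The helper name embed_node_texts, the model name in the commented usage, and the choice to embed the raw node text directly are all assumptions; the thread does not state which checkpoint or input template the authors used.

```python
import torch


def embed_node_texts(texts, encoder, batch_size=32):
    """Encode one text per node into a [num_nodes, dim] float tensor.

    `encoder` is any object exposing a sentence-transformers style
    `.encode(...)` method, for example a `SentenceTransformer` instance.
    (Hypothetical helper, not part of the LLaGA code base.)
    """
    embeddings = encoder.encode(
        list(texts),
        batch_size=batch_size,
        convert_to_tensor=True,
        show_progress_bar=False,
    )
    # Move to CPU so the saved file can be loaded on any machine.
    return embeddings.cpu()


# Example usage (downloads a model, so it is left commented out).
# The checkpoint name below is an assumption, not confirmed by the thread.
# from sentence_transformers import SentenceTransformer
# encoder = SentenceTransformer("all-MiniLM-L6-v2")
# x = embed_node_texts(["text of node 0", "text of node 1"], encoder)
# torch.save(x, "sbert_x.pt")
```

Row i of the resulting tensor should correspond to node i of the graph, so the texts need to be passed in node-index order.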

naskk1 commented 1 month ago

Thank you for your reply, but what is the input text template for the embedding? I think it differs from the template used for Q&A. I have tried a few, but they don't work well.

ManuelSerna commented 1 month ago

Hi,

May I ask how one would obtain the .jsonl files, and also the processed_data.pt file, for a new dataset?

Thank you for your time.

ChenRunjin commented 1 month ago

Hi, because raw data formats vary across datasets, we don't have a single unified function for generating processed_data.pt. To create processed_data.pt for a new dataset, you only need to build a Data instance in PyG format that includes edge_index. Also make sure that data.label_texts contains all label names, and that data.raw_texts contains the node text features if you want to train on the node description task.

In general, we follow the guidelines from this repo to generate the edge_index.

To create the *.jsonl files, the main task is to generate, for each node, the node sequence that represents its surrounding structure, following the template used in our *.jsonl files. You can use our get_fix_shape_subgraph_sequence_fast function in utils/data_process.py to generate the node sequence.
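To make the idea of a fixed-shape node sequence concrete, here is a small pure-Python sketch. It is not the repository's get_fix_shape_subgraph_sequence_fast. It assumes that the sequence is built by sampling a fixed number of neighbors per hop and padding missing neighbors with a placeholder, and that the 2 and 10 in file names such as sampled_2_10_train.jsonl stand for hop count and neighbors per hop; the thread confirms neither. The PAD value and both function names are hypothetical.

```python
import random

PAD = -1  # hypothetical placeholder id for a missing neighbor


def build_adjacency(edge_index, num_nodes):
    """Turn a [2, E] edge list (two parallel lists) into undirected adjacency lists."""
    neighbors = [[] for _ in range(num_nodes)]
    for src, dst in zip(edge_index[0], edge_index[1]):
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    return neighbors


def fixed_shape_sequence(center, neighbors, hops=2, sample_size=10, rng=None):
    """Return a sequence of length 1 + s + s^2 + ... + s^hops for one center node."""
    rng = rng or random.Random(0)
    sequence = [center]
    frontier = [center]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            if node == PAD:
                picked = []  # a padded slot has only padded children
            else:
                candidates = neighbors[node]
                if len(candidates) <= sample_size:
                    picked = list(candidates)
                else:
                    picked = rng.sample(candidates, sample_size)
            # Pad so every node contributes exactly sample_size slots.
            picked = picked + [PAD] * (sample_size - len(picked))
            next_frontier.extend(picked)
        sequence.extend(next_frontier)
        frontier = next_frontier
    return sequence


adjacency = build_adjacency([[0, 0, 1], [1, 2, 3]], num_nodes=4)
sequence = fixed_shape_sequence(0, adjacency, hops=2, sample_size=2)
```

Because every node contributes the same number of slots, all sequences have the same length regardless of node degree, which is what allows them to be stored one per line in a file with a fixed template.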

honey0219 commented 1 month ago

After obtaining the processed_data.pt, sampled_2_10_test.jsonl, sampled_2_10_train.jsonl, and sampled_2_10_val.jsonl files for a new dataset, could you please let me know what additional steps I should take to run experiments in the "single focus" setting for the node classification task on the new dataset? And what specific model on Hugging Face corresponds to the term "simteg"? Thank you for your time!