HKUDS / GraphGPT

[SIGIR'2024] "GraphGPT: Graph Instruction Tuning for Large Language Models"
https://arxiv.org/abs/2310.13023
Apache License 2.0

Could this model be used for fine-tuning on other custom graph data? #33

Closed: WEIYanbin1999 closed this issue 6 months ago

WEIYanbin1999 commented 6 months ago

If so, how should I organize the graph, and where should I place the data file?

WEIYanbin1999 commented 6 months ago

I noticed that your dataset looks like this: [image]. Here the value of the "graph" key shows "node list", "edge index", etc. I find this a bit confusing, and I'd like to know where the graph structure is stored. Is it an adjacency matrix or an edge list?

serendipity800 commented 6 months ago

I'm not the author, but I think this is probably a PyG-style representation of the graph. The graph dict in each data point describes a subgraph of a whole graph (such as arxiv). node_idx is the id of the central node in the original graph; edge_index holds the edges of the subgraph, with node ids numbered locally from 0; and node list maps each local node id to its global id in the big graph. For comparison, see the subgraphs produced by torch_geometric.loader.NeighborLoader. As for loading the graph dataset, see the dataset class in the training code: it determines which "big" graph each subgraph belongs to and loads it.
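
To make the local/global id mapping described above concrete, here is a minimal pure-Python sketch. It is not the repository's preprocessing code: the function names, the key spelling "node_list", the breadth-first sampling, and the toy edge list are all assumptions made for illustration.

```python
from collections import deque


def build_subgraph_record(edges, center, num_hops=1):
    """Collect the num_hops neighbourhood of `center` and relabel it locally.

    `edges` is a list of (src, dst) pairs using GLOBAL node ids.
    The returned dict mimics the layout described in the comment above;
    the key name "node_list" is an assumption.
    """
    # Undirected adjacency built from the global edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Breadth-first search; the central node receives local id 0.
    node_list = [center]
    depth = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if depth[u] == num_hops:
            continue
        for v in sorted(adj.get(u, ())):
            if v not in depth:
                depth[v] = depth[u] + 1
                node_list.append(v)
                queue.append(v)

    # Position in node_list is the local id; the stored value is the global id.
    local = {g: i for i, g in enumerate(node_list)}

    # Keep only edges whose endpoints are both inside the subgraph.
    src, dst = [], []
    for u, v in edges:
        if u in local and v in local:
            src.append(local[u])
            dst.append(local[v])
    return {"node_idx": center, "edge_index": [src, dst], "node_list": node_list}


def to_global_edges(record):
    """Translate the local edge_index pairs back to global node ids."""
    nodes = record["node_list"]
    src, dst = record["edge_index"]
    return [(nodes[s], nodes[d]) for s, d in zip(src, dst)]


# Toy whole graph with global ids (made-up data).
toy_edges = [(10, 20), (20, 30), (10, 40), (50, 60)]
record = build_subgraph_record(toy_edges, center=10, num_hops=1)
print(record)
print(to_global_edges(record))
```

So the graph structure is an edge list in COO form (two parallel lists of source and destination ids), not an adjacency matrix, and node_list is what ties the local ids back to the whole graph.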

WEIYanbin1999 commented 6 months ago

Thanks for the clear explanation. I understand now.

tjb-tech commented 6 months ago

Thanks for your nice explanation🥰

linwhitehat commented 5 months ago

May I ask how to modify the whole graph in order to use new graph data, e.g. graph_data_all.pt? The author doesn't seem to have covered this part of the work.

serendipity800 commented 5 months ago

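You can open the graph_data_all.pt file with torch.load. It appears to contain a dictionary of the form {dataset name: graph data in PyG format}, so adding your new graph data to it with update should be enough. Then, when you generate a new instruction-following dataset, the relevant part should be the id field, e.g. arxivlp...: the name before the first _ just needs to match the key of your whole-graph data, and the dataset-processing classes will handle the rest. My memory may not be exact, but that is roughly how it works.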
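
The naming convention described above can be sketched as follows. This is only an illustration under stated assumptions: a plain dict stands in for the dictionary that torch.load would return from graph_data_all.pt, and the helper names, the dataset name "mygraph", and the sample id "mygraph_train_0" are all hypothetical.

```python
def dataset_name_from_id(sample_id):
    """Return the part of a sample id that precedes the first underscore."""
    name, sep, _rest = sample_id.partition("_")
    if not sep or not name:
        raise ValueError(f"id {sample_id!r} has no '<dataset>_' prefix")
    return name


def register_graph(graph_store, name, graph):
    """Add a new whole graph to the store without silently overwriting one."""
    if name in graph_store:
        raise KeyError(f"dataset {name!r} is already present")
    graph_store.update({name: graph})
    return graph_store


def lookup_graph(graph_store, sample_id):
    """Find the whole graph that a sample's subgraph was cut from."""
    name = dataset_name_from_id(sample_id)
    if name not in graph_store:
        raise KeyError(f"no whole graph registered for dataset {name!r}")
    return graph_store[name]


# Stand-in for the loaded dictionary; real values would be PyG graph objects.
graph_store = {"arxiv": "<whole arxiv graph>"}
register_graph(graph_store, "mygraph", "<whole custom graph>")

# Hypothetical id of a newly generated instruction-following sample.
print(lookup_graph(graph_store, "mygraph_train_0"))
```

If the updated dictionary is written back with torch.save, keep in mind that recent PyTorch releases default torch.load to weights_only=True, so loading a file that holds full PyG objects may need weights_only=False.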

linwhitehat commented 5 months ago

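Thank you very much. I had noticed this too, but I still have a question about updating graph_data_all.pt: it seems to store node feature vectors, and the author doesn't appear to explain how they were generated.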

serendipity800 commented 5 months ago

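As I recall, the author used a 128-dimensional BERT to encode each node's text as the initial node embeddings. You can look for another issue I opened in this repository, where the author gave me a careful reply.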

linwhitehat commented 5 months ago

Thanks. Is it the question in #28? I see the author provided a BERT model there.