acbull / GPT-GNN

Code for KDD'20 "Generative Pre-Training of Graph Neural Networks"
MIT License

want to pretrain on my own datasets #21

Open Juicechen95 opened 4 years ago

Juicechen95 commented 4 years ago

Hi, acbull~ I think this algorithm is very interesting and I really want to test it on my own graph dataset. Do you have any advice or tips on how to prepare my own pretraining graph data? Thank you very much~~

acbull commented 4 years ago

Hi:

You can simply follow the paradigm of preprocess*.py to parse your graph into our data format, and then just run pretrain*.py over that parsed graph.

Or, if you want to merge our code into your own system, you can rewrite the data structure; everything else stays the same.
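To make the preprocessing step concrete, here is a minimal sketch of parsing an edge list into a heterogeneous-graph container. The class and method names (`HetGraph`, `add_node`, `add_edge`) are illustrative, not GPT-GNN's actual API; follow the repo's preprocess*.py for the real data format.

```python
from collections import defaultdict

# Hypothetical heterogeneous-graph container, for illustration only.
# The repo's own Graph class (see its data/preprocessing code) is the
# authoritative format; this just shows the general shape of the task.
class HetGraph:
    def __init__(self):
        # node features keyed by node type, then node id
        self.node_feature = defaultdict(dict)
        # edges keyed by (target_type, source_type, relation_type),
        # then target id -> list of source ids
        self.edge_list = defaultdict(lambda: defaultdict(list))

    def add_node(self, ntype, nid, feature=None):
        self.node_feature[ntype][nid] = feature

    def add_edge(self, src, tgt, relation):
        # src and tgt are (node_type, node_id) pairs
        key = (tgt[0], src[0], relation)
        self.edge_list[key][tgt[1]].append(src[1])

# Example: one author writing one paper
g = HetGraph()
g.add_node('paper', 0)
g.add_node('author', 10)
g.add_edge(('author', 10), ('paper', 0), 'writes')
```

The point is simply that each node carries a type (and optionally a feature vector) and each edge carries a relation type; once your graph is in whatever typed format the repo expects, the pretraining script can consume it.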

Juicechen95 commented 4 years ago

Thank you very much for your advice, I will try it~

Juicechen95 commented 4 years ago

My own dataset contains more than 10 million nodes. The paper says the OAG dataset contains more than 178 million nodes, but according to pretrain_OAG.py only about 1 million nodes are used for pretraining. Is that number right? I don't understand why only a small part of OAG is used for pretraining.

I also wonder how large a dataset this method can handle, since mine is very large. Is that realistic? Will I run into memory problems, or be unable to sample subgraphs? Do you have any advice on how to pretrain on a very large dataset? I would be very grateful for your answer~~ Thanks a lot!!!

acbull commented 4 years ago

Since we utilize subgraph sampling during training, the size of the pretraining graph doesn't matter much. In my experiments I also tried the whole OAG dataset, but it's too big, so I didn't provide it on Google Drive. So yes, you can use our code to do pretraining on a super-large dataset.
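The reason graph size doesn't dominate memory is that each training step only materializes a small sampled subgraph. A minimal homogeneous sketch of budget-limited expansion from seed nodes (a simplified stand-in for the paper's HGSampling, not the repo's implementation):

```python
import random

def sample_subgraph(adj, seed_nodes, budget):
    """Grow a subgraph from seed nodes until `budget` nodes are sampled.

    `adj` maps node -> list of neighbors. This is an illustrative
    frontier-expansion sampler; the paper's HGSampling additionally
    balances node types and samples by importance.
    """
    sampled = set(seed_nodes)
    frontier = list(seed_nodes)
    while frontier and len(sampled) < budget:
        # expand a random frontier node
        node = frontier.pop(random.randrange(len(frontier)))
        for nb in adj.get(node, []):
            if nb not in sampled:
                sampled.add(nb)
                frontier.append(nb)
                if len(sampled) >= budget:
                    break
    return sampled

# Example on a small path graph 0-1-2-3-4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
sub = sample_subgraph(adj, [0], budget=3)
```

Because only `budget` nodes (and their induced edges) ever live in memory per step, the full graph can stay on disk or in a compact CPU-side structure, which is what makes pretraining on a very large graph feasible.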

Juicechen95 commented 3 years ago

Thank you for your patient reply~ I will try it.