Open Juicechen95 opened 4 years ago
Hi:
You can follow the same paradigm as preprocess*.py to parse your graph into our data format, and then run pretrain*.py over the parsed graph.
Or, if you want to merge our code into your own system, you may need to rewrite the data structure, but everything else should stay similar.
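To make the "parse your graph into a data format" step concrete, here is a minimal sketch of turning a typed edge list into an indexed heterogeneous graph. Everything in it is an assumption for illustration: `HeteroGraph`, `parse_edge_list`, the row layout, and the node and relation names are hypothetical and are not the repository's actual classes or file format, so check preprocess*.py for the real structure.

```python
from collections import defaultdict


class HeteroGraph:
    """Hypothetical container: typed nodes plus typed, directed edges."""

    def __init__(self):
        # node_index[node_type][raw_id] -> contiguous integer index per type
        self.node_index = defaultdict(dict)
        # edges[(src_type, relation, dst_type)] -> list of (src_idx, dst_idx)
        self.edges = defaultdict(list)

    def add_node(self, node_type, raw_id):
        """Register a node if unseen and return its per-type index."""
        index = self.node_index[node_type]
        if raw_id not in index:
            index[raw_id] = len(index)
        return index[raw_id]

    def add_edge(self, src, relation, dst):
        """Add a directed edge; src and dst are (node_type, raw_id) pairs."""
        src_type, src_id = src
        dst_type, dst_id = dst
        s = self.add_node(src_type, src_id)
        d = self.add_node(dst_type, dst_id)
        self.edges[(src_type, relation, dst_type)].append((s, d))


def parse_edge_list(rows):
    """Build a graph from (src_type, src_id, relation, dst_type, dst_id) rows."""
    graph = HeteroGraph()
    for src_type, src_id, relation, dst_type, dst_id in rows:
        graph.add_edge((src_type, src_id), relation, (dst_type, dst_id))
    return graph


# Toy input; a real dataset would stream these rows from disk.
rows = [
    ("paper", "p1", "written_by", "author", "a1"),
    ("paper", "p2", "written_by", "author", "a1"),
    ("paper", "p2", "cites", "paper", "p1"),
]
graph = parse_edge_list(rows)
```

Mapping raw ids to contiguous per-type indices is a common choice because it lets node features live in one dense array per node type.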
Thank you very much for your advice, I will try it~
My own dataset contains more than 10 million nodes. The paper says the OAG dataset contains more than 178 million nodes, but according to pretrain_OAG.py only about 1 million nodes seem to be used for pretraining. Is that number right? I don't understand why only a small part of OAG is used for pretraining. I also wonder how large a dataset this method can handle, since mine is very large. Is that realistic? Will I run into memory problems, or be unable to sample the small subgraphs? Do you have any ideas on how to do pretraining on a very large dataset? I would be very grateful for your answer~~ Thanks a lot!!!
Since we use subgraph sampling during training, the size of the pretraining graph does not matter much. In my experiments I also tried the whole OAG dataset, but it is too big, so I didn't provide it on Google Drive. So you can certainly use our code to do pretraining on a very large dataset.
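As a rough illustration of why sampling decouples batch size from graph size, the sketch below uses plain fixed-fanout neighbor sampling. This is a simplified stand-in, not necessarily the sampler the repository uses, and `sample_subgraph`, `fanout`, `depth`, and the toy graph are all hypothetical. Note that it only bounds the size of each sampled subgraph; the adjacency itself still has to be held in memory or read from somewhere.

```python
import random


def sample_subgraph(adjacency, seeds, fanout, depth, rng):
    """Collect nodes by sampling at most `fanout` neighbors per node for `depth` hops."""
    sampled = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            neighbors = adjacency.get(node, [])
            k = min(fanout, len(neighbors))
            # Sample without replacement from this node's neighbor list.
            for nb in rng.sample(neighbors, k):
                if nb not in sampled:
                    sampled.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return sampled


# Toy graph: 10,000 nodes, each linked to three others.
n = 10_000
adjacency = {i: [(i - 1) % n, (i + 1) % n, (i + 7) % n] for i in range(n)}

rng = random.Random(0)
seeds = [0, 5000]
nodes = sample_subgraph(adjacency, seeds, fanout=2, depth=3, rng=rng)

# The subgraph size is bounded by the sampling parameters, not by n:
# at most len(seeds) * (1 + fanout + fanout**2 + fanout**3) nodes.
upper_bound = len(seeds) * sum(2 ** hop for hop in range(4))
```

With these settings the sampled node set can never exceed 30 nodes, whether the full graph has ten thousand nodes or a hundred million.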
Thank you for your patient reply~I will try it.
Hi, acbull~ I think this algorithm is very interesting and I really want to test it on my own graph dataset. Is there any advice or tips on how to prepare my own pretraining graph data? Thank you very much~~