DeepGraphLearning / graphvite

GraphVite: A General and High-performance Graph Embedding System
https://graphvite.io
Apache License 2.0

out of memory using node2vec on small network #81

Open mginabluebox opened 3 years ago

mginabluebox commented 3 years ago

Hello @KiddoZhu ,

I ran node2vec on a graph with 685,551 nodes using 2 GPUs with 32 GB each and 120 GB of CPU memory, but I keep getting out-of-memory errors. I set p = q = 1. With the same hyperparameters, the run produces embeddings when the graph is treated as directed, but it fails when the graph is treated as undirected (which is what I want).

Here's the log:

Graph<uint32>
------------------ Graph -------------------
#vertex: 685551, #edge: 1386002
as undirected: yes, normalization: no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[time] GraphApplication.load: 4.78096 s
[time] GraphApplication.build: 2.6862 s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
GraphSolver<128, float32, uint32>
----------------- Resource -----------------
#worker: 2, #sampler: 46, #partition: 2
tied weights: no, episode size: 599
gpu memory limit: 32 GiB
gpu memory cost: 386 MiB
----------------- Sampling -----------------
augmentation step: 10, p: 1, q: 1
random walk length: 40
random walk batch size: 100
#negative: 1, negative sample exponent: 0.75
----------------- Training -----------------
model: node2vec
optimizer: SGD
learning rate: 0.025, lr schedule: linear
weight decay: 0.005
#epoch: 2000, batch size: 100000
resume: no
positive reuse: 1, negative weight: 5
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
/bin/bash: line 1: 1144912 Killed                  python ...
slurmstepd: error: Detected 1 oom-kill event(s) in step 1491949.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

As you can see, the process was killed by the cgroup out-of-memory handler before training could begin. The hyperparameters related to memory cost (episode size, batch size, etc.) seem reasonable for a dataset of this scale. I set the augmentation step to 10 because the graph is sparse. Do you have any suggestions on what might have gone wrong here? Thanks a lot!
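
For reference, here is a rough back-of-the-envelope check of the numbers in the log, as a minimal sketch rather than a description of what GraphVite actually allocates. The helper names are hypothetical. It assumes two float32 embedding matrices (vertex and context, since the log reports "tied weights: no"). It also assumes, without confirmation from this thread, that second-order walk tables are precomputed per edge as in the reference node2vec implementation, in which case the number of table entries on an undirected graph grows with the sum of squared degrees.

```python
MIB = 1024 ** 2


def embedding_bytes(num_vertex, dim, bytes_per_value=4, num_matrices=2):
    """Bytes needed for the embedding matrices (hypothetical helper).

    Assumes num_matrices separate float32 matrices of shape (num_vertex, dim).
    """
    return num_vertex * dim * bytes_per_value * num_matrices


def second_order_entries(degrees):
    """Entries in per-edge second-order transition tables (hypothetical helper).

    Assumption: one table per directed edge (u, v), each with deg(v) entries.
    On an undirected graph every vertex v has deg(v) incoming edges, so the
    total is the sum of deg(v) ** 2 over all vertices.
    """
    return sum(d * d for d in degrees)


# Values taken from the log: 685551 vertices, 128-dimensional float32 embeddings.
emb = embedding_bytes(685551, 128)
print(f"embeddings: {emb / MIB:.1f} MiB")

# Toy star graph: one hub connected to 1000 leaves.
# A single high-degree vertex dominates the table size.
star_degrees = [1000] + [1] * 1000
star_entries = second_order_entries(star_degrees)
print(f"star graph: {sum(star_degrees) // 2} edges, {star_entries} table entries")
```

Under these assumptions the embeddings take well under 1 GiB, so they alone cannot explain exhausting 120 GB of host memory. The toy star graph shows how a few high-degree vertices could inflate per-edge tables far beyond the edge count, which might be worth checking against the degree distribution of the real graph.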

KiddoZhu commented 3 years ago

Looks weird. I will check the memory usage of node2vec and respond here later.

mginabluebox commented 3 years ago

@KiddoZhu Just want to follow up on this!

DaliaDawod commented 1 year ago

Did you find a solution to this?