DeepGraphLearning / graphvite

GraphVite: A General and High-performance Graph Embedding System
https://graphvite.io
Apache License 2.0

Large network "std::bad_alloc" error #19

Closed: PJthunder closed this issue 4 years ago

PJthunder commented 5 years ago

I have tried DeepWalk on a large network (nearly 20M nodes) on a machine with 400 GB RAM and 4 GPUs. Here is the error message:

terminate called after throwing an instance of 'std::bad_alloc'
terminate called recursively
terminate called recursively
  what():  std::bad_alloc
terminate called recursively

Is that because I set the augmentation step too large, so the CPU memory is not large enough?

KiddoZhu commented 5 years ago

It's the sample pools that cause the memory overflow. With the auto value, episode_size uses a very large sample pool to maximize speed, regardless of memory. For 400 GB RAM with the default batch size of 1e5, I guess an episode size around 5k might be fine, although it will sacrifice a little speed compared to auto.

We're planning to add an adaptive CPU memory mechanism, but it takes non-trivial effort.
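For reference, here is a minimal sketch of pinning the episode size explicitly through the Python API instead of relying on auto. The keyword arguments are assumptions patterned on the hyperparameters shown in the logs below (batch size, episode size, num_partition), so check them against your GraphVite version or the corresponding fields of a YAML config.

```python
import graphvite as gv
import graphvite.application as gap

# Sketch only: argument names mirror the hyperparameter log and may differ
# slightly in your GraphVite version; the input file name is hypothetical.
app = gap.GraphApplication(dim=128)
app.load(file_name="my_edge_list.txt")
app.build(
    optimizer=gv.optimizer.SGD(lr=0.025, weight_decay=0.005, schedule="linear"),
    num_partition="auto",
    batch_size=100000,    # default batch size (1e5)
    episode_size=5000,    # explicit value instead of "auto" to bound the sample pools
)
app.train(model="DeepWalk", num_epoch=2000)
```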

PJthunder commented 5 years ago

Actually I am using an episode_size of 3500 and a batch_size of 100000. Maybe it is still too large?

KiddoZhu commented 5 years ago

It's not that large. That is only about 100 GB for the sample pools if num_partition is 4.

What dimension do you use for embeddings?

PJthunder commented 5 years ago

The dimension is 128, and num_partition is set to auto. Should I set the partition number manually?

KiddoZhu commented 5 years ago

Not necessary, auto is fine. Could you show me the hyperparameter log printed by graphvite?

PJthunder commented 5 years ago

I just ran with episode_size=500 and it runs normally now, using about 250 GB of memory. I've attached a screenshot of the current log:

[screenshot of the GraphVite log, 2019-08-23]

KiddoZhu commented 5 years ago

Maybe my estimation isn't right. I estimate that going from 500 to 3500 costs an additional 77 GB of memory, which should still be available. Anyway, you can continue with your current settings. We may look into this in the future if it really becomes an issue.
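For what it's worth, the two figures quoted here (about 100 GB at episode_size=3500, and roughly 77 GB extra going from 500 to 3500) are consistent with a simple back-of-envelope model of the sample pools. The layout assumed below (episode_size × batch_size samples per partition pair, 8 bytes per sample, double-buffered) is a guess rather than GraphVite's actual implementation, but it reproduces both numbers.

```python
# Back-of-envelope sample-pool estimate. The assumed layout (one 8-byte sample
# per partition pair, double-buffered) is a guess, not GraphVite's real code,
# but it matches the figures quoted in this thread.
def pool_gb(episode_size, batch_size=100_000, num_partition=4,
            bytes_per_sample=8, num_pools=2):
    samples = episode_size * batch_size * num_partition ** 2
    return samples * bytes_per_sample * num_pools / 1e9

print(pool_gb(3500))                 # ~89.6 GB, roughly the "about 100 GB" above
print(pool_gb(3500) - pool_gb(500))  # ~76.8 GB, the extra cost of 500 -> 3500
```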

Using 500 at such a scale has a perceptible influence on speed. You can set positive_reuse to 10, which then matches the speed of an episode size of 5000. The performance may hurt a little bit, but that is essentially what the original DeepWalk does anyway.

KiddoZhu commented 5 years ago

An additional tip: since the graph is very dense, the augmentation step should be something like 1 or 2.
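Continuing the sketch above, the train-time settings suggested in this thread would look roughly like the following; again, the keyword names are assumptions patterned on the hyperparameter log, so adapt them to your own config.

```python
# Continues the earlier sketch (reuses `app`, with episode_size=500 in build).
# Keyword names are assumed from the hyperparameter log and may differ.
app.train(
    model="DeepWalk",
    num_epoch=2000,
    augmentation_step=2,   # dense graph: keep the augmentation step at 1 or 2
    positive_reuse=10,     # recovers roughly the throughput of episode_size=5000
)
```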

mginabluebox commented 3 years ago

Hello,

I'm also facing this issue with node2vec. My network has #vertex: 4119272, #edge: 94873549, and I'm running with 4 GPU cards and 360 GB of CPU memory. I used episode_size = 100, batch_size = 100000:

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Graph<uint32>
------------------ Graph -------------------
#vertex: 4119272, #edge: 94873549
as undirected: yes, normalization: no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[time] GraphApplication.load: 72.6117 s
[time] GraphApplication.build: 4.31637 s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
GraphSolver<128, float32, uint32>
----------------- Resource -----------------
#worker: 4, #sampler: 44, #partition: 4
tied weights: no, episode size: 100
gpu memory limit: 32 GiB
gpu memory cost: 1.07 GiB
----------------- Sampling -----------------
augmentation step: 10, p: 1, q: 1
random walk length: 40
random walk batch size: 100
#negative: 1, negative sample exponent: 0.75
----------------- Training -----------------
model: node2vec
optimizer: SGD
learning rate: 0.025, lr schedule: linear
weight decay: 0.005
#epoch: 2000, batch size: 100000
resume: no
positive reuse: 1, negative weight: 5
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/bin/bash: line 1: 1731210 Aborted 

I'm now trying with batch_size = 10000. Is this because node2vec is more memory-consuming than DeepWalk?

KiddoZhu commented 3 years ago

@mginabluebox Yes. DeepWalk and LINE scale linearly w.r.t. |E|, but node2vec scales at least linearly w.r.t. |E|^2/|V| in the case of d-regular graphs, and can be worse if the degree distribution is skewed.
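Plugging the graph statistics from the log above into that bound gives a sense of the gap; this is just the ratio implied by the formula, not a measured memory figure.

```python
# Rough scaling comparison using the statistics printed in the log above.
V = 4119272    # #vertex
E = 94873549   # #edge
deepwalk_term = E            # DeepWalk / LINE: linear in |E|
node2vec_term = E ** 2 / V   # node2vec: at least |E|^2 / |V| for d-regular graphs
print(node2vec_term / deepwalk_term)  # ~23x larger, before accounting for degree skew
```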

I suggest trying DeepWalk or LINE instead of node2vec. The former are more robust in terms of default hyperparameters; node2vec won't bring a significant gain unless you perform an exhaustive search over p and q on your dataset.