PJthunder closed this issue 4 years ago
It's the sample pools that cause the memory overflow. The `auto` value of `episode_size` will use a very large sample pool to maximize speed, regardless of memory issues. For 400GB RAM, if you use the default batch size of 1e5, I guess an episode size around 5k might be fine, although it will sacrifice a little speed compared to `auto`.
We're planning to add an adaptive CPU memory mechanism, but it takes non-trivial effort.
Actually I am using an `episode_size` of 3500 and a `batch_size` of 100000. Maybe it is still too large?
It's not so large. This is just about 100GB for sample pools, if `num_partition` is 4.
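For reference, here is a back-of-envelope sketch of the sample pool memory. The per-sample size and pool layout are my assumptions, not graphvite internals, chosen so that the formula reproduces the ~100GB figure above:

```python
# Rough sample-pool memory estimate. ASSUMPTIONS: one pool per
# (source partition, target partition) pair, ~16 bytes per sample.
def sample_pool_bytes(episode_size, batch_size, num_partition,
                      bytes_per_sample=16):
    num_pools = num_partition ** 2
    samples_per_pool = episode_size * batch_size
    return num_pools * samples_per_pool * bytes_per_sample

gb = sample_pool_bytes(3500, 100000, 4) / 1e9
print(f"{gb:.1f} GB")  # prints 89.6 GB, in the ballpark of ~100GB
```

Under the same assumptions, going from `episode_size=500` to 3500 adds 16 pools × 3000 × 1e5 samples × 16 B ≈ 77GB, which matches the estimate given later in the thread.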
What dimension do you use for embeddings?
The dimension is 128. I set `num_partition` to `auto`. Should I change the partition number manually?
Not necessary, `auto` is fine. Could you show me the log of hyperparameters printed by graphvite?
I just ran with `episode_size=500`. It began to run normally, using about 250GB of memory. I can attach a screenshot of the current log:
Maybe my estimation isn't right. I estimated that increasing 500 to 3500 would cost an additional 77GB of memory, which should still be available. Anyway, you can continue with your current settings. We may look into it in the future if it really becomes an issue.
Using 500 at such a scale has a perceptible impact on speed. You may set `positive_reuse` to 10, which then matches the speed of `episode_size=5000`. The performance may hurt a little bit, but the original DeepWalk does the same thing anyway.
An additional tip: since the graph is very dense, the augmentation step should be something like 1 or 2.
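To illustrate the speed argument above (this is my reading of it, assuming the expensive step is refilling the sample pool once per episode, amortized by `positive_reuse`):

```python
# Hypothetical cost model, NOT graphvite internals: one pool refill per
# episode, and positive_reuse lets each pool be consumed that many times
# before the next refill.
def pool_refills(num_epoch, episode_size, positive_reuse=1):
    episodes = num_epoch / episode_size
    return episodes / positive_reuse

# episode_size=500 with positive_reuse=10 refills as rarely as
# episode_size=5000 with no reuse:
assert pool_refills(2000, 500, positive_reuse=10) == pool_refills(2000, 5000)
```

Under this model, the 500/reuse-10 setting keeps the smaller pool's memory footprint while paying the sampling cost of the 5000 setting.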
Hello,
I'm also facing this issue with node2vec. My network has #vertex: 4119272, #edge: 94873549, and I'm running with 4 GPU cards and 360GB of CPU memory. I used `episode_size = 100`, `batch_size = 100000`:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Graph<uint32>
------------------ Graph -------------------
#vertex: 4119272, #edge: 94873549
as undirected: yes, normalization: no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[time] GraphApplication.load: 72.6117 s
[time] GraphApplication.build: 4.31637 s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
GraphSolver<128, float32, uint32>
----------------- Resource -----------------
#worker: 4, #sampler: 44, #partition: 4
tied weights: no, episode size: 100
gpu memory limit: 32 GiB
gpu memory cost: 1.07 GiB
----------------- Sampling -----------------
augmentation step: 10, p: 1, q: 1
random walk length: 40
random walk batch size: 100
#negative: 1, negative sample exponent: 0.75
----------------- Training -----------------
model: node2vec
optimizer: SGD
learning rate: 0.025, lr schedule: linear
weight decay: 0.005
#epoch: 2000, batch size: 100000
resume: no
positive reuse: 1, negative weight: 5
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
/bin/bash: line 1: 1731210 Aborted
I'm now trying with batch size = 10000. Is this because node2vec is more memory-consuming than DeepWalk?
@mginabluebox Yes. DeepWalk and LINE scale linearly w.r.t. |E|, but node2vec scales at least linearly w.r.t. |E|^2/|V| in the case of d-regular graphs, and can be worse if the degree distribution is skewed.
I suggest trying DeepWalk or LINE instead of node2vec. The former are more robust in terms of default hyperparameters. node2vec won't bring significant gains unless you perform an exhaustive search of `p` and `q` on your dataset.
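Plugging the numbers from this thread into those scaling terms (these are complexity terms, not exact byte counts):

```python
# Compare the memory-scaling terms for the graph above.
V = 4_119_272   # #vertex
E = 94_873_549  # #edge

deepwalk_term = E          # DeepWalk / LINE: linear in |E|
node2vec_term = E * E / V  # node2vec: at least |E|^2 / |V| (d-regular case)

print(f"node2vec term is ~{node2vec_term / deepwalk_term:.0f}x larger")
# prints: node2vec term is ~23x larger
```

The ratio is just the average degree |E|/|V|, so the denser the graph, the worse node2vec's footprint relative to DeepWalk or LINE.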
I have tried DeepWalk on a large network (nearly 20M nodes) on a machine with 400GB RAM and 4 GPUs. Here is the error message:
terminate called after throwing an instance of 'std::bad_alloc' terminate called recursively terminate called recursively what(): std::bad_alloc terminate called recursively
Is that because I set the augmentation step too large, so the CPU memory is not enough?
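For what it's worth, a hypothetical sketch of why a large augmentation step can blow up CPU memory; the linear-in-step growth is an assumption, and the edge count below is made up for illustration:

```python
# ASSUMPTION: augmentation with step k adds positive pairs for neighbors
# up to k hops away, so the augmented edge set grows roughly with k
# (and can grow much faster on dense graphs).
def augmented_edges(num_edges, augmentation_step):
    return num_edges * augmentation_step

# hypothetical 20M-node graph with 500M edges, augmentation step 10:
print(f"~{augmented_edges(500_000_000, 10) / 1e9:.0f} billion augmented edges")
# prints: ~5 billion augmented edges
```

If the real graph is dense, dropping the augmentation step to 1 or 2 (as suggested earlier in the thread) is the first thing to try.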