node2vec Spark - memory issue

enricopal commented 7 years ago

Hi, I'm trying to run node2vec using the Spark implementation on a large graph (~2.8M nodes, ~41M edges, 4.1GB file). This is the command I'm running:

./spark-submit --class com.navercorp.Main node2vec/node2vec_spark/target/node2vec-0.0.1-SNAPSHOT.jar --cmd node2vec --p 1 --q 1 --walkLength 40 --numWalks 5 --input yago_types.edgelist --output output/yago_types_p1_q1_l40_num5.emb --weighted False --directed False --indexed False

I get this error: "2017-05-09T16:45:13.259237677Z 17/05/09 16:45:13 ERROR scheduler.TaskSchedulerImpl: Lost executor 1 on spark-worker1-97711-prod: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages"

Everything was working fine with a smaller sample, so it seems like a memory problem to me. Have you ever experienced anything similar? Any clue what a proper memory allocation would be for a graph of this size? At the moment, I have a master node with 2GB and six workers with 42GB.

Thank you a lot! Enrico
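
For context on the memory question above: the command as posted does not set any memory options, so Spark's defaults apply unless they are configured elsewhere. The sketch below shows where the standard `spark-submit` options `--driver-memory` and `--executor-memory` would go. The values are illustrative placeholders, not tested recommendations for this graph. It assumes the 42GB figure is per worker and that the driver runs on the 2GB master. The `RUN` switch is a hypothetical convenience added for this example.

```shell
#!/bin/sh
# Sketch only: memory values are illustrative placeholders, not tuned for this graph.
# Assumption: the driver runs on the 2GB master, so it is kept small.
DRIVER_MEMORY="${DRIVER_MEMORY:-1g}"
# Assumption: 42GB is available per worker; 32g leaves some headroom.
EXECUTOR_MEMORY="${EXECUTOR_MEMORY:-32g}"

# Same job arguments as in the original command.
JOB_ARGS="--cmd node2vec --p 1 --q 1 --walkLength 40 --numWalks 5 \
--input yago_types.edgelist --output output/yago_types_p1_q1_l40_num5.emb \
--weighted False --directed False --indexed False"

# spark-submit options must appear before the application jar.
SUBMIT_CMD="./spark-submit \
--driver-memory $DRIVER_MEMORY \
--executor-memory $EXECUTOR_MEMORY \
--class com.navercorp.Main \
node2vec/node2vec_spark/target/node2vec-0.0.1-SNAPSHOT.jar $JOB_ARGS"

# Dry run by default: print the command. Set RUN=1 to execute it.
if [ "${RUN:-0}" = "1" ]; then
  $SUBMIT_CMD
else
  echo "$SUBMIT_CMD"
fi
```

Run as-is, the script only prints the assembled command. Setting `RUN=1` (and optionally overriding `EXECUTOR_MEMORY` or `DRIVER_MEMORY` in the environment) would launch the job. Whether these settings resolve the lost-executor error is not established in this thread.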

august-yeom commented 7 years ago

Thank you for your post!

I solved additional problems. I will send a pull request soon.

Thank you! Ha-neul

aijianiula0601 commented 7 years ago

I have the same problem. Has it been solved?

anbhat87 commented 4 years ago

I have the same problem: I'm facing continuous OOM errors with node2vec on a directed graph. What is the recommended way to address this?
