alibaba / graph-learn

An Industrial Graph Neural Network Framework
Apache License 2.0

Worker memory usage keeps increasing when running graphsage dist_train.py #43

Closed fff-2013 closed 4 years ago

fff-2013 commented 4 years ago

Problem description

When I run the graphsage dist_train.py (Cora data), worker memory usage keeps increasing:

[screenshot: worker memory usage curve, steadily increasing]

When I train the model with our own data, which is a larger graph, memory usage grows even faster:

[screenshot: worker memory usage curve on the larger graph, growing faster]

I suspect there is a memory leak; perhaps some objects from previous iterations are not being freed? Any advice or suggestions would be greatly appreciated.

Environment information for cora data

docker image: registry.cn-zhangjiakou.aliyuncs.com/pai-image/graph-learn:v0.1-cpu

code path: /workspace/graph-learn/examples/tf/graphsage (in docker container)

config: 2 ps, 2 workers / batch size: 32 / epochs: 40000000

fff-2013 commented 4 years ago

After deleting the req/res pointers in edge_sampler.py and neighbor_sampler.py, the memory leak is gone.

diff --git a/graphlearn/python/sampler/edge_sampler.py b/graphlearn/python/sampler/edge_sampler.py
index 4103ca1..79ab314 100644
--- a/graphlearn/python/sampler/edge_sampler.py
+++ b/graphlearn/python/sampler/edge_sampler.py
@@ -83,6 +83,8 @@ class EdgeSampler(object):
                                   src_ids,
                                   dst_ids)
     edges.edge_ids = edge_ids
+    pywrap.del_get_edge_req(req)
+    pywrap.del_get_edge_res(res)
     return edges

diff --git a/graphlearn/python/sampler/neighbor_sampler.py b/graphlearn/python/sampler/neighbor_sampler.py
index 1757f12..b912e50 100644
--- a/graphlearn/python/sampler/neighbor_sampler.py
+++ b/graphlearn/python/sampler/neighbor_sampler.py
@@ -124,6 +124,8 @@ class NeighborSampler(object):
       current_batch_size = nbr_ids_flat.size

       src_ids = nbr_ids
+      pywrap.del_nbr_req(req)
+      pywrap.del_nbr_res(res)
     return layers

   def _make_req(self, index, src_ids):
@@ -200,4 +202,6 @@ class FullNeighborSampler(NeighborSampler):
       current_batch_size = nbr_ids_flat.size

       src_ids = nbr_ids
+      pywrap.del_nbr_req(req)
+      pywrap.del_nbr_res(res)
     return layers
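For readers who hit the same symptom, the underlying pattern can be sketched generically. This is illustrative Python, not graph-learn code: `NativePool` is a made-up stand-in for the C++ allocator behind the SWIG-style `pywrap` bindings, where natively allocated request/response objects are invisible to Python's garbage collector and must be released explicitly, as the patch above does with `pywrap.del_get_edge_req` / `del_get_edge_res`.

```python
class NativePool:
    """Simulates native-side allocations that Python's GC cannot reclaim."""
    def __init__(self):
        self.live = 0            # number of outstanding native objects

    def new_req(self):
        self.live += 1
        return object()          # opaque handle, like a SWIG pointer

    def delete(self, handle):
        self.live -= 1

def sample_leaky(pool, iterations):
    """Allocates a native request per iteration and never frees it."""
    for _ in range(iterations):
        req = pool.new_req()
        # ... issue request, parse response ...
        # req goes out of scope here, but the native side is never freed

def sample_fixed(pool, iterations):
    """Same loop, but frees the native object after each use."""
    for _ in range(iterations):
        req = pool.new_req()
        try:
            pass                 # ... issue request, parse response ...
        finally:
            pool.delete(req)     # explicit free, as in the patch above

pool = NativePool()
sample_leaky(pool, 1000)
print(pool.live)                 # 1000: one leaked object per iteration

pool = NativePool()
sample_fixed(pool, 1000)
print(pool.live)                 # 0: memory stays flat across iterations
```

The `try`/`finally` form also guards against leaking the handle when response parsing raises mid-iteration, which a bare `del_*` call after the parse would not.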

When I went to submit this as feedback, I saw #46, which makes even more of these modifications. Awesome!

jackonan commented 4 years ago

@fff-2013 Sorry to trouble you and thanks for pointing out the problem. We've fixed it and you can try again.