Good questions.
We are trying to parallelize sampling and make it asynchronous with the training process to improve GPU utilization. Reducing sampling time through message fusion can also improve GPU utilization in distributed mode.
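For illustration, here is a minimal sketch of what decoupling sampling from training could look like: a background thread keeps a small bounded buffer of sampled batches filled while the main thread trains. The functions `sample_subgraph` and `train_step` are hypothetical placeholders, not graph-learn APIs.

```python
import queue
import threading
import time

def sample_subgraph():
    # Placeholder for a graph sampling call; pretend it takes a few ms.
    time.sleep(0.005)
    return {"src_ids": [0, 1, 2], "neighbors": [[3, 4], [5], [6, 7]]}

def train_step(batch):
    # Placeholder for one training iteration on the GPU.
    time.sleep(0.002)

buffer = queue.Queue(maxsize=4)  # bounded, so sampling cannot run far ahead

def sampler_loop():
    while True:
        buffer.put(sample_subgraph())  # blocks when the buffer is full

threading.Thread(target=sampler_loop, daemon=True).start()

for _ in range(100):
    train_step(buffer.get())  # a batch is usually ready, so the GPU rarely waits
```

The bounded queue is what makes this "asynchronous": the sampler runs ahead of training by up to `maxsize` batches, so short sampling stalls no longer show up as GPU idle time.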
Aggregator in core/operator/aggregator is a work-in-progress feature that will be used to optimize aggregation in distributed training through message fusion; see https://github.com/alibaba/graph-learn/issues/15.
I have raised a PR about the Aggregator, FYI @backyes.
@baoleai Hi, is 'parallelizing sampling and making it asynchronous with training' already done, or is it still a work in progress? Thanks! I'm also confused about why tf.data.Dataset.prefetch isn't used to do the sampling. I'm a beginner in TensorFlow, so maybe I have misunderstood this method.
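Regarding tf.data.Dataset.prefetch: it can overlap sampling with training when the sampler is exposed as a generator. A minimal sketch, assuming a TF 1.x-style pipeline; the generator below is a dummy stand-in, not graph-learn's sampling API:

```python
import numpy as np
import tensorflow as tf

# Dummy sampler standing in for a graph sampling call; each yield is one
# mini-batch of node ids and node features.
def sample_batches():
    while True:
        ids = np.random.randint(0, 10000, size=(512,)).astype(np.int64)
        feats = np.random.rand(512, 64).astype(np.float32)
        yield ids, feats

dataset = tf.data.Dataset.from_generator(
    sample_batches,
    output_types=(tf.int64, tf.float32),
    output_shapes=((512,), (512, 64)))

# prefetch(1) keeps the next sampled batch ready while the current one is
# being trained on, overlapping sampling with training.
dataset = dataset.prefetch(1)

# TF 1.x-style consumption; in TF 2.x one would simply iterate the dataset.
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
node_ids, node_feats = iterator.get_next()
```

The caveat is that from_generator runs the Python generator in a single thread, so prefetch only hides the latency when one sampling call is no slower than one training step; if sampling itself is the bottleneck, the GPU still stalls, which is part of why parallelizing the sampler itself (as discussed above) still matters.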
Have you solved this problem? I am trying to use this method to do sampling.
If this is not resolved, the GPU will not be fully utilized in some situations. I would appreciate better clarification of these issues, thanks a lot.