Euler2 RGCN speed - Githubissues

dewang23 commented 3 years ago

Hi, I am using the RGCN implementation from the examples directory on a custom dataset with 1137061 nodes and 58336927 edges. There are 6 node types and 67 edge types, no node features, and 1 edge feature (which is equal to the edge type).

The issue I am facing is that training is extremely slow. Training was done on an n1-standard-64 machine on Google Cloud Platform (64 cores, 240 GB memory; see https://cloud.google.com/compute/docs/machine-types) with the following parameters:

layers = 1
num_negs = 2
lr = 0.01
optimizer = adam
hidden_dim = 4
num_epochs = 1
embedding_dim = 4
batch_size = 1024

Training took 183m51.489s in total. These are very low settings; I want to use higher ones (more dimensions, more epochs, etc.), but training time is a problem. Is such a long training time expected for a dataset like this, or is something going wrong?

My training logs can be seen here: https://drive.google.com/file/d/1DyEPa9abK3X0UCiOqsemWZ8yxP5GjXBl/view?usp=sharing
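To put the reported numbers in perspective, here is a rough back-of-envelope sketch. It assumes each batch holds 1024 edges and that one epoch visits every edge once; the thread does not confirm how the example forms its batches, and the wall-clock time also includes graph loading, so the results are only an order-of-magnitude estimate.

```python
import math

# Figures quoted in the post above.
NUM_EDGES = 58_336_927
BATCH_SIZE = 1024
WALL_CLOCK_SECONDS = 183 * 60 + 51.489  # 183m51.489s


def epoch_throughput(num_edges, batch_size, seconds):
    """Return (steps, seconds_per_step, edges_per_second) for one epoch.

    Assumption (not confirmed by the thread): batches are drawn over edges
    and one epoch covers every edge exactly once.
    """
    steps = math.ceil(num_edges / batch_size)
    return steps, seconds / steps, num_edges / seconds


steps, sec_per_step, edges_per_sec = epoch_throughput(
    NUM_EDGES, BATCH_SIZE, WALL_CLOCK_SECONDS
)
print(f"{steps} steps, {sec_per_step:.3f} s/step, {edges_per_sec:.0f} edges/s")
```

Under these assumptions a single step takes a fraction of a second, so the total is dominated by the sheer number of steps rather than by one pathological stall.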

zakheav commented 3 years ago

Try this:

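Euler-2.0: enabling multithreading support in the euler kernel (optional). Euler is primarily a distributed framework optimized for throughput. To reduce the extra overhead of thread scheduling, the euler kernel was developed as single-threaded, which can cause performance problems for single-machine users in some cases. You can therefore try changing the following line in the top-level CMakeLists.txt of the euler project: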

option(USE_OPENMP "Option for using open mp" OFF)

to

option(USE_OPENMP "Option for using open mp" ON)

Then rerun the build.sh script.
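The edit described above can also be scripted. The following is a minimal sketch that flips the option in the text of a CMakeLists.txt file; the helper name `enable_openmp` is hypothetical and not part of Euler, and the pattern assumes the option line looks exactly like the one quoted above.

```python
import re


def enable_openmp(cmake_text):
    """Return cmake_text with the USE_OPENMP option switched from OFF to ON.

    Hypothetical helper: it only rewrites a line shaped like
    option(USE_OPENMP "..." OFF) and leaves everything else untouched.
    """
    pattern = r'(option\(\s*USE_OPENMP\s+"[^"]*"\s+)OFF(\s*\))'
    return re.sub(pattern, r"\g<1>ON\g<2>", cmake_text)


original = 'option(USE_OPENMP "Option for using open mp" OFF)\n'
updated = enable_openmp(original)
print(updated, end="")
```

In practice you would read the top-level CMakeLists.txt, pass its contents through the function, write the result back, and then rerun build.sh as described.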

ergouy commented 3 years ago

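I'd like to ask: after completing the installation by following the tutorial, why can't I find the tf_euler package? Some other packages can't be found either.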

dewang23 commented 3 years ago

@zakheav I tried building euler with OpenMP turned on as you suggested, but there is no improvement in training time. Here are the logs:

Training (196m52.826s): https://drive.google.com/file/d/1abcnW0ajZOjYasN-odmnSdrxlTjhIxKu/view?usp=sharing
Build: https://drive.google.com/file/d/17Z7-hNsjyTSnHXWd8uQFnJkqHLMeo2-C/view?usp=sharing

I installed tensorflow 1.12.0 from pip before building.

ergouy commented 3 years ago

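@dewang23 Buddy, do you have an installation guide? After installing by following the official tutorial I always end up with something missing. Any pointers would be appreciated!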

alinamimi commented 3 years ago

We implemented only the basic version of RGCN, whose cost depends on the number of relations, so it is relatively slow. The paper describes many speed optimizations.

dewang23 commented 3 years ago

You mean the RGCN paper, right? Do you mean Sec.2.2 Regularization?
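For context, Sec. 2.2 of the RGCN paper describes basis decomposition, where every relation's weight matrix is a linear combination of a small set of shared basis matrices, W_r = sum_b a_rb * V_b. The thread does not confirm that this is the optimization meant, and fewer parameters do not by themselves guarantee faster training. The sketch below is plain Python with hypothetical names, not Euler's implementation; it composes per-relation weights from bases and compares parameter counts for the 67 relations mentioned above, with a hidden size of 128 and 4 bases chosen only for illustration.

```python
def compose_relation_weights(bases, coeffs):
    """Build W_r = sum_b coeffs[r][b] * bases[b] for every relation r.

    bases:  list of B matrices, each d_in x d_out (nested lists)
    coeffs: list of R rows, each holding B scalar coefficients
    """
    d_in, d_out = len(bases[0]), len(bases[0][0])
    weights = []
    for row in coeffs:
        w = [
            [sum(a * v[i][j] for a, v in zip(row, bases)) for j in range(d_out)]
            for i in range(d_in)
        ]
        weights.append(w)
    return weights


def param_counts(num_relations, d_in, d_out, num_bases):
    """Return (full, basis) parameter counts for one RGCN layer."""
    full = num_relations * d_in * d_out  # one free matrix per relation
    basis = num_bases * d_in * d_out + num_relations * num_bases
    return full, basis


# Tiny worked example: two 2x2 bases, one relation with coefficients (2, 3).
bases = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
print(compose_relation_weights(bases, [[2, 3]]))

# 67 relations as in the thread; dimension 128 and 4 bases are assumptions.
print(param_counts(67, 128, 128, 4))
```

With these illustrative sizes the shared bases cut the weight parameters by more than an order of magnitude, which is why the paper presents the decomposition as a remedy for models with many relation types.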

alibaba / euler

Euler2 RGCN speed #282