dmlc / minerva

Minerva: a fast and flexible tool for deep learning on multi-GPU. It provides an ndarray programming interface, just like NumPy. Python and C++ bindings are both available. The resulting code can run on CPU or GPU, and multi-GPU support is very easy.

indexing reference of NArray? #37

Open · Kublai-Jing opened this issue 9 years ago

Kublai-Jing commented 9 years ago

What I really want to do is update the parameters of an LSTM. My vocabulary is relatively large (about 500K), and keeping the parameters as a vector of NArray is very inefficient (though I don't know why). When I try to sync (by calling WaitForAll) after the following code:

int N = 600000;
int D = 128;
vector<NArray> A = vector<NArray>(N, NArray::Zeros({1,D}));
for (int i = 0; i < N; i++) {
    A[i] = NArray::Randn({1,D}, 0, 0.05);
}
// calling ms.WaitForAll() here takes a long long time...

it takes a very long time (maybe because pushing the NArrays into the vector one at a time triggers a lot of malloc calls on the GPU, which is slow?). So instead I am thinking about having one giant 2D matrix and doing the following:

NArray A = NArray::Randn({N,D}, 0, 0.01);
NArray b = NArray::Randn({1,D}, 0, 1);
A[10] = A[10] + b; // update a

It seems that this feature is not supported? Any suggestion or comment would be very appreciated...

jermainewang commented 9 years ago

Hi,

This is a good catch. In fact, we do plan to add this kind of operator, but it is not at the top of our list right now.

For LSTM, using lots of small vectors is pretty slow due to the inherent overhead of each CUDA call. So currently there is no good solution to that, and it is a limitation of minerva. You could do two things to speed up your program:

  1. One is what you suggested: use one big 2D matrix instead of lots of small vectors, which requires the additional indexing operators (see the sketch after this list).
  2. The other is to use the CPU instead. The CPU is not bad in this situation; you could try the CPU device or just use numpy.
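
To make the first option concrete, the initialization side would look roughly like this (just a sketch built from your own snippet, using only calls that already exist):

// one big {N, D} parameter matrix: a single allocation and one Randn call,
// instead of N separate 1 x D NArrays, each of which costs its own CUDA launch
int N = 600000;
int D = 128;
NArray W = NArray::Randn({N, D}, 0, 0.05);
ms.WaitForAll(); // syncing one bulk operation is cheap compared to 600K small ones

The missing piece is exactly the row update you wrote (A[10] = A[10] + b), which is the indexing operator we would still need to add.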

BTW, could you share your code with us? We would be very glad to take a look and try it ourselves to better understand our system.

Best, Minjie

Kublai-Jing commented 9 years ago

Thanks Minjie,

  1. By using the CPU device, do you mean doing the same operator[] on the CPU? I've actually tried that, but it seems that operator[] now returns a value, not a reference, so it can't actually update the parameters we want it to.
  2. For now, my workaround is to hold the entire 2D parameter matrix in a plain float* structure on the CPU and convert to and from NArray when necessary (a rough sketch of what I mean follows this list). This is not optimal, since we have to copy data back and forth between CPU and GPU, but it's the first thing that came to mind. After all, the most computationally expensive part should be the middle layer of the LSTM (where the dense mat-mat multiplications happen), and if that runs on the GPU we should still get some benefit from minerva. Any suggestion on this is appreciated.
  3. I'd be more than happy to share a code snippet sometime later : )
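
To be concrete about point 2, this is roughly the shape of what I mean (a sketch only, not my actual code; GpuLstmStep is a made-up stand-in for the minerva-based middle layer, since I haven't settled on how to move the raw float buffers into and out of NArray yet):

#include <algorithm>
#include <vector>

// Stand-in for the dense LSTM math that would run on the GPU via minerva.
// Placeholder only: the real version would build NArrays from the gathered
// buffer, do the mat-mat multiplications, and return gradients for the rows.
void GpuLstmStep(const std::vector<float>& x, int B, int D,
                 std::vector<float>* grad_x) {
  grad_x->assign(static_cast<size_t>(B) * D, 0.0f);
}

// One step: gather the needed embedding rows from the big CPU matrix,
// run the dense part, then scatter the per-row updates back.
void Step(std::vector<float>& W, int D,
          const std::vector<int>& ids,  // word ids active at this time step
          float lr) {
  const int B = static_cast<int>(ids.size());
  std::vector<float> x(static_cast<size_t>(B) * D), gx;

  // gather: copy the rows we need into one contiguous buffer
  for (int b = 0; b < B; ++b)
    std::copy_n(&W[static_cast<size_t>(ids[b]) * D], D,
                &x[static_cast<size_t>(b) * D]);

  GpuLstmStep(x, B, D, &gx);  // the computationally heavy part

  // scatter: apply the per-row updates back into the big CPU matrix
  for (int b = 0; b < B; ++b)
    for (int j = 0; j < D; ++j)
      W[static_cast<size_t>(ids[b]) * D + j] -=
          lr * gx[static_cast<size_t>(b) * D + j];
}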

Thanks, Jing

jermainewang commented 9 years ago
auto cpu = MinervaSystem::CreateCpuDevice();
auto gpu = MinervaSystem::CreateGpuDevice(0);
MinervaSystem::SetDevice(cpu); // the initialization runs on CPU
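// (with the CPU device set, each Randn below is just a cheap CPU operation
//  rather than a separate CUDA call, which is what made WaitForAll so slow)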
int N = 600000;
int D = 128;
vector<NArray> A = vector<NArray>(N,NArray::Zeros({1,D}));
for(int i = 0 ; i < N ; i ++){
    A[i] = NArray::Randn({1,D},0,0.05);
}
MinervaSystem::SetDevice(gpu); // other computations run on GPU
...

Best, Minjie

Kublai-Jing commented 9 years ago

The code is actually just:

#include <minerva.h>
#include <cstring>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <chrono>
using namespace minerva; 
using namespace std;
int main(int argc, char ** argv){
    srand (time(NULL));
    MinervaSystem::Initialize(&argc, &argv);
    MinervaSystem &ms = MinervaSystem::Instance();
    uint64_t gpuDevice = ms.CreateGpuDevice(0);  // everything below runs on the GPU
    ms.SetDevice(gpuDevice);
    int N = 600000;
    int D = 200;
    vector<NArray>V(N, NArray::Zeros({1,D}));
    for(int i = 0 ; i < N ; i++){
      V[i] = NArray::Randn({1,D},0,0.01);
    }
    cerr <<"syncing..."<<endl;
    ms.WaitForAll(); // takes long
    cerr <<"synced!!"<<endl;
}

compiled with

g++ -std=c++11 -DHAS_CUDA -O3 -fopenmp -I/usr/local/cuda-6.5/include -Iminerva/minerva -lminerva -lgomp -lcudnn  test.cpp -o main

I am using CentOS and the GPU is a K40.

The LSTM layer is a normal size, about 200 units, but the vocabulary is large (600K, as I said). One more thing differs from a normal LSTM for language modelling: in my problem there can be more than one word at each time step.