Open Kublai-Jing opened 9 years ago
Hi,
This is a good catch. In fact, we do have a plan to add operators like this, but they are not at the top of our list right now.
For LSTM, it is pretty slow if you use lots of small vectors, due to the fixed overhead of each CUDA call. Currently there is no good solution to that; it seems to be a limitation of Minerva. You could do a couple of things to speed up your program; one is to run the initialization on the CPU, as in the snippet further down in this thread.
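To illustrate the per-call overhead, here is a minimal sketch reusing the NArray calls that appear later in this thread (it assumes, per the explanation above, that each small Randn pays a fixed CUDA dispatch cost while one large Randn is a single call):

const int N = 600000;
const int D = 128;

// Slow: N tiny operations, each paying the fixed per-call overhead.
vector<NArray> rows(N, NArray::Zeros({1, D}));
for (int i = 0; i < N; ++i) {
  rows[i] = NArray::Randn({1, D}, 0, 0.05);
}

// Much cheaper: a single operation over the whole N x D matrix.
NArray all = NArray::Randn({N, D}, 0, 0.05);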
BTW, could you share your code with us? We would be very glad to take a look and try it ourselves to understand how our system behaves.
Best, Minjie
Thanks Minjie,
Thanks, Jing
auto& ms = MinervaSystem::Instance();
uint64_t cpu = ms.CreateCpuDevice();
uint64_t gpu = ms.CreateGpuDevice(0);
ms.SetDevice(cpu); // the initialization runs on CPU
int N = 600000;
int D = 128;
vector<NArray> A(N, NArray::Zeros({1, D}));
for (int i = 0; i < N; ++i) {
  A[i] = NArray::Randn({1, D}, 0, 0.05);
}
ms.SetDevice(gpu); // other computations run on GPU
...
Best, Minjie
The code is actually just:
#include <minerva.h>
#include <cstring>
#include <cstdlib>
#include <ctime>    // for time(), used to seed srand
#include <iostream>
#include <chrono>
using namespace minerva;
using namespace std;
int main(int argc, char** argv) {
  srand(time(NULL));
  MinervaSystem::Initialize(&argc, &argv);
  MinervaSystem& ms = MinervaSystem::Instance();
  uint64_t gpuDevice = ms.CreateGpuDevice(0);
  ms.SetDevice(gpuDevice);
  int N = 600000;
  int D = 200;
  vector<NArray> V(N, NArray::Zeros({1, D}));
  for (int i = 0; i < N; ++i) {
    V[i] = NArray::Randn({1, D}, 0, 0.01);
  }
  cerr << "syncing..." << endl;
  ms.WaitForAll(); // takes long
  cerr << "synced!!" << endl;
  return 0;
}
compiled with
g++ -std=c++11 -DHAS_CUDA -O3 -fopenmp -I/usr/local/cuda-6.5/include -Iminerva/minerva -lminerva -lgomp -lcudnn test.cpp -o main
I am using CentOS, and the GPU is a K40.
The LSTM layer is a normal size, about 200 units, but the vocabulary is large (600K, as I said), and one more thing differs from a normal LSTM for language modelling: in my problem there can be more than one word at each time step.
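To make that concrete, here is a minimal plain-C++ sketch of one way to feed several words into a single time step by summing their embedding rows (the summing is just an assumption for illustration; E, StepInput, and the row-major layout are made up for this sketch):

#include <vector>
using namespace std;

// E is an N x D embedding table stored row-major; `words` are the word ids
// appearing at one time step. Returns their summed embedding as the step input.
vector<float> StepInput(const vector<float>& E, const vector<int>& words, int D) {
  vector<float> x(D, 0.0f);
  for (int w : words) {             // more than one word per time step
    for (int j = 0; j < D; ++j) {
      x[j] += E[w * D + j];         // add word w's embedding row
    }
  }
  return x;
}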
What I really want to do is parameter updates for an LSTM. I've realized that my vocabulary is relatively large (about 500K), and having a vector of NArrays is very inefficient (for reasons I don't know). When trying to sync (by calling WaitForAll) after the following code:
int N = 600000;
int D = 128;
vector<NArray> A(N, NArray::Zeros({1, D}));
for (int i = 0; i < N; ++i) {
  A[i] = NArray::Randn({1, D}, 0, 0.05);
}
// calling ms.WaitForAll() here takes a long, long time...
it takes a very long time (maybe because creating the NArrays one at a time makes a lot of malloc calls on the GPU, which is slow?). So instead I am thinking about having one giant 2D matrix and doing the following:
NArray A = NArray::Randn({N, D}, 0, 0.01);
NArray b = NArray::Randn({1, D}, 0, 1);
A[10] = A[10] + b; // update one row
It seems that this feature is not supported? Any suggestions or comments are very much appreciated...
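For concreteness, the row update I am after amounts to the following (a plain C++ sketch over a row-major buffer, not Minerva API; AddToRow is a made-up name, and on the GPU this should be a single small kernel over D elements):

#include <vector>
using namespace std;

// Adds b (length D) in place to row `row` of the row-major N x D matrix A.
// This is what A[10] = A[10] + b in the snippet above would do.
void AddToRow(vector<float>& A, const vector<float>& b, int row, int D) {
  for (int j = 0; j < D; ++j) {
    A[row * D + j] += b[j];
  }
}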