The current implementation of forward and backward propagation cannot be scaled efficiently to a GPU because the model is structured as an "array of structs". The Operation structs will need to be refactored into a "struct of arrays" so that matrix-matrix multiplications (instead of vector-vector dot products) can be used on the GPU. The model class and all associated Tensor operation classes will need to be refactored for GPU hardware optimization.
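A minimal CPU-side sketch of the layout change described above. The struct and function names are hypothetical, not the project's actual classes; the point is that the "array of structs" layout forces one small dot product per operation, while the "struct of arrays" layout packs all weights into one contiguous matrix that a single matrix product (here a matrix-vector product, for brevity) can consume.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical "array of structs" layout: each Operation owns its own
// weight and input vectors, so F-prop becomes many small dot products.
struct OperationAoS {
  std::vector<float> weights;
  std::vector<float> inputs;
};

float forward_aos(const OperationAoS& op) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < op.weights.size(); ++i)
    sum += op.weights[i] * op.inputs[i];  // vector-vector dot product
  return sum;
}

// Hypothetical "struct of arrays" layout: all weights and inputs are packed
// into contiguous arrays so the whole layer is evaluated as one matrix
// product instead of per-operation dot products.
struct LayerSoA {
  std::size_t rows, cols;      // rows = operations, cols = fan-in
  std::vector<float> weights;  // rows x cols, row-major
  std::vector<float> inputs;   // cols
};

std::vector<float> forward_soa(const LayerSoA& layer) {
  std::vector<float> out(layer.rows, 0.0f);
  for (std::size_t r = 0; r < layer.rows; ++r)
    for (std::size_t c = 0; c < layer.cols; ++c)
      out[r] += layer.weights[r * layer.cols + c] * layer.inputs[c];
  return out;
}
```

On the GPU the SoA version maps onto a single batched GEMM-style kernel, whereas the AoS version would require one small launch per operation.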
Optimizations
Minimal memory allocation/deallocation calls
Minimal memcopy calls
Parsimony with the amount of memory stored on the device
TensorContraction-Product/Mean/VarMod/Max/Count
May involve restricting the structure of the graphs and types of Evolution operations that can be performed
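One way to approach the first two optimizations above is a reusable workspace that is allocated once and handed out across forward/backward passes, so steady-state passes make no allocation or deallocation calls. This is a hedged sketch with a hypothetical `Workspace` class, not the project's actual memory manager.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical workspace: allocate once, reuse across passes. In steady
// state request(...) performs no allocation; it only grows (amortized)
// when a larger size than ever seen before is requested.
class Workspace {
 public:
  explicit Workspace(std::size_t capacity) : buffer_(capacity) {}

  // Returns a pointer into the preallocated buffer for a scratch region
  // of the requested size, resizing only if the request exceeds capacity.
  float* request(std::size_t size) {
    if (size > buffer_.size()) buffer_.resize(size);  // rare, amortized
    return buffer_.data();
  }

  std::size_t capacity() const { return buffer_.size(); }

 private:
  std::vector<float> buffer_;
};
```

The same pattern applies to device memory: a single large device allocation with offsets handed out per tensor avoids repeated cudaMalloc/cudaFree calls.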
Goals
algorithm for converting graphs to tensor operations
maximization of GPU acceleration
compatible with evolution algorithm
Notes from Hans at DTU Compute
Implement a "global matrix"
keep related values close in memory rather than strictly contiguous
node keeps the indexes of the position in the global matrices
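The "global matrix" idea above can be sketched as follows; the `Node` and `GlobalMatrix` names are hypothetical illustrations, not the project's actual types. All node values live in one shared buffer, and each node only records its index into that buffer, so related values can be placed near each other in memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical global value matrix: every node's activation lives in one
// shared buffer; nodes store only indexes into it, as suggested in the
// notes above.
struct Node {
  std::size_t value_index;  // position of this node's value in the global matrix
};

struct GlobalMatrix {
  std::vector<float> values;

  // Appends a value and returns the index the owning node should keep.
  std::size_t add_node_value(float v) {
    values.push_back(v);
    return values.size() - 1;
  }

  float& at(const Node& n) { return values[n.value_index]; }
};
```

Because nodes hold indexes rather than pointers or owned storage, the global buffer can be reordered, resized, or copied to the device as a single block without touching the graph structure.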
In order of decreasing time to execute: 1) memory allocation (particularly host memory allocation), 2) memcopy, 3) kernel launches
All calls to device(...) will launch a new CUDA kernel. Therefore it is important to launch as few kernels as possible, each with the maximum amount of work.
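A CPU analogy for the "few launches, maximum work" rule (an assumed illustration, not the project's GPU code): each loop below stands in for one device(...) launch. Fusing the two element-wise operations into a single pass mirrors combining several tensor expressions into one device(...) call.

```cpp
#include <algorithm>
#include <vector>

// Two "launches": one pass for the affine transform, one for the ReLU.
std::vector<float> two_passes(std::vector<float> x, float a, float b) {
  for (float& v : x) v = a * v + b;          // launch 1
  for (float& v : x) v = std::max(v, 0.0f);  // launch 2
  return x;
}

// One fused "launch": both operations applied per element in one pass,
// halving the launch count while doing the same total work.
std::vector<float> fused_pass(std::vector<float> x, float a, float b) {
  for (float& v : x) v = std::max(a * v + b, 0.0f);  // single launch
  return x;
}
```

The outputs are identical; only the number of passes (launches) differs, which is exactly the overhead the note above says to minimize.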
Eigen::GpuStreamDevice will destroy the stream once it goes out of scope
Overlapping kernels can only be achieved between kernel launches and memcopies by manually managing the streams and using pinned host memory
The CUDA Thrust library can be used for asynchronous loop calls, but if a tensor is involved, only minimal performance gains will be realized because individual calls to device(...) for each tensor operation (and hence individual kernel launches) will still be made. However, this may "unblock" the CPU to do other tasks.