The current implementation of forward and backward propagation cannot be scaled efficiently to a GPU because the model is structured as an "array of structs". The Operation structs will need to be refactored into a "struct of arrays" so that matrix-matrix multiplications (instead of vector-vector dot products) can be used on the GPU. The model class and all associated Tensor operation classes will need to be refactored for GPU hardware optimization.
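A minimal CPU-side sketch of the layout change described above. The struct and function names are hypothetical, not the project's actual classes; the point is that the "array of structs" layout forces one small dot product per operation, while the "struct of arrays" layout packs all weights into one contiguous matrix that a single matrix product (here a matrix-vector product, for brevity) can consume.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical "array of structs" layout: each Operation owns its own
// weight and input vectors, so F-prop becomes many small dot products.
struct OperationAoS {
  std::vector<float> weights;
  std::vector<float> inputs;
};

float forward_aos(const OperationAoS& op) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < op.weights.size(); ++i)
    sum += op.weights[i] * op.inputs[i];  // vector-vector dot product
  return sum;
}

// Hypothetical "struct of arrays" layout: all weights and inputs are packed
// into contiguous arrays so the whole layer is evaluated as one matrix
// product instead of per-operation dot products.
struct LayerSoA {
  std::size_t rows, cols;      // rows = operations, cols = fan-in
  std::vector<float> weights;  // rows x cols, row-major
  std::vector<float> inputs;   // cols
};

std::vector<float> forward_soa(const LayerSoA& layer) {
  std::vector<float> out(layer.rows, 0.0f);
  for (std::size_t r = 0; r < layer.rows; ++r)
    for (std::size_t c = 0; c < layer.cols; ++c)
      out[r] += layer.weights[r * layer.cols + c] * layer.inputs[c];
  return out;
}
```

On the GPU the SoA version maps onto a single batched GEMM-style kernel, whereas the AoS version would require one small launch per operation.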
Optimizations
Minimal memory allocation/deallocation calls
Minimal memcopy calls
Parsimony with the amount of memory stored on the device
TensorContraction-Product/Mean/VarMod/Max/Count
May involve restricting the structure of the graphs and types of Evolution operations that can be performed
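One way to approach the first two optimizations above is a reusable workspace that is allocated once and handed out across forward/backward passes, so steady-state passes make no allocation or deallocation calls. This is a hedged sketch with a hypothetical `Workspace` class, not the project's actual memory manager.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical workspace: allocate once, reuse across passes. In steady
// state request(...) performs no allocation; it only grows (amortized)
// when a larger size than ever seen before is requested.
class Workspace {
 public:
  explicit Workspace(std::size_t capacity) : buffer_(capacity) {}

  // Returns a pointer into the preallocated buffer for a scratch region
  // of the requested size, resizing only if the request exceeds capacity.
  float* request(std::size_t size) {
    if (size > buffer_.size()) buffer_.resize(size);  // rare, amortized
    return buffer_.data();
  }

  std::size_t capacity() const { return buffer_.size(); }

 private:
  std::vector<float> buffer_;
};
```

The same pattern applies to device memory: a single large device allocation with offsets handed out per tensor avoids repeated cudaMalloc/cudaFree calls.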
Goals
algorithm for converting graphs to tensor operations
maximization of GPU acceleration
compatible with evolution algorithm
Notes from Hans at DTU Compute
Implement a "global matrix"
keep related values close in memory rather than strictly contiguous
node keeps the indexes of the position in the global matrices
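The "global matrix" idea above can be sketched as follows; the `Node` and `GlobalMatrix` names are hypothetical illustrations, not the project's actual types. All node values live in one shared buffer, and each node only records its index into that buffer, so related values can be placed near each other in memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical global value matrix: every node's activation lives in one
// shared buffer; nodes store only indexes into it, as suggested in the
// notes above.
struct Node {
  std::size_t value_index;  // position of this node's value in the global matrix
};

struct GlobalMatrix {
  std::vector<float> values;

  // Appends a value and returns the index the owning node should keep.
  std::size_t add_node_value(float v) {
    values.push_back(v);
    return values.size() - 1;
  }

  float& at(const Node& n) { return values[n.value_index]; }
};
```

Because nodes hold indexes rather than pointers or owned storage, the global buffer can be reordered, resized, or copied to the device as a single block without touching the graph structure.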
In order of decreasing time to execute: 1) memory allocation (particularly host memory allocation), 2) memcopy, 3) kernel launches
All calls to device(...) will launch a new CUDA kernel. Therefore it is important to launch as few kernels as possible, each with the maximum amount of work.
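A CPU analogy for the "few launches, maximum work" rule (an assumed illustration, not the project's GPU code): each loop below stands in for one device(...) launch. Fusing the two element-wise operations into a single pass mirrors combining several tensor expressions into one device(...) call.

```cpp
#include <algorithm>
#include <vector>

// Two "launches": one pass for the affine transform, one for the ReLU.
std::vector<float> two_passes(std::vector<float> x, float a, float b) {
  for (float& v : x) v = a * v + b;          // launch 1
  for (float& v : x) v = std::max(v, 0.0f);  // launch 2
  return x;
}

// One fused "launch": both operations applied per element in one pass,
// halving the launch count while doing the same total work.
std::vector<float> fused_pass(std::vector<float> x, float a, float b) {
  for (float& v : x) v = std::max(a * v + b, 0.0f);  // single launch
  return x;
}
```

The outputs are identical; only the number of passes (launches) differs, which is exactly the overhead the note above says to minimize.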
Eigen::GpuStreamDevice will destroy the stream once it goes out of scope
Overlapping kernels can only be achieved between kernel launches and memcopies by manually managing the streams and using pinned host memory
The CUDA Thrust library can be used for asynchronous loop calls, but if a tensor is involved, only minimal performance gains will be realized because individual calls to device(...) for each tensor operation (and hence individual kernel launches) will still be made. However, this may "unblock" the CPU to do other tasks.