dmccloskey / EvoNet

Refactor for GPU optimization use of Tensors #71

Status: Closed (dmccloskey closed 5 years ago)

dmccloskey commented 5 years ago

Description

The current implementation of forward and backward propagation cannot be scaled efficiently to a GPU because the model is structured as an "array of structs". The Operation structs will need to be refactored into a "struct of arrays" so that matrix-matrix multiplication (instead of vector-vector dot products) can be used on the GPU. The model class and all associated Tensor operation classes will need to be refactored to suit GPU hardware.
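To make the layout change concrete, here is a minimal host-side C++ sketch of the same forward pass in both forms. All names (`LinkAoS`, `LayerSoA`, `to_soa`, `forward_aos`, `forward_soa`) are hypothetical and are not EvoNet's actual classes. The dense row-major weight matrix is an assumption, and the triple loop only stands in for the GEMM call that a GPU tensor library would provide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structs: one object per link, processed one at a time.
// (Hypothetical type for illustration; not EvoNet's Operation struct.)
struct LinkAoS {
    std::size_t source;  // index of the source node
    std::size_t sink;    // index of the sink node
    float weight;
};

// AoS forward pass for a single sample: one scalar multiply-add per link.
std::vector<float> forward_aos(const std::vector<LinkAoS>& links,
                               const std::vector<float>& input,
                               std::size_t n_sinks) {
    std::vector<float> out(n_sinks, 0.0f);
    for (const LinkAoS& l : links) {
        assert(l.source < input.size() && l.sink < n_sinks);
        out[l.sink] += l.weight * input[l.source];
    }
    return out;
}

// Struct of arrays: the whole layer lives in one contiguous buffer.
struct LayerSoA {
    std::size_t n_sources = 0;
    std::size_t n_sinks = 0;
    std::vector<float> weights;  // row-major, n_sources x n_sinks
};

// Pack the links into a dense weight matrix; absent links stay at 0.
LayerSoA to_soa(const std::vector<LinkAoS>& links,
                std::size_t n_sources, std::size_t n_sinks) {
    LayerSoA layer;
    layer.n_sources = n_sources;
    layer.n_sinks = n_sinks;
    layer.weights.assign(n_sources * n_sinks, 0.0f);
    for (const LinkAoS& l : links) {
        assert(l.source < n_sources && l.sink < n_sinks);
        layer.weights[l.source * n_sinks + l.sink] += l.weight;
    }
    return layer;
}

// SoA forward pass for a whole batch as a matrix-matrix product:
// out[batch x n_sinks] = input[batch x n_sources] * weights.
std::vector<float> forward_soa(const LayerSoA& layer,
                               const std::vector<float>& input,
                               std::size_t batch) {
    assert(input.size() == batch * layer.n_sources);
    std::vector<float> out(batch * layer.n_sinks, 0.0f);
    for (std::size_t b = 0; b < batch; ++b)
        for (std::size_t i = 0; i < layer.n_sources; ++i)
            for (std::size_t j = 0; j < layer.n_sinks; ++j)
                out[b * layer.n_sinks + j] +=
                    input[b * layer.n_sources + i] *
                    layer.weights[i * layer.n_sinks + j];
    return out;
}
```

Calling `to_soa` once per layer and then `forward_soa` on a batch gives the same values as calling `forward_aos` per sample, but the work becomes a single dense product over contiguous memory, which is the shape of computation a GPU handles well.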

Optimizations

Goals

Notes from Hans at DTU compute

References

dmccloskey commented 5 years ago

references

https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_18e2fe2a3b264901816874516af12a097

notes from slides

  1. create streams until the limit is reached or an error is thrown by the device
  2. use asyncEngineCount to test whether the device can concurrently copy memory and execute a kernel

dmccloskey commented 5 years ago

Pinned vs unified memory

References:

dmccloskey commented 5 years ago

shared_ptr and cudaMalloc

dmccloskey commented 5 years ago

Results of CUDA testing

dmccloskey commented 5 years ago

Strategies to meet optimal GPU utilization

Common