Open mxmlnkn opened 8 years ago
This would speed up the library indeed. The main question is, after allocating the memory (probably the best place for this would be the task queue), should we pass individual pointers to each buffer or one large memory block? I guess the first option would be the harder one to implement, but both versions should be benchmarked. That said, looking at the benchmarks, the mallocs may not be that much of a performance issue, and even if they were, the stream parallelism should cover it.
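To make the two options concrete, the calling conventions could look roughly like this. This is a hypothetical sketch; the parameter names and types are invented for illustration and are not the actual shrink-wrap API:

```cuda
// Option 1: the task queue hands out individual, purpose-named buffers.
// Harder to implement: every intermediate buffer needs its own slot in
// the queue, and the signature changes whenever a buffer is added.
void cudaShrinkWrapBatch( float* dpData, float* dpMask /* , ... */ );

// Option 2: the task queue hands out one large block and the callee
// carves it up internally using offsets. Simpler interface, but the
// callee must know the layout and the total size must be precomputed.
void cudaShrinkWrapBatch( void* dpWorkspace, size_t workspaceBytes /* , ... */ );
```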
We could save some cudaMallocs and cudaFrees if `taskQueue.cu` did the cudaMalloc in the initializer call where it also creates the work thread list. The pointers to the memory locations, or the one large memory location, could then be given to shrink wrap, which in the current version calls cudaMalloc and cudaFree each time. It would make `cudaShrinkWrap` harder to call, so I would prefer to copy-paste it to `cudaShrinkWrapBatch`, which could be called by the former after it allocates the needed memory.
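The proposed split could be sketched as follows. Only `cudaShrinkWrap` and `cudaShrinkWrapBatch` are names from this issue; the struct, the helper, and all parameters are assumptions made for illustration:

```cuda
#include <cuda_runtime.h>

// Hypothetical bundle of pre-allocated device buffers, owned by the
// task queue and created once in its initializer next to the work
// thread list setup.
struct ShrinkWrapBuffers
{
    float* dpData;
    float* dpMask;
    size_t nElements;
};

ShrinkWrapBuffers allocateShrinkWrapBuffers( size_t nElements )
{
    ShrinkWrapBuffers buffers;
    buffers.nElements = nElements;
    cudaMalloc( (void**) &buffers.dpData, nElements * sizeof( float ) );
    cudaMalloc( (void**) &buffers.dpMask, nElements * sizeof( float ) );
    return buffers;
}

// The batch version works purely on memory handed to it and therefore
// never calls cudaMalloc or cudaFree itself (declaration only).
void cudaShrinkWrapBatch( ShrinkWrapBuffers const& buffers /* , ... */ );

// The old entry point keeps its easy-to-call signature: it allocates,
// delegates to the batch version, then frees. Repeated callers (e.g.
// the task queue) skip this wrapper and reuse their buffers instead.
void cudaShrinkWrap( size_t nElements /* , ... */ )
{
    ShrinkWrapBuffers buffers = allocateShrinkWrapBuffers( nElements );
    cudaShrinkWrapBatch( buffers /* , ... */ );
    cudaFree( buffers.dpData );
    cudaFree( buffers.dpMask );
}
```

With this shape the per-call allocation cost is paid only by callers of the convenience wrapper, while the task queue amortizes one allocation over many tasks.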