Interleave IO operations with kernel calculation

I've been thinking about this. To overlap these operations you do indeed need to use streams, so that computation in one stream can overlap with data transfers in other streams. Using multiple host threads, one per stream, might be enough. However, I know that in a single-threaded application the host memory must be allocated in a way (that is, page-locked) that lets the cudaMemcpy operations be performed by DMA. That is the only way to make the async API calls truly asynchronous with respect to the host.
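To make the single-threaded case concrete, here is a minimal sketch of what I mean, not code taken from this repository. The kernel `scale`, the chunk size and the number of streams are made up for illustration; the runtime calls (`cudaMallocHost`, `cudaStreamCreate`, `cudaMemcpyAsync`) are the standard ones.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: scales each element in place.
__global__ void scale(double *data, int n, double factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      std::fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                   cudaGetErrorString(err_), __LINE__);              \
      std::exit(EXIT_FAILURE);                                       \
    }                                                                \
  } while (0)

int main() {
  const int n_streams = 2;    // assumption: two chunks, one stream each
  const int chunk = 1 << 20;  // assumption: elements per chunk
  const size_t bytes = chunk * sizeof(double);

  double *host = nullptr;
  double *dev = nullptr;
  // Page-locked host allocation, so the async copies can be done by DMA.
  CHECK(cudaMallocHost(&host, n_streams * bytes));
  CHECK(cudaMalloc(&dev, n_streams * bytes));
  for (int i = 0; i < n_streams * chunk; ++i) host[i] = 1.0;

  cudaStream_t streams[n_streams];
  for (int s = 0; s < n_streams; ++s) CHECK(cudaStreamCreate(&streams[s]));

  // Each stream gets its own copy-in, kernel, copy-out sequence. Work in
  // different streams is allowed to overlap; work within a stream is ordered.
  for (int s = 0; s < n_streams; ++s) {
    double *h = host + s * chunk;
    double *d = dev + s * chunk;
    CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[s]));
    scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d, chunk, 2.0);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[s]));
  }

  // The host thread only waits here, after all work has been queued.
  CHECK(cudaDeviceSynchronize());
  std::printf("host[0] = %f\n", host[0]);

  for (int s = 0; s < n_streams; ++s) CHECK(cudaStreamDestroy(streams[s]));
  CHECK(cudaFree(dev));
  CHECK(cudaFreeHost(host));
  return 0;
}
```

Whether the copies and kernels actually overlap depends on the device (number of copy engines) and on the sizes involved, so it is worth checking the timeline in a profiler rather than assuming it.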

Perhaps you won't need it, because you will be using multiple threads, in which case it might not hurt performance when a CPU thread blocks on cudaMemcpyAsync. But if you don't see any overlap between copies in one stream and copies (in the opposite direction) or computations in other streams, then this could be the cause. I also expect the achieved bandwidth of cudaMemcpy to increase significantly if you allocate host memory that is page-locked and aligned. Depending on how Eigen is coded, though, it might require modifying Eigen to really achieve this; I haven't checked that.
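One option that might avoid touching Eigen's allocator is to page-lock the matrix storage after the fact with `cudaHostRegister`. The sketch below is untested against this code base and assumes a plain `Eigen::MatrixXd` whose buffer is not resized or freed while it is registered; the matrix size is arbitrary. It does nothing about alignment beyond what Eigen already provides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <Eigen/Dense>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      std::fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                   cudaGetErrorString(err_), __LINE__);              \
      std::exit(EXIT_FAILURE);                                       \
    }                                                                \
  } while (0)

int main() {
  // Ordinary Eigen matrix, allocated by Eigen in pageable memory.
  Eigen::MatrixXd m = Eigen::MatrixXd::Random(2048, 2048);  // arbitrary size
  const size_t bytes = static_cast<size_t>(m.size()) * sizeof(double);

  // Page-lock the existing buffer in place instead of reallocating it.
  CHECK(cudaHostRegister(m.data(), bytes, cudaHostRegisterDefault));

  double *dev = nullptr;
  cudaStream_t stream;
  CHECK(cudaMalloc(&dev, bytes));
  CHECK(cudaStreamCreate(&stream));

  // With the buffer registered, this copy can be queued without blocking
  // the host thread until the transfer completes.
  CHECK(cudaMemcpyAsync(dev, m.data(), bytes, cudaMemcpyHostToDevice, stream));
  CHECK(cudaStreamSynchronize(stream));

  // Unregister before the matrix is destroyed or resized.
  CHECK(cudaHostUnregister(m.data()));
  CHECK(cudaStreamDestroy(stream));
  CHECK(cudaFree(dev));
  return 0;
}
```

Registering is not free, so this probably only pays off for matrices that are transferred more than once or that are large enough for the bandwidth gain to matter.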

NLESC-JCER / EigenCuda

Interleave IO operations with kernel calculation #23