Currently, in #6, all operations are issued to the default stream. However, I was thinking that we could issue the kernels of independent operations to non-default streams so that they can execute in parallel.
One example is filling n vectors in parallel, with fill_vector_kernel launched in n separate streams. Another is filling an n*m matrix with n or m kernels launched in separate streams.
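To make the first use case concrete, here is a minimal sketch of filling n vectors with one kernel launch per stream. It is not the library's code: the fill_vector_kernel signature, the helper fill_vectors_in_streams, and the use of raw float buffers instead of the library's Vector type are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the library's fill_vector_kernel;
// the real kernel's signature may differ.
__global__ void fill_vector_kernel(float* data, size_t size, float value) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < size) data[i] = value;
}

// Hypothetical helper: fills n device vectors of `size` floats, launching
// one kernel per non-default stream. Vector k is filled with the value k.
// Returns true if every CUDA call succeeded and every element reads back correctly.
bool fill_vectors_in_streams(int n, size_t size) {
    std::vector<cudaStream_t> streams(n);
    std::vector<float*> vecs(n, nullptr);
    bool ok = true;

    for (int k = 0; k < n; ++k) {
        ok = ok && cudaStreamCreate(&streams[k]) == cudaSuccess;
        ok = ok && cudaMalloc(&vecs[k], size * sizeof(float)) == cudaSuccess;
    }

    const unsigned threads = 256;
    const unsigned blocks = static_cast<unsigned>((size + threads - 1) / threads);
    if (ok) {
        // Kernels in different streams have no ordering between them,
        // so the device is free to run them concurrently.
        for (int k = 0; k < n; ++k) {
            fill_vector_kernel<<<blocks, threads, 0, streams[k]>>>(
                vecs[k], size, static_cast<float>(k));
        }
        ok = cudaGetLastError() == cudaSuccess;
        // Wait for all streams to finish before reading the results.
        ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    }

    // Copy each vector back to the host and check its contents.
    std::vector<float> host(size);
    for (int k = 0; ok && k < n; ++k) {
        ok = cudaMemcpy(host.data(), vecs[k], size * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (size_t i = 0; ok && i < size; ++i) {
            ok = host[i] == static_cast<float>(k);
        }
    }

    for (int k = 0; k < n; ++k) {
        cudaFree(vecs[k]);
        cudaStreamDestroy(streams[k]);
    }
    return ok;
}

int main() {
    const bool ok = fill_vectors_in_streams(4, 1 << 20);
    std::printf("%s\n", ok ? "all vectors filled" : "fill failed");
    return ok ? 0 : 1;
}
```

Whether the kernels actually overlap depends on the device and on how much of the GPU each launch occupies, so any speedup over the default stream would need to be measured.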
Before moving on to the implementation, we should discuss the API for the above use case.
Please comment below if you have any ideas. I will propose a design soon.
Another advantage of non-default streams, as described in https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/, is that data transfers can overlap with kernel execution. However, IMO this isn't really useful for this library: a user may only want to copy a small Vector back to the host, and in that case the time spent creating streams isn't worth it.