Using non-default streams in CUDA

Description of the problem

Currently, in #6, all operations are issued to the default stream. However, I was thinking that we could use non-default streams to launch independent kernels so that they can execute in parallel. One example is filling n vectors in parallel, with fill_vector_kernel launched in n separate streams. Another is filling an n*m matrix with n or m kernels, each launched in its own stream.

Before moving on to the implementation, we can discuss the API for the above use case. Please comment below if you have any ideas. I will come up with a design soon.

One more advantage of non-default streams, as claimed by https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/, is the overlap of data transfers and kernel execution. However, IMO this isn't really useful for this library: a user may want to copy back only a small Vector to the host, and in that case the time spent creating streams isn't worth it.
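To make the first use case concrete, here is a minimal sketch of filling n vectors with one stream per vector. It is only an illustration for the API discussion, not a proposed design: the signature of fill_vector_kernel, the helper name fill_vectors_in_streams and the CUDA_CHECK macro are assumptions made for this example and may not match the code in the library. Only standard CUDA runtime calls are used.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "CUDA error: %s\n",               \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Assumed stand-in for the library's fill_vector_kernel;
// the real signature may differ.
__global__ void fill_vector_kernel(float* data, unsigned size, float value)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        data[i] = value;
    }
}

// Fills n device vectors of `size` elements (size > 0), vector i with the
// value i, using one non-default stream per vector.
// Returns the number of elements that do not hold the expected value.
unsigned fill_vectors_in_streams(unsigned n, unsigned size)
{
    std::vector<cudaStream_t> streams(n);
    std::vector<float*> vectors(n, nullptr);
    const unsigned threads = 256;
    const unsigned blocks = (size + threads - 1) / threads;

    for (unsigned i = 0; i < n; ++i) {
        CUDA_CHECK(cudaStreamCreate(&streams[i]));
        CUDA_CHECK(cudaMalloc((void**)&vectors[i], size * sizeof(float)));
    }

    // Kernels in different streams are not ordered with respect to each
    // other, so the device is allowed (not guaranteed) to overlap them.
    for (unsigned i = 0; i < n; ++i) {
        fill_vector_kernel<<<blocks, threads, 0, streams[i]>>>(
            vectors[i], size, static_cast<float>(i));
        CUDA_CHECK(cudaGetLastError());
    }

    unsigned mismatches = 0;
    std::vector<float> host(size);
    for (unsigned i = 0; i < n; ++i) {
        // Wait only for the stream that owns this vector.
        CUDA_CHECK(cudaStreamSynchronize(streams[i]));
        CUDA_CHECK(cudaMemcpy(host.data(), vectors[i], size * sizeof(float),
                              cudaMemcpyDeviceToHost));
        for (float x : host) {
            if (x != static_cast<float>(i)) {
                ++mismatches;
            }
        }
        CUDA_CHECK(cudaFree(vectors[i]));
        CUDA_CHECK(cudaStreamDestroy(streams[i]));
    }
    return mismatches;
}
```

Calling fill_vectors_in_streams(4, 1024) from main should return 0 on a machine with a CUDA device. Whether the kernels actually overlap depends on the hardware and on how much of the device each launch occupies, so any speedup would need to be measured rather than assumed.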

codezonediitj / adaboost

Using non-default streams in CUDA #2

Example of the problem