Closed J08nY closed 1 year ago
https://leimao.github.io/blog/CUDA-Stream/
https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
The last resource above is particularly interesting.
Using this in our case has to be clever, as we are trying to avoid hitting the GPU memory limit, which constrains how much we can parallelize: if you are simultaneously copying one input chunk to the device, computing on another input chunk, and copying out the output data, you need GPU memory for two input chunks and one output chunk. I think the way forward is to make the chunk size and the number of streams configurable, and then search this space for configurations that are viable (everything fits in memory) and fast (the GPU is saturated the most).
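A minimal sketch of that configuration search, with a deliberately simplified memory model. All names here are hypothetical (not existing pyecsca API), and the model assumes each in-flight stream holds one input chunk plus one output chunk on the device:

```python
# Hedged sketch: enumerate (chunk_size, n_streams) configurations and keep
# those whose peak GPU memory footprint fits a budget. The memory model is
# an assumption: each in-flight stream needs one input chunk + one output chunk.

def peak_memory_bytes(chunk_traces, n_streams, trace_len,
                      in_dtype_size=4, out_dtype_size=4, out_len=1):
    """Estimate peak device memory for a given chunking configuration."""
    in_chunk = chunk_traces * trace_len * in_dtype_size
    out_chunk = chunk_traces * out_len * out_dtype_size
    return n_streams * (in_chunk + out_chunk)

def viable_configs(budget_bytes, trace_len, chunk_sizes, stream_counts):
    """Return configurations that fit in the budget, largest in-flight data
    first, as a crude proxy for how well the GPU would be saturated."""
    fits = [(c, s) for c in chunk_sizes for s in stream_counts
            if peak_memory_bytes(c, s, trace_len) <= budget_bytes]
    return sorted(fits,
                  key=lambda cs: -peak_memory_bytes(cs[0], cs[1], trace_len))

if __name__ == "__main__":
    # Example: 1 GiB budget, traces of 100k float32 samples.
    for chunk, streams in viable_configs(1 << 30, 100_000,
                                         [256, 1024, 4096], [2, 3, 4])[:5]:
        print(chunk, streams, peak_memory_bytes(chunk, streams, 100_000))
```

The actual candidate configurations would then be benchmarked, since the footprint model above says nothing about copy/compute overlap efficiency.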
Currently, unless the HDF5 trace set "inplace" functionality is used, all trace data is loaded into memory, where it is operated on. This puts a limit on the size of the trace set. With HDF5 this loading can be delayed somewhat, but it will likely still happen once the data is read or computed on. Perhaps HDF5, or some other method of streaming the data, could be used to allow operating on large trace sets on both the CPU and GPU, since trace set size appears to be the major bottleneck.
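The streaming idea can be sketched as follows; this is a pure-stdlib stand-in (a raw binary file read in fixed-size pieces) for what HDF5 chunked reads, e.g. via h5py, would provide, and the function names are made up for illustration:

```python
# Hedged sketch of streaming trace data in fixed-size chunks instead of
# loading the whole trace set into memory. A real implementation would read
# HDF5 dataset slices; here a raw binary file of doubles stands in for it.
import array
import os
import tempfile

def stream_chunks(path, samples_per_chunk, typecode="d"):
    """Yield successive chunks of samples from a raw binary file,
    keeping only one chunk in memory at a time."""
    item_size = array.array(typecode).itemsize
    with open(path, "rb") as f:
        while True:
            buf = f.read(samples_per_chunk * item_size)
            if not buf:
                break
            chunk = array.array(typecode)
            chunk.frombytes(buf)
            yield chunk

if __name__ == "__main__":
    # Write a tiny demo "trace" and compute its mean chunk by chunk.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        array.array("d", range(10)).tofile(f)
        path = f.name
    total = count = 0
    for chunk in stream_chunks(path, 4):
        total += sum(chunk)
        count += len(chunk)
    print(total / count)  # mean of 0..9 is 4.5
    os.unlink(path)
```

Computations would need to be expressed as chunk-wise (or streaming/online) operations for this to work, which fits the chunked GPU pipeline above.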