lpisha opened this issue 1 year ago
My question is about the CUDA version of the program, where the memory is allocated using cudaMallocManaged. Memory allocated this way can be accessed by both the CPU and the GPU. However, for the GPU, accessing this memory will be slow. I tested it with Nsight and found that SM utilization is currently less than 5%. Why not use cudaMalloc to allocate space on the GPU and cudaMemcpy to copy the data over? What is the reason for using cudaMallocManaged? Thank you very much.
This is a tracking issue for changes we are implementing to allow the user to specify how much memory an instance of bellhopcxx/bellhopcuda uses, to better support situations where the user wants to run multiple instances at the same time. Your question is not related to this, so it would have been preferable to open a new issue for it.
On modern architectures (Pascal / 10x0 or later), unified virtual memory is generally not much slower, if at all, than explicitly managed memory. The GPU handles page faults and migrates those pages from CPU memory. This does make the kernel wait for those pages to be moved, but if you explicitly copy the memory in advance, you have to wait for those pages to be moved anyway. In some cases UVM could be faster: for example, suppose you have a transmission loss run, where all the rays go in one direction and never influence receivers on the other side. The pages holding the memory for the part of the field which is never touched by the GPU never have to be moved to the GPU at all! If they were explicitly managed, they'd have to be moved to the GPU once at the beginning and then back to the CPU at the end.
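For illustration, here is a minimal sketch (not bellhopcxx code; the kernel and variable names are illustrative) of the on-demand migration described above: the GPU faults pages in as the kernel touches them, and no explicit cudaMemcpy is needed in either direction.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleField(float *field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] *= 2.0f;  // first GPU access faults the page in
}

int main() {
    const int n = 1 << 20;
    float *field = nullptr;
    cudaMallocManaged(&field, n * sizeof(float)); // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) field[i] = 1.0f;  // written on the CPU, no copy step
    scaleField<<<(n + 255) / 256, 256>>>(field, n);
    cudaDeviceSynchronize();  // CPU reads below fault pages back if they migrated
    printf("field[0] = %f\n", field[0]);
    cudaFree(field);
    return 0;
}
```

Pages that the kernel never touches (like the untouched part of the field in the transmission loss example) simply never migrate.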
The reason we chose unified memory for this project is that it is one codebase which can be compiled as C++ or as CUDA. Simply changing the allocation between malloc and cudaMallocManaged is much cleaner to support than giving every data structure separate host and device pointers with explicit copy steps between them.
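As a sketch of what that looks like (the wrapper names are hypothetical, not the actual bellhopcxx API), a single allocation helper can switch on the compiler:

```cpp
// Hypothetical sketch: one allocator for a codebase compiled as either
// plain C++ or CUDA. __CUDACC__ is defined when nvcc compiles the file.
#include <cstddef>
#include <cstdlib>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

template <typename T>
T *allocate(std::size_t count) {
#ifdef __CUDACC__
    T *p = nullptr;
    cudaMallocManaged(&p, count * sizeof(T)); // one pointer, valid on host and device
    return p;
#else
    return static_cast<T *>(std::malloc(count * sizeof(T)));
#endif
}

template <typename T>
void deallocate(T *p) {
#ifdef __CUDACC__
    cudaFree(p);
#else
    std::free(p);
#endif
}
```

Every data structure then carries a single pointer that works on both sides; the explicit-copy alternative would need a host pointer, a device pointer, and synchronization code for each structure.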
Regarding your observed performance: SM utilization of 3.8% will never be caused by unified virtual memory. At worst, UVM might reduce performance in a particular case from 100% to 50% or something like that, but it will not lose 96% of the performance. Furthermore, how the memory was allocated doesn't change how often DRAM is read and written. Excessive DRAM activity means the kernel's own data access patterns are poor relative to the GPU caches, for example because a large region of the field is being read and written and cannot fit in the cache.
How many rays are being traced by the environment file you're using? What GPU are you using?
Thank you for your patience. Perhaps the problem is that I'm using a 1750 graphics card, which may not support newer technology well. There are about 10000 rays. I use the Seamount3Dgaussian env, with Nalpha and Nbeta reduced.
Allow the user to specify parameters controlling the amount of memory used by bellhopcxx/bellhopcuda in all run types.

Currently:
- BELLHOP/BELLHOP3D had an unacceptable default for the amount of memory to use (16 GB). This has been changed to a user-settable parameter and the default reduced.
- There should be one parameter for how much memory the user wants to allow the program to use. It should count the SSP, bathymetry, etc. against this value too.
- Things like the maximum number of eigenray hits should also be adjustable.
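As a sketch of the accounting this implies (the names and the error-handling policy here are assumptions, not the actual implementation), every allocation would be charged against a single user-settable budget:

```cpp
// Hypothetical sketch of a single memory budget that all allocations
// (SSP, bathymetry, ray/eigenray storage, ...) are charged against.
#include <cstddef>
#include <stdexcept>

class MemoryBudget {
public:
    explicit MemoryBudget(std::size_t limitBytes) : limit_(limitBytes) {}

    // Charge a requested allocation against the budget; callers do this
    // before allocating so the total stays under the user's limit.
    void charge(std::size_t bytes) {
        if (used_ + bytes > limit_)
            throw std::runtime_error("requested allocation exceeds the configured memory limit");
        used_ += bytes;
    }

    void release(std::size_t bytes) { used_ -= bytes; }

private:
    std::size_t limit_;
    std::size_t used_ = 0;
};
```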