A-New-BellHope / bellhopcuda

CUDA and C++ port of BELLHOP / BELLHOP3D underwater acoustics simulator
GNU General Public License v3.0

Memory management improvements #15

lpisha opened this issue 1 year ago

lpisha commented 1 year ago

Allow the user to specify parameters controlling the amount of memory used by bellhopcxx / bellhopcuda in all run types.

Currently:

There should be one parameter specifying how much memory the user wants to allow the program to use, and the SSP, bathymetry, and other input data should count against this limit too. Limits such as the maximum number of eigenray hits should also be adjustable.
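As a very rough sketch of what that accounting could look like (BudgetedAlloc, budgetRemaining, and everything else here are hypothetical names for illustration, not actual bellhopcxx / bellhopcuda code):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Remaining budget in bytes, initialized from the user's single memory
// parameter before any allocations are made.
static size_t budgetRemaining = 4ull << 30; // e.g. user allowed 4 GiB

// Charge every allocation (SSP, bathymetry, field, eigenray hits, ...)
// against the budget, and fail cleanly instead of exceeding it.
void *BudgetedAlloc(size_t bytes)
{
    if(bytes > budgetRemaining){
        fprintf(stderr, "Allocation of %zu bytes exceeds remaining budget\n", bytes);
        return nullptr;
    }
    budgetRemaining -= bytes;
    return malloc(bytes);
}
```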

xieziping commented 1 year ago

My question is about the CUDA version of the program: its memory is allocated with cudaMallocManaged, so it can be accessed from both the CPU side and the GPU side. However, the GPU's accesses to this memory will be slow. I tested with Nsight and found that SM utilization is currently less than 5%. Why not use cudaMalloc to allocate space on the GPU and cudaMemcpy to copy the data over? What is the reason for using cudaMallocManaged? Thank you very much.

[Nsight screenshot: SM-Issue]

lpisha commented 1 year ago

This is a tracking issue for changes we are implementing to let the user specify how much memory an instance of bellhopcxx / bellhopcuda uses, to better support situations where the user wants to run multiple instances at the same time. Your question is not related to this, so it would have been better to open a new issue for it.

On modern architectures (Pascal / 10x0 or later), unified virtual memory is generally not much slower, if at all, than explicitly managed memory. The GPU handles page faults and migrates those pages from CPU memory. This does make the kernel wait for those pages to be moved, but if you explicitly copy the memory in advance, you have to wait for those pages to be moved anyway. In some cases UVM could be faster: for example, suppose you have a transmission loss run, where all the rays go in one direction and never influence receivers on the other side. The pages holding the memory for the part of the field which is never touched by the GPU never have to be moved to the GPU at all! If they were explicitly managed, they'd have to be moved to the GPU once at the beginning and then back to the CPU at the end.
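A minimal standalone sketch (not code from this project) of the two allocation styles being compared here, assuming nvcc and any CUDA-capable GPU:

```cpp
#include <cuda_runtime.h>

__global__ void Touch(float *field, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) field[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;

    // Unified (managed) memory: pages migrate to the GPU on demand, so pages
    // the kernel never touches are never transferred at all.
    float *managed;
    cudaMallocManaged(&managed, n * sizeof(float));
    Touch<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize(); // after this, the CPU can read `managed` directly

    // Explicitly managed memory: the whole buffer is copied up front and
    // copied back at the end, whether or not the kernel touched every page.
    float *host = new float[n](), *device;
    cudaMalloc(&device, n * sizeof(float));
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);
    Touch<<<(n + 255) / 256, 256>>>(device, n);
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(managed); cudaFree(device); delete[] host;
    return 0;
}
```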

We chose unified memory for this project because it is one codebase which can be compiled as either C++ or CUDA. Simply switching the allocation between malloc and cudaMallocManaged is much cleaner to support than giving every data structure separate host and device pointers with explicit copy steps.
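A minimal sketch of that pattern (BHC_BUILD_CUDA and the Allocate / Deallocate names are illustrative, not the project's actual code):

```cpp
#include <cstddef>
#include <cstdlib>
#ifdef BHC_BUILD_CUDA
#include <cuda_runtime.h>
#endif

// One allocator used at every call site; only the definition changes
// between the C++ and CUDA builds.
template<typename T> T *Allocate(size_t count)
{
#ifdef BHC_BUILD_CUDA
    // CUDA build: managed memory is usable from both host and device code.
    T *ptr;
    cudaMallocManaged(&ptr, count * sizeof(T));
    return ptr;
#else
    // Pure C++ build: same call sites, ordinary host allocation.
    return (T*)malloc(count * sizeof(T));
#endif
}

template<typename T> void Deallocate(T *ptr)
{
#ifdef BHC_BUILD_CUDA
    cudaFree(ptr);
#else
    free(ptr);
#endif
}
```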

Regarding your observed performance: SM utilization of 3.8% will never be caused by unified virtual memory. At worst, UVM might reduce performance in a particular case from 100% to 50% or so, but it will not lose 96% of the performance. Furthermore, how the memory was allocated doesn't change anything about how often DRAM is read and written. Too much DRAM activity means the data access patterns of the kernel itself are poor relative to the GPU caches, for example because a large region of the field is being read and written and this can't fit in the cache.
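As a standalone illustration of that kind of access pattern (not bellhopcuda code): each thread here adds a contribution at an arbitrary index in a large field, so writes land on scattered cache lines, the working set cannot stay resident in L2, and DRAM traffic is high no matter how the field was allocated.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void ScatteredAdd(float *field, const int *idx, const float *val, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) atomicAdd(&field[idx[i]], val[i]); // scattered read-modify-write
}

int main()
{
    const int n = 1 << 22, fieldSize = 1 << 26; // 256 MB field, far larger than L2
    float *field, *val;
    int *idx;
    cudaMallocManaged(&field, fieldSize * sizeof(float));
    cudaMallocManaged(&val, n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    cudaMemset(field, 0, fieldSize * sizeof(float));
    for(int i = 0; i < n; ++i){ idx[i] = rand() % fieldSize; val[i] = 1.0f; }
    ScatteredAdd<<<(n + 255) / 256, 256>>>(field, idx, val, n);
    cudaDeviceSynchronize();
    cudaFree(field); cudaFree(val); cudaFree(idx);
    return 0;
}
```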

How many rays are being traced by the environment file you're using? What GPU are you using?

xieziping commented 1 year ago

Thank you for your patience. Maybe it's because I'm using a 1750 graphics card, which does not have good support for newer technology. There are about 10000 rays. I use the Seamount3Dgaussian env file, with Nalpha and Nbeta reduced.