A-New-BellHope / bellhopcuda

CUDA and C++ port of BELLHOP / BELLHOP3D underwater acoustics simulator
GNU General Public License v3.0

Memory management improvements #15

lpisha opened this issue 1 year ago

lpisha commented 1 year ago

Allow the user to specify parameters controlling the amount of memory used by bellhopcxx / bellhopcuda in all run types.

Currently:

There should be one parameter specifying how much memory the user wants to allow the program to use, and the SSP, bathymetry, and other input data should count against this limit too. Limits such as the maximum number of eigenray hits should also be adjustable.
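As a very rough sketch of what that accounting could look like (BudgetedAlloc, budgetRemaining, and everything else here are hypothetical names for illustration, not actual bellhopcxx / bellhopcuda code):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Remaining budget in bytes, initialized from the user's single memory
// parameter before any allocations are made.
static size_t budgetRemaining = 4ull << 30; // e.g. user allowed 4 GiB

// Charge every allocation (SSP, bathymetry, field, eigenray hits, ...)
// against the budget, and fail cleanly instead of exceeding it.
void *BudgetedAlloc(size_t bytes)
{
    if(bytes > budgetRemaining){
        fprintf(stderr, "Allocation of %zu bytes exceeds remaining budget\n", bytes);
        return nullptr;
    }
    budgetRemaining -= bytes;
    return malloc(bytes);
}
```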

xieziping commented 1 year ago

My question is about the CUDA version of the program: its memory is allocated with cudaMallocManaged, so it can be accessed from both the CPU side and the GPU side. However, the GPU's accesses to this memory will be slow. I tested with Nsight and found that SM utilization is currently less than 5%. Why not use cudaMalloc to allocate space on the GPU and cudaMemcpy to copy the data over? What is the reason for using cudaMallocManaged? Thank you very much.

[Nsight screenshot: SM-Issue]

lpisha commented 1 year ago

This is a tracking issue for changes we are implementing to let the user specify how much memory an instance of bellhopcxx / bellhopcuda uses, to better support situations where the user wants to run multiple instances at the same time. Your question is not related to this, so it would have been better to open a new issue for it.

On modern architectures (Pascal / 10x0 or later), unified virtual memory is generally not much slower, if at all, than explicitly managed memory. The GPU handles page faults and migrates those pages from CPU memory. This does make the kernel wait for those pages to be moved, but if you explicitly copy the memory in advance, you have to wait for those pages to be moved anyway. In some cases UVM could be faster: for example, suppose you have a transmission loss run, where all the rays go in one direction and never influence receivers on the other side. The pages holding the memory for the part of the field which is never touched by the GPU never have to be moved to the GPU at all! If they were explicitly managed, they'd have to be moved to the GPU once at the beginning and then back to the CPU at the end.
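A minimal standalone sketch (not code from this project) of the two allocation styles being compared here, assuming nvcc and any CUDA-capable GPU:

```cpp
#include <cuda_runtime.h>

__global__ void Touch(float *field, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) field[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;

    // Unified (managed) memory: pages migrate to the GPU on demand, so pages
    // the kernel never touches are never transferred at all.
    float *managed;
    cudaMallocManaged(&managed, n * sizeof(float));
    Touch<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize(); // after this, the CPU can read `managed` directly

    // Explicitly managed memory: the whole buffer is copied up front and
    // copied back at the end, whether or not the kernel touched every page.
    float *host = new float[n](), *device;
    cudaMalloc(&device, n * sizeof(float));
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);
    Touch<<<(n + 255) / 256, 256>>>(device, n);
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(managed); cudaFree(device); delete[] host;
    return 0;
}
```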

We chose unified memory for this project because it is one codebase which can be compiled as either C++ or CUDA. Simply switching the allocation between malloc and cudaMallocManaged is much cleaner to support than giving every data structure separate host and device pointers with explicit copy steps.
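A minimal sketch of that pattern (BHC_BUILD_CUDA and the Allocate / Deallocate names are illustrative, not the project's actual code):

```cpp
#include <cstddef>
#include <cstdlib>
#ifdef BHC_BUILD_CUDA
#include <cuda_runtime.h>
#endif

// One allocator used at every call site; only the definition changes
// between the C++ and CUDA builds.
template<typename T> T *Allocate(size_t count)
{
#ifdef BHC_BUILD_CUDA
    // CUDA build: managed memory is usable from both host and device code.
    T *ptr;
    cudaMallocManaged(&ptr, count * sizeof(T));
    return ptr;
#else
    // Pure C++ build: same call sites, ordinary host allocation.
    return (T*)malloc(count * sizeof(T));
#endif
}

template<typename T> void Deallocate(T *ptr)
{
#ifdef BHC_BUILD_CUDA
    cudaFree(ptr);
#else
    free(ptr);
#endif
}
```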

Regarding your observed performance: SM utilization of 3.8% will never be caused by unified virtual memory. At worst, UVM might reduce performance in a particular case from 100% to 50% or so, but it will not lose 96% of the performance. Furthermore, how the memory was allocated doesn't change anything about how often DRAM is read and written. Too much DRAM activity means the data access patterns of the kernel itself are poor relative to the GPU caches, for example because a large region of the field is being read and written and this can't fit in the cache.
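As a standalone illustration of that kind of access pattern (not bellhopcuda code): each thread here adds a contribution at an arbitrary index in a large field, so writes land on scattered cache lines, the working set cannot stay resident in L2, and DRAM traffic is high no matter how the field was allocated.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void ScatteredAdd(float *field, const int *idx, const float *val, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) atomicAdd(&field[idx[i]], val[i]); // scattered read-modify-write
}

int main()
{
    const int n = 1 << 22, fieldSize = 1 << 26; // 256 MB field, far larger than L2
    float *field, *val;
    int *idx;
    cudaMallocManaged(&field, fieldSize * sizeof(float));
    cudaMallocManaged(&val, n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    cudaMemset(field, 0, fieldSize * sizeof(float));
    for(int i = 0; i < n; ++i){ idx[i] = rand() % fieldSize; val[i] = 1.0f; }
    ScatteredAdd<<<(n + 255) / 256, 256>>>(field, idx, val, n);
    cudaDeviceSynchronize();
    cudaFree(field); cudaFree(val); cudaFree(idx);
    return 0;
}
```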

How many rays are being traced by the environment file you're using? What GPU are you using?

xieziping commented 1 year ago

Thank you for your patience. Maybe it's because I'm using a 1750 graphics card, which does not have good support for newer technology. There are about 10000 rays. I use the Seamount3Dgaussian env file, with Nalpha and Nbeta reduced.