AMReX-Codes / amrex

AMReX: Software Framework for Block Structured AMR
https://amrex-codes.github.io/amrex

Lessen code impact on Box class for GPU code #214

Closed maxpkatz closed 5 years ago

maxpkatz commented 6 years ago

GPU code currently has a large footprint on the Box class, due to every routine that modifies the box coordinates resetting the shared_ptr. This code footprint should be reduced.

maxpkatz commented 6 years ago

The problem here is that we use a std::shared_ptr for the device version of lo and hi. The reason is so that when we initialize one Box from another, it has the same device data, and a separate copy to the device is thus unneeded. But this means every time we change either Box, say due to a growLo() or growHi(), we need to reset the shared_ptr to reflect the fact that they will now have different data.

maxpkatz commented 6 years ago

Since we rely on the default copy constructor, I think we cannot even bite the bullet, use a unique_ptr, and always update the device data on construction, since that requires an explicitly written copy constructor that does the update.

maxpkatz commented 6 years ago

One possibility that I will investigate is to always reinitialize the allocation at the time we call loVect() or hiVect(). It is possible that this will not be that expensive since we are using CArena under the hood for the allocation, so the cost is just the CPU work and not any CUDA API calls.

drummerdoc commented 6 years ago

Is there some way to have it be a plain old Box until just before it gets used in this context, and then a new kind of Box-like object is created that puts the data on the device sorta once and for all during these operations (and then is deleted afterward)? Showing my ignorance here, but it seems like the large footprint comes from trying to shove a square peg in a round hole in terms of what's needed on the device, vs what's useful in all other parts of the code.

PS: A valid answer is "no, and it would take too long to explain basic GPU programming to you to explain why."

maxpkatz commented 6 years ago

Yes, @drummerdoc , that is a valid possibility. In this context, the idea I have is to use the MFIter to store the device copy of the data, and that will automatically handle the scoping for us (Box data will be deleted on MFIter destruction). I also plan to investigate that. The drawback is that it might require a little more end user code modification, since it means that every time we create a Box, the MFIter has to know about it somehow.

drummerdoc commented 6 years ago

Yeah, I was thinking about that part after I hit the 'enter' key. Could be tricky because occasionally we do a fair amount of Box logic inside the loop body. However, if the new thing looks a lot like a box, and if the only time the added stuff is done is the first time a loVect() call is made, e.g., maybe that's ok.

maxpkatz commented 6 years ago

I experimented with the first approach of always reinitializing device memory when we call loVect() but this doesn't work. I think the reason it doesn't work is that if we create a Box from another Box after calling loVect(), and then call loVect() on that new Box, it's going to reset the device memory for the first Box, which might still be needed. (For the same reason, I think it is correct that we would somehow have to do the thing only the first time loVect() is called if we tried Marc's idea.)

I'm not sure how to fix this -- this is a real problem with shared_ptr, but it doesn't appear I can use unique_ptr in Box (it doesn't compile for reasons I don't understand).

maxpkatz commented 6 years ago

I think instead what I'll explore is something I discussed with Weiqun a couple weeks back. If we go back to the 'old' method of passing lo(1), lo(2), lo(3), and hi(1), hi(2), and hi(3) separately, then we can pass them by value to Fortran, and this whole mess is avoided (since copy by value will be handled automatically by CUDA). The main drawback is that it ostensibly makes the Fortran interfaces messier, but Weiqun and I also thought up a solution to this -- we already need to create an interposing wrapper that launches the CUDA kernel (@zingale is going to work on the script that does this). That wrapper will do the work of converting those indices back into arrays, so that the Fortran interfaces remain clean.

drummerdoc commented 6 years ago

That seems like a good approach. I think we're all used to the idea of wrapping those in macros anyway....and thanks for not saying "automagically" :)

maxpkatz commented 5 years ago

Closing. I resolved this by always passing Box indices by value to the kernel, and Kevin improved this by adding the ability to pass Boxes directly to kernels by value.