AaronDonahue opened this issue 2 years ago.
Some other thoughts that could be debated.
- Allow for a partial remap, where a rank stores only part of the (row, col, S) triplets for a given row. This would help with coarsening, where the source data for a single target row may reside on multiple ranks.
- Extend support for packed arrays. Currently all views must be of type Real, so we lose the performance benefit of packing. This may not matter for our GPU runs, which default to packsize=1, but it would have an impact if we increase the pack size or run on CPU.
Currently the horizontal remapper does all of the remapping calculations on Host and then copies the result to Device. Since the data already lives on Device, this means we have a Device->Host->Device workflow. The goal of this task is to improve the implementation so that the full remap is done on Device.
Currently the mapping is broken up into RemapSegments, each of which stores:
- the target column (`row` in the remap file),
- the source columns that contribute to it (`col` in the remap file),
- the corresponding remap weights (`S` in the remap file).

When applying the remap we loop over the set of segments stored on this rank. Furthermore, when applying the remap to a 2D+ view we have to loop over all the non-column indices. For simplicity, and with an eye to getting something working by the deadline, we chose to do everything on Host to avoid repeated Host/Device copies, which severely degraded performance.

Potential solution: we could also store the target columns in a 1-D view, alongside the `col` and `S` data, which is how they are stored in the remap file. We would likely want to keep much of the current structure in place to avoid out-of-memory issues while populating these arrays, but as the last step of remapper initialization we could unpack the segments into these arrays. With three 1-D views on Device, it should be easier to set up a parallel reduce algorithm that runs entirely on Device.
We could also explore a preprocessing step that creates a 1-D view of the source data to take the place of `col` (i.e. the source values gathered in `col` order), so that the Kokkos loop no longer has to look up the data by index. I'm not sure this would actually improve performance, but it is an option.