Improve horizontal remapper to work on Device

Currently the horizontal remapper does all of the remapping calculations on Host and then copies to Device. Since the data is already on Device this means we have a Device->Host->Device workflow. This task is to improve the implementation to do the full remapping on Device.

Currently the mapping is broken up into RemapSegments, each of which stores:

The target column (defined as row in the remap file)
A 1-D view of the global column ids in the source data used for mapping. (defined as col in the remap file)
A 1-D view of the local indices for the source data stored on this rank.
A 1-D view of the weights. (defined as S in the remap file)

When applying the remap we loop over the set of segments stored on this rank. Furthermore, when applying the remap to a view with two or more dimensions we have to loop over all the non-column indices. For simplicity, and in the interest of getting something working by the deadline, we opted to do everything on Host. This avoids repeated Host/Device copies, which slowed performance dramatically.
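For reference, here is a rough sketch of the per-segment loop described above, written with plain `std::vector` instead of Kokkos views so it stands alone. The struct layout, the name `apply_segments`, the `[column][level]` row-major storage, and the use of a local target index are all assumptions for illustration, not the actual remapper code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a RemapSegment; the real class stores Kokkos views.
struct RemapSegment {
  int target_col;                // "row" in the remap file (assumed local index here)
  std::vector<int> source_gids;  // "col" in the remap file (global column ids)
  std::vector<int> source_lids;  // local indices of those source columns on this rank
  std::vector<double> weights;   // "S" in the remap file
};

// Apply every segment to a 2-D field stored row-major as [column][level].
void apply_segments(const std::vector<RemapSegment>& segments,
                    const std::vector<double>& src,
                    std::vector<double>& tgt,
                    std::size_t n_levs) {
  for (const RemapSegment& seg : segments) {
    assert(seg.source_lids.size() == seg.weights.size());
    const std::size_t row = static_cast<std::size_t>(seg.target_col);
    // Loop over the non-column index (here: vertical levels).
    for (std::size_t k = 0; k < n_levs; ++k) {
      double sum = 0.0;
      for (std::size_t i = 0; i < seg.weights.size(); ++i) {
        const std::size_t lid = static_cast<std::size_t>(seg.source_lids[i]);
        sum += seg.weights[i] * src[lid * n_levs + k];
      }
      tgt[row * n_levs + k] = sum;
    }
  }
}
```

The outer loop over segments is the part that currently runs on Host. Each segment owns its own small arrays, which is awkward to hand to a single Device kernel.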

Potential Solution: We could store the target columns in a 1-D view as well, which is how they are currently stored in the data. We would likely want to keep a lot of the current structure in place to avoid out-of-memory issues when populating these arrays. But as the last step of remapper initialization we could unpack the segments to create these arrays. With three 1-D views on Device it should be easier to set up a parallel reduce algorithm that runs entirely on Device.
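A minimal sketch of the unpacking idea, again in plain C++ rather than Kokkos. `SegmentData`, `FlatRemap`, `unpack_segments` and `apply_flat` are hypothetical names, and the three flat arrays are assumed to be the target columns, the local source indices, and the weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, trimmed-down segment (only the fields needed for unpacking).
struct SegmentData {
  int target_col;
  std::vector<int> source_lids;
  std::vector<double> weights;
};

// Three flat arrays, one entry per nonzero weight.
struct FlatRemap {
  std::vector<int> target_cols;
  std::vector<int> source_lids;
  std::vector<double> weights;
};

// Last step of initialization: unpack the segments into flat arrays.
FlatRemap unpack_segments(const std::vector<SegmentData>& segments) {
  FlatRemap flat;
  for (const SegmentData& seg : segments) {
    assert(seg.source_lids.size() == seg.weights.size());
    for (std::size_t i = 0; i < seg.weights.size(); ++i) {
      flat.target_cols.push_back(seg.target_col);
      flat.source_lids.push_back(seg.source_lids[i]);
      flat.weights.push_back(seg.weights[i]);
    }
  }
  return flat;
}

// Serial reference for the reduction over nonzeros; field is [column][level].
// A Device version would need atomic adds or a per-row offset array, because
// several entries accumulate into the same target column.
void apply_flat(const FlatRemap& map, const std::vector<double>& src,
                std::vector<double>& tgt, std::size_t n_levs) {
  std::fill(tgt.begin(), tgt.end(), 0.0);
  for (std::size_t n = 0; n < map.weights.size(); ++n) {
    const std::size_t row = static_cast<std::size_t>(map.target_cols[n]);
    const std::size_t lid = static_cast<std::size_t>(map.source_lids[n]);
    for (std::size_t k = 0; k < n_levs; ++k) {
      tgt[row * n_levs + k] += map.weights[n] * src[lid * n_levs + k];
    }
  }
}
```

Because the segments are unpacked in order, all entries for a given target column end up contiguous in the flat arrays. That contiguity is what would make a per-column reduction straightforward to express.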

We could also explore adding a preprocessing step that creates a 1-D view of the source data to replace col, so that the Kokkos loop won't have to look up data by index. I'm not sure whether this would actually improve performance, but it is an option.
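One way to read the preprocessing idea is as a gather: copy the source values into an array ordered the same way as the weights, so the main loop reads it contiguously. The sketch below assumes that reading. `gather_source` and `apply_gathered` are hypothetical names, and the `[entry][level]` layout is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Gather source values so that entry n of the packed array lines up with weight n.
// This has to be redone whenever the source field changes.
std::vector<double> gather_source(const std::vector<int>& source_lids,
                                  const std::vector<double>& src,
                                  std::size_t n_levs) {
  std::vector<double> packed(source_lids.size() * n_levs);
  for (std::size_t n = 0; n < source_lids.size(); ++n) {
    const std::size_t lid = static_cast<std::size_t>(source_lids[n]);
    for (std::size_t k = 0; k < n_levs; ++k) {
      packed[n * n_levs + k] = src[lid * n_levs + k];
    }
  }
  return packed;
}

// Main loop with no indexed lookup into the source field.
void apply_gathered(const std::vector<int>& target_cols,
                    const std::vector<double>& weights,
                    const std::vector<double>& packed,
                    std::vector<double>& tgt,
                    std::size_t n_levs) {
  assert(target_cols.size() == weights.size());
  assert(packed.size() == weights.size() * n_levs);
  std::fill(tgt.begin(), tgt.end(), 0.0);
  for (std::size_t n = 0; n < weights.size(); ++n) {
    const std::size_t row = static_cast<std::size_t>(target_cols[n]);
    for (std::size_t k = 0; k < n_levs; ++k) {
      tgt[row * n_levs + k] += weights[n] * packed[n * n_levs + k];
    }
  }
}
```

The gather step still performs the same indexed reads, just once up front. It also duplicates any source column that feeds more than one target column, so any gain would have to be measured against the extra memory and copy cost.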
