Break benchmarks out into latency and bandwidth

This would be similar to the approach in "Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect". The bandwidth benchmarks already use cudaEvents to time transfers and compute bandwidth from the result, but we could add an explicit latency measurement, where the transfer size is minimal, alongside a bandwidth measurement, where the transfer size is larger.

We could break this out into two separate benchmarks so the reporting is easier to understand.

c3sr / comm_scope

Break benchmarks out into latency and bandwidth #38